AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Replication of Yolo Loss outside of Darknet #7580

Open vishnubanna opened 3 years ago

vishnubanna commented 3 years ago

@AlexeyAB and @WongKinYiu, I have noticed that not many people have been able to replicate the results of this loss function from scratch outside of the darknet repo, even when using the exact same loss function.

Currently I am working on a re-implementation of all the YOLO models in TensorFlow, and we are facing this issue. We are able to load the checkpoints trained in darknet and verify the results as expected, but we are not able to reach the same results when training from scratch with our current understanding of the loss function from the paper. So if you don't mind, I will link the code we have and explain our current understanding; it would be really helpful if you could point out where we have gone astray. It would also be helpful if you could give some tips for replicating the paper results on COCO 2017 and COCO 2014.

1) due to TensorFlow's dual execution modes (eager and graph), the use of loops has to be limited in order for YOLO to retain its speed, so we need to find workarounds
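As an illustration of the kind of workaround I mean, here is a minimal sketch of replacing a per-box Python loop with broadcasting, so the op stays graph-compatible under `tf.function`. The function name and the `[x_center, y_center, w, h]` box convention are my own assumptions, not code from our repo:

```python
import tensorflow as tf

def pairwise_iou(pred_boxes, true_boxes):
    """IoU between every predicted and every ground-truth box via broadcasting.

    pred_boxes: [N, 4], true_boxes: [M, 4], both as [x_center, y_center, w, h].
    Returns an [N, M] IoU matrix with no Python loops.
    """
    # convert center/size to corner coordinates, with broadcast axes inserted
    p_min = pred_boxes[:, None, :2] - pred_boxes[:, None, 2:] / 2.0  # [N, 1, 2]
    p_max = pred_boxes[:, None, :2] + pred_boxes[:, None, 2:] / 2.0
    t_min = true_boxes[None, :, :2] - true_boxes[None, :, 2:] / 2.0  # [1, M, 2]
    t_max = true_boxes[None, :, :2] + true_boxes[None, :, 2:] / 2.0

    # intersection width/height, clamped at zero for disjoint boxes
    inter_wh = tf.maximum(tf.minimum(p_max, t_max) - tf.maximum(p_min, t_min), 0.0)
    inter = inter_wh[..., 0] * inter_wh[..., 1]                      # [N, M]

    p_area = pred_boxes[:, None, 2] * pred_boxes[:, None, 3]         # [N, 1]
    t_area = true_boxes[None, :, 2] * true_boxes[None, :, 3]         # [1, M]
    return inter / (p_area + t_area - inter + 1e-9)
```

The same broadcasting pattern is what lets the ignore-threshold search below run over every prediction and every ground-truth box at once.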

Objectness

As it stands, my understanding is that you iterate over all pixels in the output related to the objectness and compute the negative (BCE) delta update for all locations. You also iterate over all the ground-truth boxes and search for boxes that a cell may have predicted despite not being explicitly assigned to do so (ignore threshold). If a box is found where the class matches and the IoU is larger than the ignore threshold, then the objectness loss for that cell is set to 0.0. You also have the truth threshold, but it seems that value is not used for COCO. Another detail I noticed is that the loss is also set to 0.0 if the model predicts a value of NaN.

Next you iterate over all the boxes and use the x, y index of the ground truth to index the model's output, selectively applying the positive BCE loss only to the cells where an object should have been predicted.

So you have 4 cases:

1. No box should be predicted, and no box was predicted -> apply the negative BCE update.
2. No box should be predicted, but a box was predicted, and it matches nothing in the ground truth -> apply the negative BCE update.
3. No box should be predicted, but a box was predicted, and it matches something in the ground truth -> no update.
4. A box should be predicted -> regardless of the ignore threshold, apply the positive BCE update.
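My understanding of these four cases collapses into a single multiplicative mask. This is a sketch under my own naming (`obj_gt`, `best_iou`, and the function itself are hypothetical, assuming the per-prediction best IoU against all ground truths has already been computed), not code from darknet or our repo:

```python
import tensorflow as tf

def objectness_mask(obj_gt, best_iou, ignore_thresh=0.7):
    """Combine the four objectness cases into one loss mask.

    obj_gt:   [B, H, W, A] float, 1.0 where a ground-truth box is assigned.
    best_iou: [B, H, W, A] float, each prediction's best IoU over all GT boxes.
    Returns 1.0 wherever the BCE objectness delta applies: cases 1, 2, and 4
    get a loss; case 3 (unassigned but IoU above the threshold) is zeroed.
    """
    ignore = tf.cast(best_iou > ignore_thresh, obj_gt.dtype)  # case-3 candidates
    # assigned cells always contribute; unassigned ones only when not ignored
    return obj_gt + (1.0 - obj_gt) * (1.0 - ignore)
```

The mask is then multiplied into the per-cell BCE, which is what `obj_mask` does in the objectness-loss snippet further down.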

Classification

For only the locations where a box exists, we apply the BCE loss over the classes at those locations; all other grid cells are ignored.

Boxes

For only the locations where a box exists, we apply the IoU loss or the MSE loss.

Finally, for the actual loss function (as opposed to the delta function), all the losses are summed together for each sample, and the loss is averaged across the batch.

I have implemented both the loss from Ultralytics and the loss from Darknet (as closely as possible), and I can link lines for each section in our code.

Objectness Ignore thresholding Search

https://github.com/PurdueCAM2Project/TensorFlowModels/blob/7e96fab3954de3843b8beaff9dc9d633fbc3b30c/yolo/losses/yolo_loss.py#L456

Objectness Loss:

https://github.com/PurdueCAM2Project/TensorFlowModels/blob/7e96fab3954de3843b8beaff9dc9d633fbc3b30c/yolo/losses/yolo_loss.py#L517

# BCE over the objectness logit, masked to the cells selected above
bce = ks.losses.binary_crossentropy(
    K.expand_dims(true_conf, axis=-1), pred_conf, from_logits=True)
conf_loss = math_ops.mul_no_nan(obj_mask, bce)
conf_loss = math_ops.rm_nan_inf(conf_loss, val=0.0)
conf_loss = tf.cast(
    tf.reduce_sum(conf_loss, axis=(1, 2, 3)), dtype=y_pred.dtype)

We compute the BCE loss along the anchor axis and multiply by the mask generated by the grid search, in order to keep the loss only for grid locations where a box should exist, or where the search has indicated that the predicted box's IoU is below the ignore threshold for all of the ground-truth boxes.

Class Loss

https://github.com/PurdueCAM2Project/TensorFlowModels/blob/7e96fab3954de3843b8beaff9dc9d633fbc3b30c/yolo/losses/yolo_loss.py#L506

# build the ground truth grid for BCE
true_class = self.build_grid(inds, true_class, pred_class, ind_mask)

# count the number of classes in a cell, used for box loss accumulation
counts = true_class
counts = tf.reduce_sum(counts, axis=-1, keepdims=True)
reps = tf.gather_nd(counts, inds, batch_dims=1)

# compute the loss
class_loss = ks.losses.binary_crossentropy(
          K.expand_dims(true_class, axis=-1),
          K.expand_dims(pred_class, axis=-1),
          label_smoothing=self._label_smoothing,
          from_logits=True)
class_loss = tf.reduce_sum(class_loss, axis=-1)

# keep only the loss for cells where an object should exist
class_loss = math_ops.mul_no_nan(grid_mask, class_loss)
class_loss = math_ops.rm_nan_inf(class_loss, val = 0.0)

# accumulate loss for this sample via sum 
class_loss = tf.cast(
    tf.reduce_sum(class_loss, axis=(1, 2, 3)), dtype=y_pred.dtype)

Box loss:

https://github.com/PurdueCAM2Project/TensorFlowModels/blob/7e96fab3954de3843b8beaff9dc9d633fbc3b30c/yolo/losses/yolo_loss.py#L474

# select boxes from prediction
pred_box = math_ops.mul_no_nan(ind_mask,
                                   tf.gather_nd(pred_box, inds, batch_dims=1))
# compute loss
iou, liou, box_loss = self.box_loss(true_box, pred_box)

# remove loss from padded indexes 
box_loss = math_ops.mul_no_nan(tf.squeeze(ind_mask, axis=-1), box_loss)

# accumulate results of iou_thresh > 0.213
box_loss = math_ops.divide_no_nan(box_loss, reps)

# sum of loss over all objects
box_loss = tf.cast(tf.reduce_sum(box_loss, axis=1), dtype=y_pred.dtype)

final loss aggregation:

# apply loss weights
box_loss *= self._iou_normalizer
class_loss *= self._cls_normalizer
conf_loss *= self._obj_normalizer

# apply sum all losses
loss = box_loss + class_loss + conf_loss

# average of loss over all batches
loss = tf.reduce_mean(loss)

Observations at resolution 512x512

visualization of the darknet objectness grid without the ultralytics alterations (scale_xy = same as YOLOv4):

(images: sample without mosaic, sample with mosaic)

visualization of the grid with the ultralytics alterations (scale_xy = 2.0):

(images: two samples without mosaic, one sample with mosaic)

Let me know if there are any other details I can provide. Also, my apologies for any grammar errors.

vishnubanna commented 3 years ago

Note A: it seems that the sigmoid applied to the boxes is not backpropagated as part of the delta, and the box decoding from model output to actual output is also not propagated backwards. So the sigmoid, the application of scale_xy, and the box decoding are not propagated.

The sigmoid is still up in the air and I may be wrong, but propagating the box decoding as most autodiff libraries would do leads to the backpropagated gradient being scaled by both the size of the anchor and the size of the output. That makes the gradients really small and causes overfitting sooner. Currently I am testing this.
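For reference, the way I am testing "decode in the forward pass, identity in the backward pass" is a straight-through trick with `tf.stop_gradient`. This is my own sketch of the idea (the function name is mine, and whether darknet effectively does this is exactly what I am asking), using the usual scale_xy decode for the x, y offsets:

```python
import tensorflow as tf

def decode_xy_no_grad_scaling(raw_xy, scale_xy=2.0):
    """Decode x, y offsets but pass the gradient straight through.

    Forward:  y = scale_xy * sigmoid(raw) - 0.5 * (scale_xy - 1)
    Backward: d(loss)/d(raw) is taken as if the decode were the identity,
    so the sigmoid/anchor scaling never shrinks the gradient.
    """
    decoded = scale_xy * tf.sigmoid(raw_xy) - 0.5 * (scale_xy - 1.0)
    # straight-through: forward value is `decoded`, gradient is that of `raw_xy`
    return raw_xy + tf.stop_gradient(decoded - raw_xy)
```

The same pattern can wrap the full box decoding (anchor and stride scaling included) to drop those factors from the gradient without changing the forward values.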

Dropping the gradient computation for all of these aspects led to a 1.5% increase in AP50 at only 33% of training completed, so roughly from 59% to 60.5%. If the model does not overfit, I can expect to see that AP50 increase further when the first learning rate drop occurs, at about 80% of training completed.

Previously the model would start to overfit at about 32% of training completed, so the fact that it has not yet overfit is slightly positive.

@AlexeyAB, if you could comment on the sigmoid and where in the yolo layer its gradient is acknowledged, that would be helpful. Also, it seems that for the CSP model the sigmoid for classes and objectness is acknowledged twice, as the sigmoid operation seems to be built into the delta function of yolo_layer.c.

bulatnv commented 3 years ago

@vishnubanna

Hello Vishnu.

I'm also interested in a TensorFlow implementation of YOLO (YOLOv4, CSP-YOLO). Is your project open? Can someone else participate too?

Looking forward to hearing from you. Kind regards, Bulat.