AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Focal loss for Yolov3 #1003

Open jiqiyang opened 6 years ago

jiqiyang commented 6 years ago

@AlexeyAB You gave an implementation of focal loss in the yolo layer (6056b83#diff-180a7a56172e12a8b79e41ec95ae569dR121).

Question 1

Is this the full code of focal loss for Yolo v3?

According to the focal loss paper (https://arxiv.org/pdf/1708.02002.pdf), page 9, formula (10), (1 - pt) should be raised to the γ-th power. However, on line 129 of yolo_layer.c (https://github.com/AlexeyAB/darknet/blob/master/src/yolo_layer.c), (1 - pt) is not raised to the 2nd power. Why is this?
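For reference, the focal loss definition and the derivative that formula (10) gives for the sigmoid case are, in my transcription (with p_t the predicted probability of the ground-truth class, γ the focusing parameter, and y ∈ {±1}):

    \mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)
    \frac{d\,\mathrm{FL}}{dx} = y\,(1 - p_t)^{\gamma}\,\bigl(\gamma\, p_t \log(p_t) + p_t - 1\bigr)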

Question 2

YOLOv3 seems not to use softmax, only the sigmoid function, for multi-class classification. Your focal loss implementation was written exactly for the sigmoid case, right?

Thank you:)

AlexeyAB commented 6 years ago
  1. You must take the derivative dF/dx to get the delta, and part of the expression is already raised to the power 2 once the derivative is combined with the (truth - prediction) factor that the code multiplies in; see the sketch after this list.

  2. By default Yolo v2 (softmax) and v3 (logistic sigmoid) use the same cross-entropy, so the focal loss is the same for both cases too. Only for multi-label training will it not work properly: https://github.com/AlexeyAB/darknet/blob/17520296c730c7d7e2683452b11bf50fc8959688/src/yolo_layer.c#L115-L119
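If I read delta_yolo_class() at that commit correctly, the arithmetic for the target class is roughly this (a sketch, with γ = 2 hard-coded and α the balancing factor):

    \text{grad} = -(1 - p_t)\,\bigl(2\, p_t \log p_t + p_t - 1\bigr)
    \Delta = \alpha\,(1 - p_t)\,\text{grad} = -\alpha\,(1 - p_t)^{2}\,\bigl(2\, p_t \log p_t + p_t - 1\bigr)

The extra (1 - p_t) is the usual (truth - prediction) cross-entropy delta that the loop multiplies in, so the full expression matches formula (10) with γ = 2, up to the α factor and darknet's sign convention for deltas.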

jiqiyang commented 6 years ago

@AlexeyAB

  1. So, should Line129 (https://github.com/AlexeyAB/darknet/blob/17520296c730c7d7e2683452b11bf50fc8959688/src/yolo_layer.c#L129) be changed from

    float grad = -(1 - pt) * (2 * pt*logf(pt) + pt - 1);

    to

    float grad = - pow((1 - pt), 2) * (2 * pt*logf(pt) + pt - 1);

    ?

  2. What do you mean by saying "Only for multi-label training will it not work properly"? What should I do to make it work properly?

Thanks!

AlexeyAB commented 6 years ago

So, should Line 129 be changed from the first expression to the second?

No. See my first answer above: the second (1 - pt) factor comes from the (truth - prediction) term that the code multiplies in afterwards.

What do you mean by saying "Only for multi-label training will it not work properly"? What should I do to make it work properly?

Add focal loss for this part of code: https://github.com/AlexeyAB/darknet/blob/17520296c730c7d7e2683452b11bf50fc8959688/src/yolo_layer.c#L115-L119
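A minimal sketch of what that could look like, assuming the early-return branch of delta_yolo_class() at that commit; the expression mirrors the existing single-label focal branch (alpha = 0.5, gamma = 2 hard-coded), but this is my sketch, not tested code:

    #include <math.h>

    /* Sketch only, not code from the repo: a focal-scaled delta for one extra
     * class in the multi-label early-return branch of delta_yolo_class()
     * (yolo_layer.c lines 115-119 at commit 17520296). alpha = 0.5 and the
     * hard-coded gamma = 2 follow the existing single-label branch. */
    static float focal_multilabel_delta(float p)
    {
        float alpha = 0.5f;
        float pt = p + 0.000000000000001F;                      /* avoid log(0) */
        /* same expression as the single-label branch; the (1 - p) factor of
         * the cross-entropy delta below restores the full (1 - pt)^2 */
        float grad = -(1 - pt) * (2 * pt*logf(pt) + pt - 1);
        return (1 - p) * alpha * grad;
    }

    /* The early-return branch would then become something like:
     *     if (delta[index]) {
     *         int ti = index + stride*class_id;
     *         delta[ti] = focal_multilabel_delta(output[ti]);
     *         if (avg_cat) *avg_cat += output[ti];
     *         return;
     *     }
     */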

kmsravindra commented 6 years ago

@AlexeyAB , I guess the focal loss implemented by you in this repo is applicable only to the classification loss?

AlexeyAB commented 6 years ago

@kmsravindra Yes. Only for the classification part of the Yolo detector.

kmsravindra commented 6 years ago

@AlexeyAB , I just finished reading this excellent, recently published paper. According to it, LRM (loss rank mining) performs well on Yolo (up to a 2-2.5% increase in mAP is reported in the paper). Is there any chance you could implement this for Yolo v3? The paper also gives a very clear explanation of why focal loss is ineffective for Yolo v2 (and maybe for Yolo v3 as well, if it uses similar loss functions to Yolo v2). It looks like LRM, rather than focal loss, is the way to go for one-stage detectors...

The implementation seems straightforward and goes something like this (described on the 4th page of the paper): order the losses of the final feature map in descending order, and do not back-propagate the very confident predictions (back-propagate the losses only for the less confident ones). A mask matrix is built for the final feature-map outputs, where the elements corresponding to low-confidence outputs are set to 1 and those corresponding to high-confidence outputs are set to 0. This matrix is then multiplied element-wise with the actual feature-map outputs (essentially filtering out easy examples and preserving hard ones), and only the losses for the hard ones are back-propagated. A rough sketch of that masking is below.
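Under the assumptions above (per-prediction losses already computed; no per-class grouping yet, which a later comment discusses), a minimal C sketch of that mask construction might look like this; all names are hypothetical, not existing darknet functions:

    #include <stdlib.h>
    #include <string.h>

    /* Sketch only, under my reading of the paper: build the LRM 0/1 mask.
     * loss[i] is an already-computed per-prediction loss for each of the n
     * predictions in the final feature map (e.g. 13*13*5 = 845 for Yolo v2
     * at 416x416); K is the number of hard examples to keep. */
    static int cmp_desc(const void *a, const void *b)
    {
        float fa = *(const float*)a, fb = *(const float*)b;
        return (fa < fb) - (fa > fb);               /* sort descending */
    }

    void lrm_build_mask(const float *loss, float *mask, int n, int K)
    {
        int i, kept = 0;
        if (K >= n) { for (i = 0; i < n; ++i) mask[i] = 1.f; return; }
        if (K <= 0) { for (i = 0; i < n; ++i) mask[i] = 0.f; return; }
        /* the K-th largest loss becomes the hard-example threshold */
        float *sorted = (float*)malloc(n * sizeof(float));
        memcpy(sorted, loss, n * sizeof(float));
        qsort(sorted, n, sizeof(float), cmp_desc);
        float thresh = sorted[K - 1];
        free(sorted);
        /* 1 = hard example (back-propagated), 0 = easy example (ignored);
         * multiplying this mask element-wise into the deltas zeroes the
         * gradients of everything after the K-th ranked prediction */
        for (i = 0; i < n; ++i)
            mask[i] = (loss[i] >= thresh && kept++ < K) ? 1.f : 0.f;
    }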

AlexeyAB commented 6 years ago

@kmsravindra Thanks!

kmsravindra commented 6 years ago

@AlexeyAB, Thanks... Here are my notes on your comments -

To implement LRM - should we just calculate the summary loss (of x, y, w, h, objectness and each class) for each anchor in the final feature map (for Yolo v2 at 416x416 this will be 13x13x5 = 845 values), then sort these 845 values, and set zero for each after the K-th value? (where K = 64, 128 or 256)

Yes, this was my understanding too. But first we need to group these 845 values by the class that each prediction represents, then sort by loss value within each class, and then zero out everything after the K-th value in each group. (That is what I understood from the following sentence in the paper: "Reorganize and group output elements in the final feature map according to which predictions that they represent".) This K value is the first hyper-parameter.

Also, should we use NMS during training, and how should we use it?

The paper didn't describe the exact NMS approach, though it was hinted at in several places later. I too didn't fully understand whether the new NMS approach mentioned in this paper is in addition to, or in lieu of, the existing NMS approach already used by YOLO... So, as per my understanding,

  1. This new NMS is applied in addition to the existing NMS.
  2. It is done to select the highest-loss box, not the highest confidence score, and only during the training phase, after the prediction ranking, so as to suppress multiple boxes of the same class from the same region: only the highest-loss box is kept, and the rest are suppressed for hard training. (If my understanding is correct, this new NMS does the opposite of the usual NMS, which is typically done to select the highest-confidence box.) The reasoning for my thinking is this passage from the paper: "non-maximum suppression (NMS) [2] is used after all predictions being ranked, as co-located predictions with high Intersection-over-Union (IoU) serve similar functions during backpropagation and selecting them as hard examples for multiple times is meaningless." Does that make sense?

Another useful reference on NMS from the paper: "The first hyperparameter is the number of hard examples. This hyperparameter represents the number of predictions remaining for backpropagation. The other hyperparameter is the NMS threshold, which is the limitation we put on the final predictions. The NMS method is used to remove some redundant information. More specifically, when the IoU (Intersection-over-Union) of two predictions belonging to the same class is equal to or larger than the threshold, only the one with the higher loss value should be retained. When a smaller NMS threshold is used, a stricter limitation is introduced and more redundant information is removed."

Apart from these two references, I didn't find much information on the exact NMS approach.
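For what it's worth, here is a minimal sketch of how I read that loss-based NMS. It assumes darknet's box type and box_iou() from src/box.c; the pred struct and nms_by_loss() are hypothetical names:

    #include "box.h"   /* darknet's box type and box_iou() from src/box.c */

    /* Sketch only, under the reading above: loss-based NMS that keeps the
     * HIGHER-loss prediction among same-class, co-located boxes (the
     * opposite of ordinary confidence-based NMS). */
    typedef struct {
        box b;              /* x, y, w, h */
        int class_id;
        float loss;         /* per-prediction loss, already computed */
        int suppressed;     /* set to 1 when removed */
    } pred;

    void nms_by_loss(pred *p, int n, float iou_thresh)
    {
        int i, j;
        for (i = 0; i < n; ++i) {
            if (p[i].suppressed) continue;
            for (j = i + 1; j < n; ++j) {
                if (p[j].suppressed || p[j].class_id != p[i].class_id) continue;
                if (box_iou(p[i].b, p[j].b) >= iou_thresh) {
                    /* co-located duplicates: keep only the hardest example */
                    if (p[i].loss >= p[j].loss) p[j].suppressed = 1;
                    else { p[i].suppressed = 1; break; }
                }
            }
        }
    }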

kmsravindra commented 6 years ago

@AlexeyAB, Any chance this will be implemented soon?

AlexeyAB commented 6 years ago

@kmsravindra I think not soon. I do not see a way to do it quickly; there are several ambiguities and uncertainties that would require a lot of trial and error.

But I am now busy with other tasks.

kmsravindra commented 6 years ago

@AlexeyAB , Sure, thanks

BretGreat commented 5 years ago

@AlexeyAB I am very grateful for the work you have done. Now I want to add LRM to Yolo v3 for trademark detection; it's urgent for me. If you have made any progress on the implementation of LRM, I hope you can help me. Thank you very much.

BretGreat commented 5 years ago

@kmsravindra I've read these comments from you and AlexeyAB; it's awesome. I'm also interested in the implementation of LRM. If you have made any progress, I hope you can help me, because I want to add LRM to Yolo v3 for trademark detection. Best wishes.