AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Comparison metrics: YoloV3 vs YoloV4 #5928

Open Hyvenos opened 4 years ago

Hyvenos commented 4 years ago

I was asked to compute metrics for YoloV3 (the one from pjreddie) and YoloV4 (the one from your repo). The metrics are Precision, Recall and F1-score: I ran darknet detector test ... over a set of images, wrote the detections to a file, parsed it, and compared the results with my label files, so I can count the true positives, false positives and false negatives. I first checked the results on a cleaned version of COCO's 2014 validation set: some bounding boxes have been corrected and there are only 12 classes instead of 80. For some reason I only computed the metrics for 9 of them. Here are the results (only the Total, which is an average of the score for each class):
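To be concrete, here is a minimal sketch of the counting logic (greedy per-class matching at a fixed IoU threshold). The data structures and helper names are illustrative only, not the exact code I used:

```python
# Illustrative sketch of the TP/FP/FN counting, not the exact code used.
# Boxes are (x1, y1, x2, y2); each detection carries a class id and a confidence score.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def count_tp_fp_fn(detections, ground_truths, conf_thresh=0.25, iou_thresh=0.5):
    """Greedy matching of detections to ground truths of the same class."""
    dets = sorted((d for d in detections if d["conf"] >= conf_thresh),
                  key=lambda d: d["conf"], reverse=True)
    matched = set()
    tp = fp = 0
    for d in dets:
        best, best_iou = None, 0.0
        for i, g in enumerate(ground_truths):
            if i in matched or g["cls"] != d["cls"]:
                continue
            v = iou(d["box"], g["box"])
            if v > best_iou:
                best, best_iou = i, v
        if best is not None and best_iou >= iou_thresh:
            tp += 1
            matched.add(best)
        else:
            fp += 1
    fn = len(ground_truths) - len(matched)
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```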

To be sure I also ran these metrics on the original COCO's 2014 validation set:

I'm surprised that YoloV4 doesn't perform much better than YoloV3, and it even gets a lower score when the threshold increases. I thought YoloV3 was trained on COCO (the weights we can download on pjreddie's website); I assumed it was also the case for YoloV4, is that right? I'm not aware of how YoloV4 has been trained. My feeling is that YoloV4 is less confident in its detections, so increasing the threshold lowers its recall. Another point that surprised me is that YoloV4 is more likely to outperform YoloV3 as the resolution increases (the gap between them is wider at 608 than at 320). I would be grateful if someone has an explanation for these metrics!

WongKinYiu commented 4 years ago

@Hyvenos Hello,

YOLOv3 is trained with width=416, height=416, and random=1. It means YOLOv3 uses sizes 320~608 for training.

YOLOv4 is trained with width=512, height=512, and random=1. It means YOLOv4 uses sizes 384~736 for training.

So YOLOv4 has never seen images of size 320. If you want to compare YOLOv3 and YOLOv4, you should use images with sizes 384~608.
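As a rough illustration (assuming random=1 simply picks network input sizes that are multiples of 32 within the ranges above), the sizes each model saw during training, and their overlap, look like this:

```python
# Rough illustration only: assume random=1 picks network sizes that are multiples
# of 32 inside the ranges stated above (320~608 for YOLOv3, 384~736 for YOLOv4).
v3_sizes = set(range(320, 608 + 1, 32))   # sizes YOLOv3 saw during training
v4_sizes = set(range(384, 736 + 1, 32))   # sizes YOLOv4 saw during training

overlap = sorted(v3_sizes & v4_sizes)     # sizes both models saw: 384, 416, ..., 608
print(overlap)
```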

Or you can download the YOLOv4 model trained with width=416, height=416, and random=1 and test the performance again. Here are the cfg and weights.

And yes, due to the ground truth assignment strategy (iou_thresh), the confidence scores will be lower after applying NMS.

AlexeyAB commented 4 years ago

so I can count the true positives, false positives and false negatives.


Another point that surprised me is that YoloV4 is more likely to outperform YoloV3 as the resolution increases (the gap between them is wider at 608 than at 320)


I first checked the results on a cleaned version of COCO's 2014 validation set: some bounding boxes have been corrected and there are only 12 classes instead of 80. For some reason I only computed the metrics for 9 of them. Here are the results (only the Total, which is an average of the score for each class):



I'm not aware of how YoloV4 has been trained. My feeling is that YoloV4 is less confident in its detections, so increasing the threshold lowers its recall.

Hyvenos commented 4 years ago

Do you use the MS COCO dataset that you strongly modified yourself for testing? Is it train, minval5k, or test-dev? Is it 2014 or 2017? What kind of labels did you change, how exactly did you change the labels for iscrowd=1, and why do you think your changes are correct? Which particular 12 classes did you keep, and for which particular 8 classes did you measure? Why exactly these 8 classes?

I only used it for the first four metrics I posted; the last two were computed on the set found here: pjreddie's COCO mirror. I used the 2014 val images with the YOLO-style labels found on the same page. For the modified version of the dataset, and to answer your questions:

The optimal confidence threshold is different for different models, so everyone uses the generally accepted metrics AP or AP50, which are independent of the confidence threshold. If you test some non-YOLO networks with -thresh 0.25 you can get very bad Precision values, because they use only the class probability without the objectness. Therefore such a comparison is unfair in general. But it can help us find some kind of weak spot in YOLOv4 (if there really is one) if you provide more information.

Ok, I thought the confidence threshold only depended on the training, not the model: the longer the network has been trained, the more confident it would be in its detections. Interesting to know that's not the case. I understand the need for a threshold-independent metric to compare networks; however, from a user's point of view, Precision/Recall and F1 are easy to understand and can estimate the reliability of the network.
It's also useful to find the threshold sweet spot for each class so we can fine-tune software using it.
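For example, a rough way to find that sweet spot is to sweep the confidence threshold per class and keep the one with the best F1, reusing the illustrative count_tp_fp_fn / precision_recall_f1 helpers from the sketch in my first post:

```python
# Sweep confidence thresholds for one class and keep the best-F1 threshold.
# Reuses the illustrative count_tp_fp_fn / precision_recall_f1 helpers above.
def best_threshold(detections, ground_truths, thresholds=None, iou_thresh=0.5):
    thresholds = thresholds or [t / 100 for t in range(5, 100, 5)]
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        tp, fp, fn = count_tp_fp_fn(detections, ground_truths,
                                    conf_thresh=t, iou_thresh=iou_thresh)
        _, _, f1 = precision_recall_f1(tp, fp, fn)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```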

@AlexeyAB @WongKinYiu I computed the metrics using the YOLOv4 model trained at 416; here are the results on COCO's 2014 validation set from pjreddie's website:


| Size | Conf. threshold | Model | Precision | Recall | F1 |
|------|-----------------|-------|-----------|--------|----|
| 320 | 0.25 | V4 | 0.7137899536093372 | 0.6091952066453326 | 0.6573579679558884 |
| 320 | 0.25 | V3 | 0.6847751451323804 | 0.585218266065114 | 0.6310944915863655 |
| 320 | 0.50 | V4 | 0.8574143549965157 | 0.4146932661537414 | 0.5590155320335284 |
| 320 | 0.50 | V3 | 0.8444987199117266 | 0.4637842990399673 | 0.5987469701597822 |
| 320 | 0.80 | V4 | 0.961395283416944 | 0.349043371689249 | 0.5121470584562359 |
| 320 | 0.80 | V3 | 0.9465013941313708 | 0.33975965139238784 | 0.5000275563527413 |
| 416 | 0.25 | V4 | 0.7217224717898366 | 0.6656362769796419 | 0.6925456873513468 |
| 416 | 0.25 | V3 | 0.696054533039018 | 0.5968713828555866 | 0.6426586807031949 |
| 416 | 0.50 | V4 | 0.869007889556816 | 0.5575951521753932 | 0.6793124257336427 |
| 416 | 0.50 | V3 | 0.8344230149805326 | 0.5026962620004085 | 0.6274104902666863 |
| 416 | 0.80 | V4 | 0.9608090918916222 | 0.3934397766732485 | 0.5582733326409403 |
| 416 | 0.80 | V3 | 0.9350402513916929 | 0.37683325389800504 | 0.5371771882529646 |
| 608 | 0.25 | V4 | 0.7106426918283781 | 0.6794580241029482 | 0.6947005691014463 |
| 608 | 0.25 | V3 | 0.6559079204452294 | 0.6182848777830735 | 0.6365409520338012 |
| 608 | 0.50 | V4 | 0.8561879297173415 | 0.5646864574113162 | 0.6805354974234418 |
| 608 | 0.50 | V3 | 0.6772416263970569 | 0.527687751072377 | 0.5931834968064384 |
| 608 | 0.80 | V4 | 0.952515653044176 | 0.39205419758970517 | 0.5554754330833994 |
| 608 | 0.80 | V3 | 0.8990011183214528 | 0.40229795056853 | 0.5558542476604412 |

The results seem more consistent with your answers. I chose to compute with three different thresholds so we can see that many of V4's detections have a confidence lower than x, where 0.5 <= x < 0.8, as we can notice a clear decrease in the scores between threshold 50% and threshold 80%. But it still outperforms V3, even at a lower resolution.

What AP and AP50 do you get for these 9 classes by using the CodaLab server for YOLOv3 and YOLOv4?

I could compute them later, but since the V4 I'm testing is the one I got from your repo, I suppose I'll get the same scores as yours, and I can't compute them for only these 9 classes, only for the whole 80 it has been trained on. Or am I missing something?

AlexeyAB commented 4 years ago

I computed the metrics using the YOLOv4 model trained at 416; here are the results on COCO's 2014 validation set from pjreddie's website:

I could compute them later, but since the V4 I'm testing is the one I got from your repo, I suppose I'll get the same scores as yours, and I can't compute them for only these 9 classes, only for the whole 80 it has been trained on. Or am I missing something?

The pycocotools and CodaLab output ("View scoring output log") contains all metrics for each class, and then the average values at the end.
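If it helps, per-class AP / AP50 can also be computed locally with pycocotools by restricting the evaluation to the chosen category ids. A rough sketch (the annotation and result file names are placeholders):

```python
# Sketch: per-class COCO evaluation with pycocotools (file names are placeholders).
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("instances_val2014.json")          # ground-truth annotations
coco_dt = coco_gt.loadRes("yolov4_results.json")  # detections in COCO results format

cat_ids = coco_gt.getCatIds(catNms=["person", "car", "bicycle", "dog", "motorcycle",
                                    "backpack", "handbag", "suitcase", "bus", "truck"])

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.params.catIds = cat_ids   # evaluate only the selected classes
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()               # prints AP, AP50, AP75, etc. for these classes
```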

We changed the following: we removed all labels except person, car, bicycle, dog, motorcycle, backpack, handbag, suitcase, bus, and truck, and we added face and profile labels.

Ok.

If you added face and profile labels, then these classes can't be compared on CodaLab.

WongKinYiu commented 4 years ago

@AlexeyAB @Hyvenos I just made a table so it can be compared easily.

(attached image: YOLOv3 vs YOLOv4 comparison table)

Hyvenos commented 4 years ago

What IoU threshold did you use to distinguish whether it is a TP or FP?

I always use a threshold of 50%, but I can change it if it helps to better understand the networks' behaviour. However, I wanted to avoid overloading the metrics, which are already huge.
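If it helps, re-running the counting at a stricter IoU is a one-line change with the illustrative helpers from my first post:

```python
# Re-run the same counting at IoU 0.5 and a stricter 0.75 to see how the
# TP/FP split shifts (reuses the illustrative helpers from the first sketch).
for iou_t in (0.5, 0.75):
    tp, fp, fn = count_tp_fp_fn(detections, ground_truths,
                                conf_thresh=0.25, iou_thresh=iou_t)
    print(iou_t, precision_recall_f1(tp, fp, fn))
```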

The pycocotools and CodaLab output ("View scoring output log") contains all metrics for each class, and then the average values at the end.

Ok, I will take a look so I can compare with mine. I would like to say that my code is error-free; however, it is difficult to verify without external checks. What I can say, however, is that the results (inferences) have always been consistent with the metrics, so if there are errors, they should be negligible.

@WongKinYiu Thank you, my format is indeed barely legible; I will attach an image next time.

AlexeyAB commented 4 years ago

What IoU threshold did you use to distinguish whether it is a TP or FP?

I always use a threshold of 50%

Your current test shows approximately the same improvement of v4 vs v3 as AP50 on MS COCO test-dev, for 608x608:

While in your test, delta = V4 - V3, for confidence_threshold = 0.25, 0.5:

But, for a fair comparison, we must compare:

  1. Precision at the same Recall for the two models (set a different confidence_threshold for each model)
  2. Recall at the same Precision for the two models (set a different confidence_threshold for each model)
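For point 1, something like this sketch would do (precision_recall(model, conf_thresh) is a hypothetical helper that returns (Precision, Recall) for a model at a given confidence threshold):

```python
# Sketch of point 1: find the confidence threshold for model B that gives roughly
# the same Recall as model A at its chosen threshold, then compare Precisions.
# precision_recall(model, conf_thresh) is a hypothetical helper returning (P, R).

def precision_at_matched_recall(model_a, model_b, thresh_a,
                                candidate_thresholds, precision_recall):
    p_a, r_a = precision_recall(model_a, thresh_a)
    # pick the threshold for model B whose Recall is closest to model A's Recall
    t_b = min(candidate_thresholds,
              key=lambda t: abs(precision_recall(model_b, t)[1] - r_a))
    p_b, r_b = precision_recall(model_b, t_b)
    return (p_a, r_a), (p_b, r_b), t_b
```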

But thank you for paying attention to the decreasing difference at high thresholds (at small Recalls).
