Comparison metrics: YoloV3 vs YoloV4

Hyvenos commented 4 years ago

I was asked to perform metrics over YoloV3 (the one from pjreddie), YoloV4 (the one from your repo) The metrics performed are the Precision, Recall and F1-score: I ran darknet detector test ... over a set of images and output the results on a file, which will be parsed and I compare the results with the one I have on my labels files, so I can count the true positives, false positives and false negatives. I first check the results over a cleaned version of COCO's 2014 validation set: some bounding boxes have been corrected and there're only 12 classes instead of 80. For some reasons I only computed the metrics for 9 of them. Here are the results (only the Total which is an average of the score for each class):

608, threshold 25 Precision | Recall | F1-score V3: 0,61549185982171 | 0,671530201992737 | 0,621010022501951 V4: 0,663202060693532 | 0,736413819590893 | 0,687925807514964
608, threshold 80 Precision | Recall | F1-score V3: 0,868105009571517 | 0,436010148612272 | 0,537643580019121 V4: 0,901355203221005 | 0,424318766470518 | 0,549078141383253
320, threshold 25 Precision | Recall | F1-score V3: 0,640359897272839 | 0,587524655365978 | 0,595209573385479 V4: 0,6430859793567 | 0,603819317535369 | 0,611881290125521
320, threshold 80 Precision | Recall | F1-score V3: 0,893357616827859 | 0,365742855878961 | 0,48084793618182 V4: 0,897015515037863 | 0,337629701197596 | 0,459542788643878

To be sure I also ran these metrics on the original COCO's 2014 validation set:

320, threshold 25 Precision | Recall | F1-score V3: 0.6847751451323804 | 0.585218266065114 | 0.6310944915863655 V4: 0.7189399898309946 | 0.563205555933819 | 0.6316147148712133
320, threshold 80 Precision | Recall | F1-score V3: 0.9465013941313708 | 0.33975965139238784 | 0.5000275563527413 V4: 0.962967383520323 | 0.3021345407503234 | 0.459956154671393

I'm surprised that YoloV4 doesn't perform much better than yoloV3, having even lower score when threshold increase. I though YoloV3 has been trained using COCO (the one we can download on pjreddie website), I assumed it was also the case for YoloV4, is it the case? I'm not aware of how YoloV4 has been trained. My feeling is that YoloV4 is less confident with its detections, thus increasing the threshold lower its recall. Another point which surprised me is that YoloV4 is more likely to outperform yoloV3 as resolution increase (the gap between them is wider in 608 than in 320) If would be grateful if someone as some explanation about these metrics!

WongKinYiu commented 4 years ago

@Hyvenos Hello,

YOLOv3 is trained with width=416, height=416, and random=1. It means YOLOv3 uses 320\~608 for training.

YOLOv4 is trained with width=512, height=512, and random=1. It means YOLOv4 uses 384\~736 for training.

So YOLOv4 never seen images with size 320, If you want to compare YOLOv3 and YOLOv4, you should use images with size 384\~608.

Or you can download YOLOv4 model which is trained with width=416, height=416, and random=1 and test the performance again. Here are cfg and weights.

And yes, due to the ground truth assignment strategy (iou_thresh), the confidence scores will be lower after apply NMS.

AlexeyAB commented 4 years ago

so I can count the true positives, false positives and false negatives.

It is generally accepted to use a test-dev CodaLab-server https://competitions.codalab.org/competitions/20794#participate to test models on Microsoft COCO dataset, instead of your own scripts, to avoid gross errors and manipulations https://github.com/AlexeyAB/darknet/wiki/How-to-evaluate-accuracy-and-speed-of-YOLOv4

Another point which surprised me is that YoloV4 is more likely to outperform yoloV3 as resolution increase (the gap between them is wider in 608 than in 320)

default weights-file of YOLOv4 was traied with 384-736 network resolution, while YOLOv3 with 320-608, so the the optimal network resolution is 384-736 for YOLOv4, and 320-608 for YOLOv3.

I first check the results over a cleaned version of COCO's 2014 validation set: some bounding boxes have been corrected and there're only 12 classes instead of 80. For some reasons I only computed the metrics for 9 of them. Here are the results (only the Total which is an average of the score for each class):

Do you use the strongly modified by yourself MS COCO dataset for testing? Is it a train, minval5k, test-dev? Is it 2014 or 2017? What kind of labels did you change, how exactly did you change the labels for the iscrowd=1, why do you think your changes are correct? Which particular 12 classes did you leave, and for which particular 8 classes did you measure? Why exactly these 8 classes?

What AP and AP50 do you get for these 9 classes by using CodaLab server for Yolov3 and Yolov4?

threshold 80 I'm not aware of how YoloV4 has been trained. My feeling is that YoloV4 is less confident with its detections, thus increasing the threshold lower its recall.

The optimal parameters of the confidence threshold are different for different models, so everyone uses generally accepted metrics AP or AP50, which are independent of the confidence threshold. If you will test some non-Yolo networks with -thresh 0.25 you can get very-very bad Precision values, because they use only Class-probability without Objectness. Therefore such your comparison is unfair in general. But this can help us find some kind of weak spot in Yolov4 (if it really is) if you provide more information.

Hyvenos commented 4 years ago

Do you use the strongly modified by yourself MS COCO dataset for testing? Is it a train, minval5k, test-dev? Is it 2014 or 2017? What kind of labels did you change, how exactly did you change the labels for the iscrowd=1, why do you think your changes are correct? Which particular 12 classes did you leave, and for which particular 8 classes did you measure? Why exactly these 8 classes?

I only used it for the four first metrics I posted, the last two are used on the set found here: pjreddie's coco mirror. I used the 2014 Val images with the Yolo style labels found on the same page. For the modified version of the dataset and to answer your questions:

I took into account 9 classes not 8, I made a mistake in my previous post
This is a modified version of the same val set than above (val 2014) , the only point using it at first was that I didn't have the original one on my server, I didn't expected yolov3 and yolov4 to show their best performance on it because of the modifications but I though it would be enough to compare them.
We changed the following: removing all labels except: person, car, bicycle, dog, motorcycle, backpack, handbag, suitcase, bus and truck, and we added a face and profile labels.
We changed some bounding boxes so they can be more accurate (less blank space around the object), remove undesirable boudingbox like arms labeled as person, or group of persons labeled as one person (same for cars). I don't pretend that our corrections are correct nor better, they're just more suitable for our usecases when training our yolos on it.
When I say I measured only for 9 classes, I mean ran the detections with all the classes, and I only took into account these 9 classes when computing the average of the scores. I skipped the two classes we added as yoloV3 and yoloV4 don't know them and the motorcycle because we have a mismatch between the name in COCO.names and the one in our .names we use for our 12 classes' Yolo.

The optimal parameters of the confidence threshold are different for different models, so everyone uses generally accepted metrics AP or AP50, which are independent of the confidence threshold. If you will test some non-Yolo networks with -thresh 0.25 you can get very-very bad Precision values, because they use only Class-probability without Objectness. Therefore such your comparison is unfair in general. But this can help us find some kind of weak spot in Yolov4 (if it really is) if you provide more information.

Ok, I though the confidence threshold only depends on the training, not the model. The longer the network would have been trained and the more confident it will be with its detections, interesting to know it's not the case. I understand the need of a threshold-proof metrics to compare networks, however in a user point of view, Precision/Recall and F1 are easy to understand and can estimate the reliability of the network.
It's also useful to find the threshold sweatspot for each classe so we can fine tune software using it.

@AlexeyAB @WongKinYiu I computed the metrics using the yoloV4 trained in 416, here are the results on COCO's val 2014 from pjreddie website:


      Precision                     |   Recall;                                |   F1
320; 
threshold 0,25;
V4: 0.7137899536093372 |   0.6091952066453326       |   0.6573579679558884
V3: 0.6847751451323804  |   0.585218266065114        |   0.6310944915863655
threshold 0,50;
V4: 0.8574143549965157  | 0.4146932661537414       |   0.5590155320335284
V3: 0.8444987199117266  | 0.4637842990399673       |   0.5987469701597822
threshold 0,80:
V4: 0.961395283416944    |  0.349043371689249        |   0.5121470584562359
V3: 0.9465013941313708  |   0.33975965139238784   |   0.5000275563527413

416; 
threshold 0,25;
V4: 0.7217224717898366  |    0.6656362769796419    |    0.6925456873513468
V3: 0.696054533039018    |     0.5968713828555866    |   0.6426586807031949
threshold 0,50;
V4: 0.869007889556816    |    0.5575951521753932     |    0.6793124257336427
V3: 0.8344230149805326  |    0.5026962620004085     |    0.6274104902666863
threshold 0,80;
V4: 0.9608090918916222 |     0.3934397766732485     |    0.5582733326409403
V3: 0.9350402513916929 |     0.37683325389800504   |    0.5371771882529646

608; 
threshold 0,25;
V4: 0.7106426918283781 |     0.6794580241029482     |    0.6947005691014463
V3: 0.6559079204452294 |     0.6182848777830735     |    0.6365409520338012
threshold 0,50;
V4: 0.8561879297173415 |     0.5646864574113162    |     0.6805354974234418
V3: 0.6772416263970569 |     0.527687751072377      |     0.5931834968064384
threshold 0,80;
V4: 0.952515653044176  |     0.39205419758970517   |     0.5554754330833994
V3: 0.8990011183214528 |    0.40229795056853        |  0.5558542476604412

The results seems more consistent according to your answers. I choose to compute with three differents threshold so we can see that many of V4's detection have a confidence lower than x, where 0,5<=x<0,8 as we can notice a clear decrease on the score between threshold 50 and threshold 80%. But it still outperform the V3 even with a lower resolution.

What AP and AP50 do you get for these 9 classes by using CodaLab server for Yolov3 and Yolov4?

I could compute them later but as the V4 I'm testing on is the one I got from your repo, I suppose I'll get the same scores as yours as I can't compute them for only these 9 classes but only for the whole 80 it has been trained for. Or am I missing something?

AlexeyAB commented 4 years ago

I computed the metrics using the yoloV4 trained in 416, here are the results on COCO's val 2014 from pjreddie website: Precision | Recall; | F1

What IoU-threshold did you use to distinguish is it TP or FP?
Are you sure you have no errors in your own code for accuracy evaluation?

I could compute them later but as the V4 I'm testing on is the one I got from your repo, I suppose I'll get the same scores as yours as I can't compute them for only these 9 classes but only for the whole 80 it has been trained for. Or am I missing something?

Pycocotool and CodaLab output View scoring output log contains all metrics for each class, and then average values in the end.

We changed the following: removing all labels except: person, car, bicycle, dog, motorcycle, backpack, handbag, suitcase, bus and truck, and we added a face and profile labels.

Ok.

If you added a face and profile labels then these classes can't be compared by CodaLab.

WongKinYiu commented 4 years ago

@AlexeyAB @Hyvenos just make a table to make it can be easy comapred.

Hyvenos commented 4 years ago

What IoU-threshold did you use to distinguish is it TP or FP?

I always use a threshold of 50%, but I can change it if it can helps to better understand the networks' behaviour. However I wanted to avoid overloading the metrics which are already huge.

Pycocotool and CodaLab output View scoring output log contains all metrics for each class, and then average values in the end.

Ok I will take a look so i can compare with mines. I would like to say that my code is error prone, however it difficult to verify without external checks. What I can say however is that the results (inferences) had always been consistent with the metrics, so if there're errors on them it should be negligible.

@WongKinYiu Thanks you, my format is indeed barely lisible, I would attach an image next time.

AlexeyAB commented 4 years ago

What IoU-threshold did you use to distinguish is it TP or FP?

I always use a threshold of 50%

You current test shows approximately the same improvements of v4 vs v3 as AP50 on test-dev MSCOCO, for 608x608:

V3 - 33.0% AP - 57.9% AP50
V4 - 43.5% AP (+10.5) - 65.7% AP50 (+7.8)

While in your test, delta = V4 - V3, for confidence_threshold = 0.25, 0.5:

Precision: +5.5, +17.9
Recall: +6.1, +3.6
F1: +5.9, +8.8

But, for a fair comparison, we must compare:

Precisions with the same Recall of two models (set different confidence_threshold for two models)
Recall with the same Precisions of two models (set different confidence_threshold for two models)

But thank you for paying attention to the decreasing difference at high thrasholds (at small Recalls).

84509233-faa05d00-acf5-11ea-8c93-72006cb59bc8

AlexeyAB / darknet

Comparison metrics: YoloV3 vs YoloV4 #5928