Hyvenos opened 4 years ago
@Hyvenos Hello,

YOLOv3 is trained with width=416, height=416, and random=1, which means YOLOv3 uses sizes 320~608 for training.

YOLOv4 is trained with width=512, height=512, and random=1, which means YOLOv4 uses sizes 384~736 for training.

So YOLOv4 has never seen images of size 320. If you want to compare YOLOv3 and YOLOv4, you should use images of size 384~608.
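The size ranges above can be reproduced with a small sketch, assuming (my own inference from the numbers quoted here, not darknet's actual source) that random=1 scales the base network size by roughly 1.4x in both directions and rounds up to the detector stride of 32:

```python
import math

def multiscale_range(base, factor=1.4, stride=32):
    """Approximate the min/max network sizes used with random=1.

    Assumes darknet jitters the base size by ~1.4x both ways and
    rounds up to a multiple of the stride (32). The 1.4 factor is an
    assumption inferred from the 320~608 and 384~736 ranges."""
    lo = math.ceil(base / factor / stride) * stride
    hi = math.ceil(base * factor / stride) * stride
    return lo, hi

print(multiscale_range(416))  # YOLOv3 default -> (320, 608)
print(multiscale_range(512))  # YOLOv4 default -> (384, 736)
```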
Or you can download the YOLOv4 model trained with width=416, height=416, and random=1 and test the performance again. Here are the cfg and weights.
And yes, due to the ground-truth assignment strategy (iou_thresh), the confidence scores will be lower after applying NMS.
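As a reference point for the NMS being discussed, here is a minimal single-class greedy NMS sketch (my own toy version, not darknet's actual implementation):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(dets, iou_thresh=0.45):
    """Greedy NMS over a list of (box, score): keep the highest-scoring
    box, suppress any remaining box overlapping it above iou_thresh."""
    kept = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(box, k) < iou_thresh for k, _ in kept):
            kept.append((box, score))
    return kept
```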
> so I can count the true positives, false positives and false negatives.

> Another point which surprised me is that YoloV4 is more likely to outperform YoloV3 as resolution increases (the gap between them is wider at 608 than at 320).

> I first checked the results over a cleaned version of COCO's 2014 validation set: some bounding boxes have been corrected and there are only 12 classes instead of 80. For some reason I only computed the metrics for 9 of them. Here are the results (only the Total, which is an average of the scores for each class): … threshold 80 …

> I'm not aware of how YoloV4 has been trained. My feeling is that YoloV4 is less confident in its detections, so increasing the threshold lowers its recall.
If you test some non-Yolo networks with -thresh 0.25 you can get very bad Precision values, because they use only the class probability, without objectness. Therefore such a comparison is unfair in general. But this can help us find some kind of weak spot in Yolov4 (if there really is one) if you provide more information. Do you use the MS COCO dataset you strongly modified yourself for testing? Is it train, minival5k, or test-dev? Is it 2014 or 2017? Which labels did you change, how exactly did you change the labels for iscrowd=1, and why do you think your changes are correct? Which particular 12 classes did you keep, and for which particular 8 classes did you measure? Why exactly these 8 classes?
I only used it for the first four metrics I posted; the last two were computed on the set found here: pjreddie's coco mirror. I used the 2014 val images with the YOLO-style labels found on the same page. For the modified version of the dataset, and to answer your questions:
> The optimal parameters of the confidence threshold are different for different models, so everyone uses the generally accepted metrics AP or AP50, which are independent of the confidence threshold. If you test some non-Yolo networks with -thresh 0.25 you can get very bad Precision values, because they use only the class probability without objectness. Therefore such a comparison is unfair in general. But this can help us find some kind of weak spot in Yolov4 (if there really is one) if you provide more information.
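The objectness point can be illustrated with a toy sketch (my own example, hypothetical function name; in YOLO the final confidence is objectness x class probability, so thresholding on class probability alone keeps low-objectness boxes):

```python
def passes_threshold(objectness, class_prob, thresh=0.25, use_objectness=True):
    """Decide if a detection survives -thresh, with or without objectness."""
    score = objectness * class_prob if use_objectness else class_prob
    return score > thresh

# A background-ish box: the class head is sure it would be a 'car'
# if there were an object, but objectness says there probably isn't.
print(passes_threshold(0.1, 0.9))                        # False: 0.09 < 0.25
print(passes_threshold(0.1, 0.9, use_objectness=False))  # True: an extra false positive
```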
Ok, I thought the confidence threshold only depends on the training, not the model. I assumed the longer a network has been trained, the more confident it will be in its detections; interesting to know that's not the case.
I understand the need for threshold-independent metrics to compare networks; however, from a user's point of view, Precision/Recall and F1 are easy to understand and can estimate the reliability of the network.
It's also useful for finding the threshold sweet spot for each class, so we can fine-tune software that uses it.
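That per-class sweet spot can be found with a simple sweep over candidate thresholds, keeping the one that maximizes F1. A toy sketch, assuming each detection has already been labeled TP or FP (function and parameter names are mine):

```python
def best_f1_threshold(scored, n_gt, steps=100):
    """scored: list of (confidence, is_true_positive) for one class.
    n_gt: number of ground-truth boxes for that class.
    Returns (threshold, f1) for the highest F1 found in the sweep."""
    best_t, best_f1 = 0.0, 0.0
    for i in range(steps + 1):
        t = i / steps
        kept = [is_tp for conf, is_tp in scored if conf >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = n_gt - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```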
@AlexeyAB @WongKinYiu I computed the metrics using the YOLOv4 model trained at 416; here are the results on COCO's 2014 validation set from the pjreddie website:
(Values rounded to four decimal places.)

| Resolution | Threshold | Model | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- |
| 320 | 0.25 | V4 | 0.7138 | 0.6092 | 0.6574 |
| 320 | 0.25 | V3 | 0.6848 | 0.5852 | 0.6311 |
| 320 | 0.50 | V4 | 0.8574 | 0.4147 | 0.5590 |
| 320 | 0.50 | V3 | 0.8445 | 0.4638 | 0.5987 |
| 320 | 0.80 | V4 | 0.9614 | 0.3490 | 0.5121 |
| 320 | 0.80 | V3 | 0.9465 | 0.3398 | 0.5000 |
| 416 | 0.25 | V4 | 0.7217 | 0.6656 | 0.6925 |
| 416 | 0.25 | V3 | 0.6961 | 0.5969 | 0.6427 |
| 416 | 0.50 | V4 | 0.8690 | 0.5576 | 0.6793 |
| 416 | 0.50 | V3 | 0.8344 | 0.5027 | 0.6274 |
| 416 | 0.80 | V4 | 0.9608 | 0.3934 | 0.5583 |
| 416 | 0.80 | V3 | 0.9350 | 0.3768 | 0.5372 |
| 608 | 0.25 | V4 | 0.7106 | 0.6795 | 0.6947 |
| 608 | 0.25 | V3 | 0.6559 | 0.6183 | 0.6365 |
| 608 | 0.50 | V4 | 0.8562 | 0.5647 | 0.6805 |
| 608 | 0.50 | V3 | 0.6772 | 0.5277 | 0.5932 |
| 608 | 0.80 | V4 | 0.9525 | 0.3921 | 0.5555 |
| 608 | 0.80 | V3 | 0.8990 | 0.4023 | 0.5559 |
The results seem more consistent according to your answers. I chose to compute with three different thresholds so we can see that many of V4's detections have a confidence lower than x, where 0.5 ≤ x < 0.8, as there is a clear drop in the scores between the 0.50 and 0.80 thresholds. But V4 still outperforms V3 even at a lower resolution.
What AP and AP50 do you get for these 9 classes by using CodaLab server for Yolov3 and Yolov4?
I could compute them later, but since the V4 I'm testing is the one I got from your repo, I suppose I'll get the same scores as yours, as I can't compute them for only these 9 classes but only for the whole 80 it was trained on. Or am I missing something?
> I computed the metrics using the YOLOv4 trained at 416; here are the results on COCO's 2014 val from the pjreddie website: Precision | Recall | F1
What IoU threshold did you use to decide whether a detection is a TP or an FP?
Are you sure you have no errors in your own code for accuracy evaluation?
> I could compute them later, but since the V4 I'm testing is the one I got from your repo, I suppose I'll get the same scores as yours, as I can't compute them for only these 9 classes but only for the whole 80 it was trained on. Or am I missing something?
Pycocotools and the CodaLab output (View scoring output log) contain all metrics for each class, and then the average values at the end.
> We changed the following: removing all labels except person, car, bicycle, dog, motorcycle, backpack, handbag, suitcase, bus and truck, and we added face and profile labels.
Ok. If you added face and profile labels, then these classes can't be compared on CodaLab.
@AlexeyAB @Hyvenos just make a table so it can be easily compared.
> What IoU threshold did you use to decide whether a detection is a TP or an FP?
I always use a threshold of 50%, but I can change it if that helps to better understand the networks' behaviour. However, I wanted to avoid overloading the metrics, which are already huge.
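For reference, the TP/FP/FN counting discussed here can be sketched roughly like this (my own minimal version, not the actual evaluation code: detections are matched greedily, highest confidence first, to the best still-unmatched ground-truth box, and count as TP at IoU >= 0.5; leftover ground truths are FN):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def count_tp_fp_fn(dets, gts, iou_thresh=0.5):
    """dets: list of (box, confidence); gts: list of boxes (one class).
    Returns (tp, fp, fn)."""
    matched = set()
    tp = fp = 0
    for box, _ in sorted(dets, key=lambda d: d[1], reverse=True):
        best_i, best_iou = None, iou_thresh
        for i, gt in enumerate(gts):
            if i not in matched and iou(box, gt) >= best_iou:
                best_i, best_iou = i, iou(box, gt)
        if best_i is not None:
            matched.add(best_i)  # each ground truth matches at most once
            tp += 1
        else:
            fp += 1
    return tp, fp, len(gts) - len(matched)
```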
> Pycocotools and the CodaLab output (View scoring output log) contain all metrics for each class, and then the average values at the end.
Ok, I will take a look so I can compare with mine. I would like to say that my code is error-free; however, it is difficult to verify without external checks. What I can say is that the results (inferences) have always been consistent with the metrics, so if there are errors in them, they should be negligible.
@WongKinYiu Thank you, my format is indeed barely legible; I will attach an image next time.
> What IoU threshold did you use to decide whether a detection is a TP or an FP?

> I always use a threshold of 50%
Your current test shows approximately the same improvement of v4 vs v3 as AP50 on test-dev MSCOCO, for 608x608:
While in your test, delta = V4 - V3, for confidence_threshold = 0.25, 0.5:
But, for a fair comparison, we must compare:
But thank you for paying attention to the decreasing difference at high thresholds (at small Recalls).
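For illustration, the shrinking V4 - V3 gap can be read off the 608x608 F1 values posted earlier in this thread (rounded to four decimals; the arithmetic is mine):

```python
# F1 at 608x608, from the results posted above (V4 trained at 416)
f1 = {
    0.25: {"v4": 0.6947, "v3": 0.6365},
    0.50: {"v4": 0.6805, "v3": 0.5932},
    0.80: {"v4": 0.5555, "v3": 0.5559},
}
for thresh in sorted(f1):
    delta = f1[thresh]["v4"] - f1[thresh]["v3"]
    print(f"confidence_threshold {thresh:.2f}: delta F1 = {delta:+.4f}")
```

The gap is largest at threshold 0.50 and essentially vanishes (slightly negative, in fact) at 0.80.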
I was asked to perform metrics over YoloV3 (the one from pjreddie) and YoloV4 (the one from your repo). The metrics performed are Precision, Recall and F1-score. I ran
`darknet detector test ...`
over a set of images and wrote the results to a file, which is then parsed and compared against my label files, so I can count the true positives, false positives and false negatives.

I first checked the results over a cleaned version of COCO's 2014 validation set: some bounding boxes have been corrected and there are only 12 classes instead of 80. For some reason I only computed the metrics for 9 of them. Here are the results (only the Total, which is an average of the scores for each class; values rounded to four decimal places):

| Resolution | Threshold | Model | Precision | Recall | F1-score |
| --- | --- | --- | --- | --- | --- |
| 608 | 0.25 | V3 | 0.6155 | 0.6715 | 0.6210 |
| 608 | 0.25 | V4 | 0.6632 | 0.7364 | 0.6879 |
| 608 | 0.80 | V3 | 0.8681 | 0.4360 | 0.5376 |
| 608 | 0.80 | V4 | 0.9014 | 0.4243 | 0.5491 |
| 320 | 0.25 | V3 | 0.6404 | 0.5875 | 0.5952 |
| 320 | 0.25 | V4 | 0.6431 | 0.6038 | 0.6119 |
| 320 | 0.80 | V3 | 0.8934 | 0.3657 | 0.4808 |
| 320 | 0.80 | V4 | 0.8970 | 0.3376 | 0.4595 |
To be sure, I also ran these metrics on the original COCO 2014 validation set:

| Resolution | Threshold | Model | Precision | Recall | F1-score |
| --- | --- | --- | --- | --- | --- |
| 320 | 0.25 | V3 | 0.6848 | 0.5852 | 0.6311 |
| 320 | 0.25 | V4 | 0.7189 | 0.5632 | 0.6316 |
| 320 | 0.80 | V3 | 0.9465 | 0.3398 | 0.5000 |
| 320 | 0.80 | V4 | 0.9630 | 0.3021 | 0.4600 |
I'm surprised that YoloV4 doesn't perform much better than YoloV3, and even scores lower as the threshold increases. I thought YoloV3 was trained on COCO (the one we can download from the pjreddie website); I assumed the same for YoloV4 — is that the case? I'm not aware of how YoloV4 has been trained. My feeling is that YoloV4 is less confident in its detections, so increasing the threshold lowers its recall. Another point which surprised me is that YoloV4 is more likely to outperform YoloV3 as resolution increases (the gap between them is wider at 608 than at 320). I would be grateful if someone has an explanation for these metrics!