farleylai closed this issue 4 years ago
This could explain why YOLOv4 does not produce the same validation results on both but a significantly better mAP on val2017.
Yes, it seems that YOLOv4 shows worse AP on val2017 but better AP on test-dev if we use COCO 2017 with crowd=0.
http://cocodataset.org/#detection-2019
Participants are recommended but not restricted to train their algorithms on COCO 2017 train and val sets.
We trained YOLOv4 only on the train set (without val), while the COCO 2017 Task Guidelines recommend training on the train + val5k datasets.
So we used:
Yes, using train+5k and COCO 2017 instead of 2014 can increase YOLOv4 accuracy.
It is surprising that YOLO does not follow the same convention given Ross Girshick is also one of the authors in the YOLOv1 paper. Any clarifications to address the confusion would be welcome.
Ross Girshick used train+val for training Faster R-CNN and tested on the test-dev set, without a separate train/val split, so the splitting is not important: https://arxiv.org/pdf/1506.01497v3.pdf
It seems that most other networks were trained on the COCO trainval dataset for testing on test-dev: https://pjreddie.com/darknet/yolo/
Thanks for the elaboration, but the issue seems not to have been made clear at the beginning.
The motivation is the evaluation on `val2017` using `yolov4.weights`, shown below, which looks too good to be true, implying that YOLO's train/val split is quite different from COCO 2017's. After further investigation, it is also NOT the same `trainval35k` that other related work follows.
```
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.506
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.739
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.564
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.321
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.573
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.644
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.369
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.607
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.660
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.495
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.729
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.801
```
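As an aside, the summary above is plain text printed by pycocotools' `COCOeval.summarize()`. A small, purely illustrative parser (not part of pycocotools) makes it easy to pull individual numbers out for comparing runs across splits:

```python
import re

def parse_coco_summary(text):
    """Parse the stdout of pycocotools' COCOeval.summarize() into a dict.

    Keys are (metric, iou, area, max_dets) tuples; values are floats.
    This is an illustrative helper, not an official pycocotools API.
    """
    pattern = re.compile(
        r"Average (Precision|Recall)\s+\((AP|AR)\)\s+"
        r"@\[\s*IoU=([\d.:]+)\s*\|\s*area=\s*(\w+)\s*\|\s*maxDets=\s*(\d+)\s*\]"
        r"\s*=\s*([\d.-]+)"
    )
    results = {}
    for line in text.splitlines():
        m = pattern.search(line)
        if m:
            _, metric, iou, area, max_dets, value = m.groups()
            results[(metric, iou, area, int(max_dets))] = float(value)
    return results

# Three lines taken from the val2017 evaluation quoted above.
summary = """
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.506
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.739
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.801
"""
metrics = parse_coco_summary(summary)
print(metrics[("AP", "0.50:0.95", "all", 100)])  # 0.506
```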
The `trainval35k` split is from ION, earlier than Faster R-CNN (both involved Ross Girshick). It is understandable if the goal is to evaluate on `test-dev`: combining `trainval` is acceptable, but it is not a common practice nowadays because a 5k `minival` is handy for the various ablation experiments in recent object detection work, including Mask R-CNN, RetinaNet, Cascade R-CNN and more. Also, the relatively small 5k `minival` is unlikely to contribute significantly to the performance on `test-dev`. Therefore, training on `trainval35k` from COCO 2014 or simply `train` from COCO 2017 makes it straightforward to compare different models, especially when reimplemented by independent third parties such as mmdetection and detectron2 following the same train/val splits. Hence, the question is simply why YOLO makes a DIFFERENT `trainval35k` to train on from COCO 2014. Alternatively, why not simply train YOLOv4 on the COCO 2017 `train` set, which would be more consistent with recent work?
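A split mismatch like this can be checked mechanically: treat each validation split as a set of image ids and diff them. The sketch below uses toy filenames as stand-ins for the real 5k lists:

```python
def split_report(val_a, val_b):
    """Compare two validation splits given as collections of image ids.

    Returns whether they are identical, the ids unique to each side,
    and the overlap count. Toy helper for illustration only.
    """
    a, b = set(val_a), set(val_b)
    return {
        "identical": a == b,
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        "overlap": len(a & b),
    }

# Hypothetical ids standing in for COCO 2017 val vs YOLO's 5k.txt split.
minival = ["000001.jpg", "000002.jpg", "000003.jpg"]
yolo_5k = ["000002.jpg", "000003.jpg", "000004.jpg"]
report = split_report(minival, yolo_5k)
print(report["identical"])  # False
print(report["overlap"])    # 2
```

The same set arithmetic also reveals leakage: if one detector's training split intersects another detector's validation split, comparing them on that validation set is unfair.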
From Mask R-CNN:

> As in previous work [5, 27], we train using the union of 80k train images and a 35k subset of val images (`trainval35k`), and report ablations on the remaining 5k val images (`minival`). We also report results on test-dev [28].
From RetinaNet:

> We present experimental results on the bounding box detection track of the challenging COCO benchmark [21]. For training, we follow common practice [1, 20] and use the COCO `trainval35k` split (union of 80k images from train and a random 35k subset of images from the 40k image val split). We report lesion and sensitivity studies by evaluating on the `minival` split (the remaining 5k images from val). For our main results, we report COCO AP on the test-dev split.
BTW, there is little concern about the crowd setting here, since the COCO evaluation seems to ignore it for fair comparison:

> For the purpose of evaluation, areas marked as crowds will be ignored and not affect a detector's score. Details are given in the appendix.
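To illustrate the quoted rule, here is a deliberately simplified recall computation (not the real COCOeval, which additionally discards detections matched to crowd regions) where `iscrowd=1` ground truths are excluded from the targets, so missing them neither helps nor hurts:

```python
def naive_recall(gts, det_boxes, iou_thresh=0.5):
    """Toy recall mirroring one aspect of COCOeval's crowd handling:
    ground truths with iscrowd=1 are ignored, i.e. they neither count
    as targets nor penalize the detector. Boxes are (x1, y1, x2, y2).
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter) if inter else 0.0

    # Crowd regions are dropped from the recall denominator entirely.
    targets = [g["bbox"] for g in gts if not g.get("iscrowd", 0)]
    matched = sum(
        any(iou(t, d) >= iou_thresh for d in det_boxes) for t in targets
    )
    return matched / len(targets) if targets else 1.0

gts = [
    {"bbox": (0, 0, 10, 10), "iscrowd": 0},
    {"bbox": (20, 20, 30, 30), "iscrowd": 1},  # ignored by evaluation
]
dets = [(0, 0, 10, 10)]
print(naive_recall(gts, dets))  # 1.0 -- the missed crowd box does not count
```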
Yes, YOLOv4 can be compared on test-dev, as we did in the paper. But it can't be compared on minival5k, so we didn't do this. We don't train on minival5k, but these minival5k splits are apparently different.
I don't know why Ross Girshick used this strange splitting when he was co-author of Yolov1.
Yes, maybe we should re-train YOLOv4 on COCO 2017.
Yes, the COCO evaluation ignores crowd=1 detections/truths. But crowd=1 regions can occupy part of the network capacity; I don't know how much this affects the results.
Also MSCOCO seems poorly annotated for persons. https://github.com/AlexeyAB/darknet/issues/4085
Sounds all good and looking forward to YOLOv5! In practice, COCO is not a very large image dataset and training to recognize custom object categories is quite necessary. Evaluation on how efficiently YOLO can be transferred or fine-tuned to learn new classes would be very useful and informative in that aspect.
@WongKinYiu Hi, maybe we should use COCO 2017 for training new models, for faster evaluation of them on the minival5k, while integrating them into a third-party library.
> Therefore, training on trainval35k from COCO 2014 or simply train from COCO 2017 makes it straightforward to compare different models, especially when reimplemented by independent third parties such as mmdetection and detectron2 following the same train/val splits. Therefore, the issue is just curious about why YOLO makes a DIFFERENT trainval35k to train from COCO 2014. Alternatively, why not simply train YOLOv4 on COCO 2017 train set that could be more consistent with recent work?
@AlexeyAB OK,
After finishing the current training, I will train all of the new models with COCO 2017. I successfully combined CSP into the neck part; I will train YOLOv4(CSP) with COCO 2017 tomorrow.
@WongKinYiu Will it use CSP+SAM+Mish, or just CSP for the neck? Should we use new-MiWRC for the backbone or the neck?
@AlexeyAB
Will train two models: 1) CSP+Leaky for quick evaluation, and 2) CSP+SAM+Mish, since it is the best known combination.
Currently I think MiWRC is not stable enough. When sorting by top-1 accuracy: new-MiWRC-per_channel-relu > new-MiWRC-per_feature-softmax > new-MiWRC-per_feature-relu > new-MiWRC-per_channel-softmax. I cannot find any pattern in the performance, and I think the results may be random.
TL;DR
The custom splits (`trainvalno5k.txt` and `5k.txt`) for COCO 2014 are supposed to be the same as the default splits for COCO 2017, but they are not. This could explain why YOLOv4 does not produce the same validation results on both, but a significantly better mAP on `val2017`. Also, this implies YOLO may be trained on a different training set than other object detectors, in which case a direct comparison might not be fair. Any clarifications?

According to the COCO website, both releases contain the same images and detection annotations:

The only difference is the splits, which COCO 2017 adopts from the long-time convention established on COCO 2014 in early object detection work. The detectron repo explicitly describes the COCO Minival Annotations (5k) as follows:

Therefore, `COCO_2017_train = COCO_2014_train + valminusminival` and `COCO_2017_val = minival`, where `valminusminival` and `minival` are taken from COCO 2014 val as the conventional custom splits.

Now, looking into the annotations in `instances_minival2014.json` (provided by the very original author), the last 5 files are listed as follows:

The above are the same as the last 5 files in `val2017`:

However, the `5k.txt` split used by YOLO for COCO 2014 lists the following, which is apparently different from the conventional custom splits:

It is surprising that YOLO does not follow the same convention, given that Ross Girshick is also one of the authors of the YOLOv1 paper. Any clarifications to address the confusion would be welcome.
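The 2014-to-2017 identity above can be sanity-checked with plain set arithmetic over image ids. The tiny integer ids below are placeholders for the real 2014/2017 image lists:

```python
def check_2014_to_2017_mapping(train14, val14, minival):
    """Derive the COCO 2017 splits from the documented identity:
        train2017 == train2014 | (val2014 - minival)
        val2017   == minival
    given sets of image ids. Returns the expected (train2017, val2017).
    """
    minival = set(minival)
    assert minival <= set(val14), "minival must be a subset of val2014"
    valminusminival = set(val14) - minival
    return set(train14) | valminusminival, minival

# Toy ids: train2014 has 1-4, val2014 has 5-9, minival is 8-9.
train2017, val2017 = check_2014_to_2017_mapping(
    train14={1, 2, 3, 4}, val14={5, 6, 7, 8, 9}, minival={8, 9}
)
print(sorted(train2017))  # [1, 2, 3, 4, 5, 6, 7]
print(sorted(val2017))    # [8, 9]
```

Running the same check against YOLO's `5k.txt` (instead of the conventional `minival`) is exactly where the mismatch described in this issue would show up.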