farleylai closed this issue 4 years ago
This could explain why YOLOv4 does not produce the same validation results on both but a significantly better mAP on val2017.
Yes, it seems that YOLOv4 shows worse AP on val2017 but better AP on test-dev if we use COCO 2017 with crowd=0.
http://cocodataset.org/#detection-2019
Participants are recommended but not restricted to train their algorithms on COCO 2017 train and val sets.
We trained YOLOv4 only on the train set (without val), while the COCO 2017 Task Guidelines recommend training on the train + val5k datasets.
So we used:
Yes, using train+5k and COCO 2017 instead of 2014 can increase YOLOv4 accuracy.
It is surprising that YOLO does not follow the same convention given Ross Girshick is also one of the authors in the YOLOv1 paper. Any clarifications to address the confusion would be welcome.
Ross Girshick used train+val for training Faster R-CNN and tested on the test-dev set, without a separate train/val split, so the splitting is not important: https://arxiv.org/pdf/1506.01497v3.pdf
It seems that most other networks were trained on the COCO trainval dataset for testing on test-dev: https://pjreddie.com/darknet/yolo/
Thanks for the elaboration, but the issue seems not to have been made clear at the beginning.
The motivation is the evaluation on `val2017` using `yolov4.weights`, shown below, which looks too good to be true, implying that YOLO's train/val split is quite different from COCO 2017's. After further investigation, it is also NOT the same `trainval35k` that other related work follows.
```
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.506
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.739
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.564
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.321
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.573
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.644
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.369
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.607
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.660
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.495
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.729
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.801
```
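As an aside, the summary above is plain text printed by pycocotools' `COCOeval.summarize()`. A small, purely illustrative parser (not part of pycocotools) makes it easy to pull individual numbers out for comparing runs across splits:

```python
import re

def parse_coco_summary(text):
    """Parse the stdout of pycocotools' COCOeval.summarize() into a dict.

    Keys are (metric, iou, area, max_dets) tuples; values are floats.
    This is an illustrative helper, not an official pycocotools API.
    """
    pattern = re.compile(
        r"Average (Precision|Recall)\s+\((AP|AR)\)\s+"
        r"@\[\s*IoU=([\d.:]+)\s*\|\s*area=\s*(\w+)\s*\|\s*maxDets=\s*(\d+)\s*\]"
        r"\s*=\s*([\d.-]+)"
    )
    results = {}
    for line in text.splitlines():
        m = pattern.search(line)
        if m:
            _, metric, iou, area, max_dets, value = m.groups()
            results[(metric, iou, area, int(max_dets))] = float(value)
    return results

# Three lines taken from the val2017 evaluation quoted above.
summary = """
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.506
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.739
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.801
"""
metrics = parse_coco_summary(summary)
print(metrics[("AP", "0.50:0.95", "all", 100)])  # 0.506
```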
The `trainval35k` split is from ION, earlier than Faster R-CNN (both involved Ross Girshick). It is understandable if the goal is to evaluate on `test-dev`: combining `trainval` is acceptable, but it is not a common practice nowadays because a 5k `minival` is handy for the various ablation experiments in recent object detection work, including Mask R-CNN, RetinaNet, Cascade R-CNN and more. Also, the relatively small 5k `minival` is unlikely to contribute significantly to the performance on `test-dev`. Therefore, training on `trainval35k` from COCO 2014 or simply `train` from COCO 2017 makes it straightforward to compare different models, especially when reimplemented by independent third parties such as mmdetection and detectron2 following the same train/val splits. Hence, the question is simply why YOLO makes a DIFFERENT `trainval35k` to train on from COCO 2014. Alternatively, why not simply train YOLOv4 on the COCO 2017 `train` set, which would be more consistent with recent work?
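A split mismatch like this can be checked mechanically: treat each validation split as a set of image ids and diff them. The sketch below uses toy filenames as stand-ins for the real 5k lists:

```python
def split_report(val_a, val_b):
    """Compare two validation splits given as collections of image ids.

    Returns whether they are identical, the ids unique to each side,
    and the overlap count. Toy helper for illustration only.
    """
    a, b = set(val_a), set(val_b)
    return {
        "identical": a == b,
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        "overlap": len(a & b),
    }

# Hypothetical ids standing in for COCO 2017 val vs YOLO's 5k.txt split.
minival = ["000001.jpg", "000002.jpg", "000003.jpg"]
yolo_5k = ["000002.jpg", "000003.jpg", "000004.jpg"]
report = split_report(minival, yolo_5k)
print(report["identical"])  # False
print(report["overlap"])    # 2
```

The same set arithmetic also reveals leakage: if one detector's training split intersects another detector's validation split, comparing them on that validation set is unfair.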
From Mask R-CNN:

> As in previous work [5, 27], we train using the union of 80k train images and a 35k subset of val images (`trainval35k`), and report ablations on the remaining 5k val images (`minival`). We also report results on test-dev [28].
From RetinaNet:

> We present experimental results on the bounding box detection track of the challenging COCO benchmark [21]. For training, we follow common practice [1, 20] and use the COCO `trainval35k` split (union of 80k images from train and a random 35k subset of images from the 40k image val split). We report lesion and sensitivity studies by evaluating on the `minival` split (the remaining 5k images from val). For our main results, we report COCO AP on the test-dev split.
BTW, there is little concern about the crowd setting here, since the COCO evaluation seems to ignore it for fair comparison:

> For the purpose of evaluation, areas marked as crowds will be ignored and not affect a detector's score. Details are given in the appendix.
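To illustrate the quoted rule, here is a deliberately simplified recall computation (not the real COCOeval, which additionally discards detections matched to crowd regions) where `iscrowd=1` ground truths are excluded from the targets, so missing them neither helps nor hurts:

```python
def naive_recall(gts, det_boxes, iou_thresh=0.5):
    """Toy recall mirroring one aspect of COCOeval's crowd handling:
    ground truths with iscrowd=1 are ignored, i.e. they neither count
    as targets nor penalize the detector. Boxes are (x1, y1, x2, y2).
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter) if inter else 0.0

    # Crowd regions are dropped from the recall denominator entirely.
    targets = [g["bbox"] for g in gts if not g.get("iscrowd", 0)]
    matched = sum(
        any(iou(t, d) >= iou_thresh for d in det_boxes) for t in targets
    )
    return matched / len(targets) if targets else 1.0

gts = [
    {"bbox": (0, 0, 10, 10), "iscrowd": 0},
    {"bbox": (20, 20, 30, 30), "iscrowd": 1},  # ignored by evaluation
]
dets = [(0, 0, 10, 10)]
print(naive_recall(gts, dets))  # 1.0 -- the missed crowd box does not count
```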
Yes, YOLOv4 can be compared on test-dev, as we did in the paper. But it can't be compared on minival5k, so we didn't do this. We don't train on minival5k, but these minival5k splits are apparently different.
I don't know why Ross Girshick used this strange splitting when he was co-author of Yolov1.
Yes, maybe we should re-train YOLOv4 on COCO 2017.
Yes, the COCO evaluation ignores crowd=1 detections/truths. But crowd=1 regions can occupy part of the network capacity; I don't know how much this affects the results.
Also MSCOCO seems poorly annotated for persons. https://github.com/AlexeyAB/darknet/issues/4085
Sounds all good and looking forward to YOLOv5! In practice, COCO is not a very large image dataset and training to recognize custom object categories is quite necessary. Evaluation on how efficiently YOLO can be transferred or fine-tuned to learn new classes would be very useful and informative in that aspect.
@WongKinYiu Hi, maybe we should use COCO 2017 for training new models, for faster evaluation of them on the minival5k, while integrating them into a third-party library.
> Therefore, training on trainval35k from COCO 2014 or simply train from COCO 2017 makes it straightforward to compare different models, especially when reimplemented by independent third parties such as mmdetection and detectron2 following the same train/val splits. Therefore, the issue is just curious about why YOLO makes a DIFFERENT trainval35k to train from COCO 2014. Alternatively, why not simply train YOLOv4 on COCO 2017 train set that could be more consistent with recent work?
@AlexeyAB OK,
After finishing the current training, I will train all of the new models with COCO 2017. I successfully combined CSP into the neck part; I will train YOLOv4(CSP) with COCO 2017 tomorrow.
@WongKinYiu Will it use CSP+SAM+Mish, or just CSP for the neck? Should we use new-MiWRC for the backbone or the neck?
@AlexeyAB
Will train two models: 1) CSP+Leaky for quick evaluation, and 2) CSP+SAM+Mish, since it is the best known combination.
Currently I think MiWRC is not stable enough. When sorting by top-1 accuracy: new-MiWRC-per_channel-relu > new-MiWRC-per_feature-softmax > new-MiWRC-per_feature-relu > new-MiWRC-per_channel-softmax. I cannot find any pattern in the performance, and I think the results may be random.
TL;DR
The custom splits (`trainvalno5k.txt` and `5k.txt`) for COCO 2014 are supposed to be the same as the default splits for COCO 2017, but they are not. This could explain why YOLOv4 does not produce the same validation results on both, but a significantly better mAP on `val2017`. Also, this implies YOLO may be trained on a different training set than other object detectors, in which case a direct comparison might not be fair. Any clarifications?

According to the COCO website, both releases contain the same images and detection annotations:

The only difference is the splits, which COCO 2017 adopts from the long-time convention established on COCO 2014 in early object detection work. The detectron repo explicitly describes the COCO Minival Annotations (5k) as follows:

Therefore, `COCO_2017_train = COCO_2014_train + valminusminival` and `COCO_2017_val = minival`, where `valminusminival` and `minival` are taken from COCO 2014 val as the conventional custom splits.

Now, looking into the annotations in `instances_minival2014.json` (provided by the very original author), the last 5 files are listed as follows:

The above are the same as the last 5 files in `val2017`:

However, the `5k.txt` split used by YOLO for COCO 2014 lists the following, which is apparently different from the conventional custom splits:

It is surprising that YOLO does not follow the same convention, given that Ross Girshick is also one of the authors of the YOLOv1 paper. Any clarifications to address the confusion would be welcome.
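The 2014-to-2017 identity above can be sanity-checked with plain set arithmetic over image ids. The tiny integer ids below are placeholders for the real 2014/2017 image lists:

```python
def check_2014_to_2017_mapping(train14, val14, minival):
    """Derive the COCO 2017 splits from the documented identity:
        train2017 == train2014 | (val2014 - minival)
        val2017   == minival
    given sets of image ids. Returns the expected (train2017, val2017).
    """
    minival = set(minival)
    assert minival <= set(val14), "minival must be a subset of val2014"
    valminusminival = set(val14) - minival
    return set(train14) | valminusminival, minival

# Toy ids: train2014 has 1-4, val2014 has 5-9, minival is 8-9.
train2017, val2017 = check_2014_to_2017_mapping(
    train14={1, 2, 3, 4}, val14={5, 6, 7, 8, 9}, minival={8, 9}
)
print(sorted(train2017))  # [1, 2, 3, 4, 5, 6, 7]
print(sorted(val2017))    # [8, 9]
```

Running the same check against YOLO's `5k.txt` (instead of the conventional `minival`) is exactly where the mismatch described in this issue would show up.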