LutingWang / OADP

Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection
Apache License 2.0

Reproduction of oadp_ov_coco.py #15

Open Lukas-Ma1 opened 10 months ago

Lukas-Ma1 commented 10 months ago

Thank you for your outstanding work. I ran into some problems when trying to reproduce the COCO training. First, I used your checkpoint and successfully reproduced the 31.3 mAP result, which proves that the dataset and Python environment are set up correctly.

I then trained ViLD first with the command torchrun --nproc_per_node=2 -m oadp.dp.train vild_ov_coco configs/dp/vild_ov_coco.py, and afterwards formally trained OADP on COCO with torchrun --nproc_per_node=2 -m oadp.dp.train oadp_ov_coco configs/dp/oadp_ov_coco.py. However, I do not get the correct result when evaluating the resulting checkpoint. Here is my full result:

{'COCO_17_bboxmAP': '0.1495', 'COCO_17_bbox_mAP_50': '0.2830', 'COCO_17_bbox_mAP_75': '0.1398', 'COCO_17_bbox_mAP_copypaste': '0.1495 0.2830 0.1398 0.1060 0.1788 0.1816', 'COCO_17_bbox_mAP_l': '0.1816', 'COCO_17_bbox_mAP_m': '0.1788', 'COCO_17_bbox_mAP_s': '0.1060', 'COCO_48_17_bboxmAP': '0.2673', 'COCO_48_17_bbox_mAP_50': '0.4436', 'COCO_48_17_bbox_mAP_75': '0.2798', 'COCO_48_17_bbox_mAP_copypaste': '0.2673 0.4436 0.2798 0.1750 0.2916 0.3488', 'COCO_48_17_bbox_mAP_l': '0.3488', 'COCO_48_17_bbox_mAP_m': '0.2916', 'COCO_48_17_bbox_mAP_s': '0.1750', 'COCO_48_bboxmAP': '0.3090', 'COCO_48_bbox_mAP_50': '0.5005', 'COCO_48_bbox_mAP_75': '0.3293', 'COCO_48_bbox_mAP_copypaste': '0.3090 0.5005 0.3293 0.1994 0.3316 0.4080', 'COCO_48_bbox_mAP_l': '0.4080', 'COCO_48_bbox_mAP_m': '0.3316', 'COCO_48_bbox_mAP_s': '0.1994'}

By the way, I noticed some abnormal output during the training process: the mAP of COCO_17_bbox is -1!!! Here is a randomly chosen excerpt of the training output, taken around iteration 26000/40000:

2023-11-29 19:26:42,471 - mmdet - INFO - Iter(val) [2500] COCO_48_17_bboxmAP: 0.1982, COCO_48_17_bbox_mAP_50: 0.3539, COCO_48_17_bbox_mAP_75: 0.1999, COCO_48_17_bbox_mAP_s: 0.1101, COCO_48_17_bbox_mAP_m: 0.2075, COCO_48_17_bbox_mAP_l: 0.2655, COCO_48_17_bbox_mAP_copypaste: 0.1982 0.3539 0.1999 0.1101 0.2075 0.2655, COCO_48_bboxmAP: 0.1982, COCO_48_bbox_mAP_50: 0.3539, COCO_48_bbox_mAP_75: 0.1999, COCO_48_bbox_mAP_s: 0.1101, COCO_48_bbox_mAP_m: 0.2075, COCO_48_bbox_mAP_l: 0.2655, COCO_48_bbox_mAP_copypaste: 0.1982 0.3539 0.1999 0.1101 0.2075 0.2655, COCO_17_bboxmAP: -1.0000, COCO_17_bbox_mAP_50: -1.0000, COCO_17_bbox_mAP_75: -1.0000, COCO_17_bbox_mAP_s: -1.0000, COCO_17_bbox_mAP_m: -1.0000, COCO_17_bbox_mAP_l: -1.0000, COCO_17_bbox_mAP_copypaste: -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000

And when I add --override to the command, like torchrun --nproc_per_node=2 -m oadp.dp.train vild_ov_coco configs/dp/vild_ov_coco.py --override .validator.dataloader.dataset.ann_file::data/coco/annotations/instances_val2017.48.json, the checkpoint becomes unusable:

[Screenshot 2023-11-30 09:44:37 showing the error]

Why does this happen?

It seems that some part of my experiment is wrong. How can I fix it? And could you tell me how to use the training commands correctly? Much appreciated!

LutingWang commented 10 months ago

The training scripts you used are correct:

torchrun --nproc_per_node=2 -m oadp.dp.train vild_ov_coco configs/dp/vild_ov_coco.py
torchrun --nproc_per_node=2 -m oadp.dp.train oadp_ov_coco configs/dp/oadp_ov_coco.py

However, I noticed that you are using 2 GPUs for training, while the original checkpoint was trained with 8 GPUs. With 2 GPUs and the same per-GPU batch size, the effective batch size is only a quarter of the original, so each iteration sees 4 times less data. There are a few potential ways to address this (see the sketch after the list):

  1. Increase the batch size for each GPU. Ideally, use a batch size of 8 per GPU.
  2. Use more GPUs for training.
  3. Increase the learning rate.
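
As a rough illustration of options 1 and 3, here is a minimal sketch of a config override. The field names assume MMDetection 2.x conventions (data.samples_per_gpu, optimizer.lr), and the numbers assume the original schedule used 8 GPUs x 2 images per GPU; verify both against configs/dp/oadp_ov_coco.py before relying on them.

# Hypothetical override config -- illustrative only, not part of the repo.
_base_ = ['oadp_ov_coco.py']  # path is an assumption; place next to the original config

# Option 1: raise the per-GPU batch size so that 2 GPUs x 8 images/GPU
# matches the assumed original effective batch size of 8 GPUs x 2 images/GPU.
data = dict(samples_per_gpu=8)

# Option 3 (alternative): keep samples_per_gpu as-is and scale the learning
# rate linearly with the effective batch size (linear scaling rule).
# 0.02 is a placeholder base value, not the repo's actual setting.
# optimizer = dict(lr=0.02 * (2 * 2) / (8 * 2))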

Regarding the second problem, mAP = -1 typically indicates that no ground-truth objects are present. For example, if COCO_17_bbox_mAP: -1.0000 is reported, there are no novel-category objects in the annotation file. Please verify whether this is the case (a quick check is sketched below) and, if so, regenerate the annotation files. If the problem persists, please provide more details so I can investigate further.
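
To make that check concrete, here is a minimal sketch that counts novel-category instances in a COCO-format annotation file. The path and the category names are assumptions (the names follow the commonly used OV-COCO 48/17 split); adjust both to your setup.

import json
from collections import Counter

# Illustrative path -- point this at the annotation file your validator loads.
ann_file = 'data/coco/annotations/instances_val2017.json'

with open(ann_file) as f:
    coco = json.load(f)

# Map category ids to names, then count annotations per category name.
id_to_name = {c['id']: c['name'] for c in coco['categories']}
counts = Counter(id_to_name[a['category_id']] for a in coco['annotations'])

# Assumed OV-COCO novel (17-class) split; verify against the repo's category list.
novel = {
    'airplane', 'bus', 'cat', 'dog', 'cow', 'elephant', 'umbrella', 'tie',
    'snowboard', 'skateboard', 'cup', 'knife', 'cake', 'couch', 'keyboard',
    'sink', 'scissors',
}

novel_total = sum(counts[name] for name in novel)
print(f'{novel_total} novel-category annotations found in {ann_file}')
if novel_total == 0:
    print('No novel-category objects -- this would explain mAP = -1; regenerate the file.')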

Lastly, the option --override .validator.dataloader.dataset.ann_file::data/coco/annotations/instances_val2017.48.json is intended to be used together with TRAIN_WITH_VAL_DATASET. When TRAIN_WITH_VAL_DATASET is set to True, the training dataset is replaced with the validation dataset, which can cause errors during training because the validation dataset contains novel-category objects; including the override avoids those errors. In your case, TRAIN_WITH_VAL_DATASET is not set, so adding the option on its own is likely to cause erroneous behavior. The error message in your screenshot indicates that the model produced no predictions, which is probably not caused by the override itself; still, since the option was added incorrectly, it is best to disregard that error for now and focus on the two issues above. For reference, the intended combination is sketched below.
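
For completeness, the intended combination would look roughly like the following, assuming TRAIN_WITH_VAL_DATASET is enabled via an environment variable; please confirm the exact mechanism against the repository's documentation before using it.

TRAIN_WITH_VAL_DATASET=True torchrun --nproc_per_node=2 -m oadp.dp.train vild_ov_coco configs/dp/vild_ov_coco.py --override .validator.dataloader.dataset.ann_file::data/coco/annotations/instances_val2017.48.json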