Lukas-Ma1 opened 10 months ago
The DRY_RUN mode is intended for fast validation of the code's correctness. It does this by trimming down the number of samples used for training and validation. So, getting a 0.8614 result indicates that you have set everything up correctly and you are all set for a full run.
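To picture what such a mode does, here is a minimal sketch in Python. The `maybe_trim` helper and the `DRY_RUN_SIZE` cap are illustrative assumptions, not OADP's actual implementation:

```python
import os

# Hypothetical cap on the number of samples kept in DRY_RUN mode,
# chosen here only for illustration.
DRY_RUN_SIZE = 48

def maybe_trim(samples):
    """Return a small subset of `samples` when the DRY_RUN env var is set.

    Illustrative sketch only: the idea is that with a handful of samples,
    one full pass through the pipeline finishes in minutes.
    """
    if os.environ.get('DRY_RUN', '').lower() in ('1', 'true'):
        return samples[:DRY_RUN_SIZE]
    return samples
```

With so few samples the model effectively memorizes the data it is evaluated on, which is why metrics produced in this mode (such as 0.8614 mAP) are not meaningful.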
Thank you! There are two more questions I would like to ask:
I ran TRAIN_WITH_VAL_DATASET=True torchrun --nproc_per_node=4 -m oadp.dp.train oadp_ov_coco configs/dp/oadp_ov_coco.py --override .validator.dataloader.dataset.ann_file::data/coco/annotations/instances_val2017.48.json without DRY_RUN. However, the result still does not look normal: it only reaches 0.1649 mAPN50 instead of the expected 0.313. Should I strip out all the optional parts and run 'torchrun --nproc_per_node=4 -m oadp.dp.train vild_ov_coco configs/dp/vild_ov_coco.py' instead?

When TRAIN_WITH_VAL_DATASET=True is set in the environment, it activates a debug mode in which training is performed on the validation split of the MS-COCO 2017 dataset.
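A sketch of how an environment flag like this might switch the training annotations to the validation split. The `choose_train_ann_file` helper and the file paths are assumptions for illustration, not OADP's actual code:

```python
import os

def choose_train_ann_file(data_root='data/coco/annotations'):
    """Illustrative sketch: pick the training annotation file, or the
    val split when the TRAIN_WITH_VAL_DATASET debug flag is set."""
    if os.environ.get('TRAIN_WITH_VAL_DATASET', '').lower() in ('1', 'true'):
        # Debug mode: train on the small val split to sanity-check the
        # pipeline end to end, not to measure real performance.
        return f'{data_root}/instances_val2017.json'
    return f'{data_root}/instances_train2017.json'
```

Because the model then trains on the same images it is later evaluated on, any metric from this mode overstates real performance.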
I would like to ask about COCO training and test performance in DRY_RUN mode. I used DRY_RUN with each command you mentioned, including extracting the global, object, and block features. When I ran:
DRY_RUN=True TRAIN_WITH_VAL_DATASET=True torchrun --nproc_per_node=4 -m oadp.dp.train oadp_ov_coco configs/dp/oadp_ov_coco.py --override .validator.dataloader.dataset.ann_file::data/coco/annotations/instances_val2017.48.json
for the DP training on COCO, I got a ridiculous result: 0.8614 mAP. There must be something wrong, but I have checked my process, data structure, and commands, and they all follow your steps. So is it DRY_RUN that produces this unrealistic result, and should I rerun without DRY_RUN? (By the way, the global, object, and block features extracted with DRY_RUN seem to be smaller than those extracted without it, so should I download them from the Baidu disk instead?)
Thanks for your attention and impressive work!