anishmadan23 / foundational_fsod

This repository contains the implementation for the paper "Revisiting Few Shot Object Detection with Vision-Language Models"
https://arxiv.org/abs/2312.14494
Apache License 2.0

Regarding the instruction for Many/Med/Few evaluation #2

ljj7975 opened this issue 1 week ago

ljj7975 commented 1 week ago

I ran the following command, which I assume is the zero-shot evaluation: `python train_net.py --num-gpus 4 --config-file configs/nuimages_cr/code_release_v2/naive_ft_shots10_seed_0.yaml --pred_all_class --eval-only MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth OUTPUT_DIR_PREFIX outputs`. It gave me:

[09/16 22:27:16 d2.evaluation.fast_eval_api]: COCOeval_opt.accumulate() finished in 9.17 seconds.
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.143
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.244
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.135
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.040
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.132
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.237
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.176
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.337
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.355
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.229
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.347
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.498
[09/16 22:27:16 d2.evaluation.coco_evaluation]: Evaluation results for bbox: 
|   AP   |  AP50  |  AP75  |  APs  |  APm   |  APl   |
|:------:|:------:|:------:|:-----:|:------:|:------:|
| 14.258 | 24.393 | 13.519 | 4.002 | 13.248 | 23.682 |
[09/16 22:27:16 d2.evaluation.coco_evaluation]: Per-category bbox AP: 
| category          | AP     | category       | AP     | category             | AP     |
|:------------------|:-------|:---------------|:-------|:---------------------|:-------|
| car               | 44.868 | truck          | 34.275 | construction_vehicle | 4.789  |
| bus               | 36.316 | trailer        | 1.712  | emergency            | 0.000  |
| motorcycle        | 28.965 | bicycle        | 32.003 | adult                | 22.018 |
| child             | 0.140  | police_officer | 0.310  | construction_worker  | 2.052  |
| personal_mobility | 0.805  | stroller       | 12.906 | pushable_pullable    | 0.228  |
| barrier           | 0.570  | traffic_cone   | 34.679 | debris               | 0.018  |
[09/16 22:27:18 detectron2]: Evaluation results for nuimages_all_cls_val_no_wc in csv format:
[09/16 22:27:18 d2.evaluation.testing]: copypaste: Task: bbox
[09/16 22:27:18 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[09/16 22:27:18 d2.evaluation.testing]: copypaste: 14.2585,24.3930,13.5193,4.0025,13.2478,23.6818

Q1. The AP I got is 14.2585, which is slightly different from the Detic zero-shot result reported in the paper (14.40). I want to confirm that this difference in scores is expected.

Q2. Can you guide me on what I need to change (config/setup/command) in order to get Many/Med/Few AP?

anishmadan23 commented 1 week ago

Hey, Q1. I can't recall exactly where this difference comes from, but I have seen it happen due to a model/PyTorch versioning change that occurred during the course of this project. Nevertheless, the numbers you are getting look fine.

Q2. Once you have the per-category AP, you can group the categories according to this frequency-class map:

freq_cls_map = {
    'many': ['car', 'adult', 'truck', 'barrier', 'traffic_cone'],
    'med': ['construction_vehicle', 'bus', 'trailer', 'motorcycle', 'bicycle', 'construction_worker', 'pushable_pullable'],
    'few': ['emergency', 'child', 'police_officer', 'personal_mobility', 'stroller', 'debris']
}
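For concreteness, the grouping can be sketched as below. The frequency-class map and the per-category APs are taken from this thread; the averaging itself (an unweighted mean over the categories in each bucket) is my own assumption about how the grouped numbers are computed, not code from the repo.

```python
# Group per-category bbox AP into Many/Med/Few buckets by taking an
# unweighted mean of the AP values in each bucket.

freq_cls_map = {
    'many': ['car', 'adult', 'truck', 'barrier', 'traffic_cone'],
    'med': ['construction_vehicle', 'bus', 'trailer', 'motorcycle',
            'bicycle', 'construction_worker', 'pushable_pullable'],
    'few': ['emergency', 'child', 'police_officer', 'personal_mobility',
            'stroller', 'debris'],
}

# Per-category bbox AP copied from the evaluation log above.
per_cat_ap = {
    'car': 44.868, 'truck': 34.275, 'construction_vehicle': 4.789,
    'bus': 36.316, 'trailer': 1.712, 'emergency': 0.000,
    'motorcycle': 28.965, 'bicycle': 32.003, 'adult': 22.018,
    'child': 0.140, 'police_officer': 0.310, 'construction_worker': 2.052,
    'personal_mobility': 0.805, 'stroller': 12.906, 'pushable_pullable': 0.228,
    'barrier': 0.570, 'traffic_cone': 34.679, 'debris': 0.018,
}

# Unweighted mean AP per frequency bucket.
grouped_ap = {
    bucket: sum(per_cat_ap[c] for c in cats) / len(cats)
    for bucket, cats in freq_cls_map.items()
}

for bucket, ap in grouped_ap.items():
    print(f"{bucket}: {ap:.3f}")
```

On the numbers in this thread, this prints many: 27.282, med: 15.152, few: 2.363, matching the values the user reports below.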

ljj7975 commented 1 week ago

So, are the numbers you reported in the paper simply averages of the corresponding entries in this table?

| category          | AP     | category       | AP     | category             | AP     |
|:------------------|:-------|:---------------|:-------|:---------------------|:-------|
| car               | 44.868 | truck          | 34.275 | construction_vehicle | 4.789  |
| bus               | 36.316 | trailer        | 1.712  | emergency            | 0.000  |
| motorcycle        | 28.965 | bicycle        | 32.003 | adult                | 22.018 |
| child             | 0.140  | police_officer | 0.310  | construction_worker  | 2.052  |
| personal_mobility | 0.805  | stroller       | 12.906 | pushable_pullable    | 0.228  |
| barrier           | 0.570  | traffic_cone   | 34.679 | debris               | 0.018  |

I got 27.282/15.152/2.363 (Many/Med/Few), which is slightly different from the numbers in the paper (25.83/16.59/2.32), but I guess this is close enough.

Thank you for your response! I really appreciate it.