GAP-LAB-CUHK-SZ / RfDNet

Implementation of CVPR'21: RfD-Net: Point Scene Understanding by Semantic Instance Reconstruction
https://yinyunie.github.io/RfDNet-Page/
MIT License

Different per category AP scores from the paper & potential bug in the evaluation #5

Closed ashawkey closed 3 years ago

ashawkey commented 3 years ago

Hello, thanks for the amazing work!

I'm trying to reproduce the results with the pre-trained model, but I got quite different per-category AP scores from those reported in the paper:

|           | display | bathtub | trashbin | sofa  | chair | table | cabinet | bookshelf | mAP   |
| --------- | ------- | ------- | -------- | ----- | ----- | ----- | ------- | --------- | ----- |
| paper     | 26.67   | 27.57   | 23.34    | 15.71 | 12.23 | 1.92  | 14.48   | 13.39     | 16.90 |
| reproduce | 23.13   | 15.89   | 18.00    | 41.61 | 10.13 | 0.95  | 26.35   | 9.10      | 18.14 |

Besides, there seem to be a lot of false positives at `conf_thresh = 0.05`:

```
----------iou_thresh: 0.500000----------
[eval mesh] table
[eval mesh] prec = 0.0037091005431182937 (28.0/7549.0) | rec = 0.05063291139240506 (28.0/553) | ap = 0.00946969696969697
[eval mesh] chair
[eval mesh] prec = 0.01814809908597165 (137.0/7549.0) | rec = 0.1253430924062214 (137.0/1093) | ap = 0.10131491817235834
[eval mesh] bookshelf
[eval mesh] prec = 0.002119486024639025 (16.0/7549.0) | rec = 0.07547169811320754 (16.0/212) | ap = 0.09090909090909091
[eval mesh] sofa
[eval mesh] prec = 0.007948072592396344 (60.0/7549.0) | rec = 0.5309734513274337 (60.0/113) | ap = 0.416168487597059
[eval mesh] trash_bin
[eval mesh] prec = 0.010994833752814943 (83.0/7549.0) | rec = 0.3577586206896552 (83.0/232) | ap = 0.18000805806512327
[eval mesh] cabinet
[eval mesh] prec = 0.017618227579811897 (133.0/7549.0) | rec = 0.5115384615384615 (133.0/260) | ap = 0.26358882912551806
[eval mesh] display
[eval mesh] prec = 0.008610411975096039 (65.0/7549.0) | rec = 0.3403141361256545 (65.0/191) | ap = 0.23137496193523358
[eval mesh] bathtub
[eval mesh] prec = 0.005961054444297258 (45.0/7549.0) | rec = 0.375 (45.0/120) | ap = 0.15889753331566212
```

Is this expected? Or should I use a higher confidence threshold?

yinyunie commented 3 years ago

Hi,

The original code for the paper was implemented under PyTorch 1.1.0 (and was only runnable under that version). We upgraded the code to newer PyTorch and pointnet++ libraries to make it easier for more users, and the pre-trained weights were also retrained under the new PyTorch and pointnet++ libs. There could be some differences; you can see our claim here.

ashawkey commented 3 years ago

Thanks for the clarification! I didn't expect the difference to be this large. As for the second question, I tried `conf_thresh = 0.8` but still got 3179 mesh proposals per category. Is it normal behaviour to have so many false positives?

yinyunie commented 3 years ago

Hi,

For each box proposal, we predict a shape correspondingly, so the numbers of box proposals and mesh proposals are equal. For the detection part, we followed the architecture of VoteNet; you could refer to their code regarding the false-positives problem.
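(For illustration only, a minimal sketch of that one-shape-per-proposal correspondence; `ToyShapeDecoder`, the feature dimensions, and the proposal count below are hypothetical stand-ins, not the actual RfD-Net shape generator:)

```python
import torch
import torch.nn as nn

class ToyShapeDecoder(nn.Module):
    """Toy stand-in for a shape decoder: one latent code per box proposal."""
    def __init__(self, feat_dim=128, latent_dim=32):
        super().__init__()
        self.mlp = nn.Linear(feat_dim, latent_dim)

    def forward(self, proposal_features):
        # proposal_features: (batch, N_proposals, feat_dim)
        return self.mlp(proposal_features)  # one shape latent per proposal

features = torch.randn(1, 256, 128)   # 256 box proposals
latents = ToyShapeDecoder()(features)
print(latents.shape)                   # torch.Size([1, 256, 32]) -> 256 mesh proposals
```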

Hope this addresses your questions.

ashawkey commented 3 years ago

Hi,

After checking the code again, I think there is a mistake in the online evaluation code, which causes the abnormal number of false positives. The evaluation code seems to add the same mesh proposal repeatedly for all 8 classes, even though the network has already predicted the class of the mesh (and uses that class code to generate the mesh).

The original part:

```python
# i = batch_id, ii = label, j = proposal_id
# e.g., we already know proposal j is a table, but this line adds it repeatedly as
# table, chair, sofa, ..., when evaluating mAP.
sample_idx = [(ii, j) for ii in range(config_dict['dataset_config'].num_class) for j in range(N_proposals) if pred_mask[i, j] == 1 and obj_prob[i, j] > config_dict['conf_thresh']]
```

which in my opinion should be:

```python
sem_cls_preds = sem_cls_probs.argmax(2)  # computed earlier for convenience
# e.g., only add proposal j as a table when evaluating mAP.
sample_idx = [(sem_cls_preds[i, j], j) for j in range(N_proposals) if pred_mask[i, j] == 1 and obj_prob[i, j] > config_dict['conf_thresh']]
```

This will also speed up the evaluation considerably, since the total number of proposals processed is divided by 8 (the number of classes).
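For illustration, here is a self-contained toy sketch of the two enumerations (the array shapes follow the conventions above, but `num_class`, `N_proposals`, and the random inputs are made-up values), showing that the original version produces `num_class` times as many entries as the per-predicted-class variant:

```python
import numpy as np

# Toy setup (hypothetical values, just to illustrate the counting).
num_class, N_proposals, conf_thresh = 8, 5, 0.05
i = 0  # batch index
rng = np.random.default_rng(0)

pred_mask = np.ones((1, N_proposals), dtype=int)          # all proposals kept by NMS
obj_prob = rng.uniform(0.5, 1.0, size=(1, N_proposals))   # objectness scores
sem_cls_probs = rng.random((1, N_proposals, num_class))   # per-class probabilities
sem_cls_preds = sem_cls_probs.argmax(2)                   # predicted class per proposal

# Original enumeration: every surviving proposal is added once per class.
sample_idx_all_classes = [(ii, j)
                          for ii in range(num_class)
                          for j in range(N_proposals)
                          if pred_mask[i, j] == 1 and obj_prob[i, j] > conf_thresh]

# Proposed enumeration: each proposal is added only under its predicted class.
sample_idx_pred_class = [(sem_cls_preds[i, j], j)
                         for j in range(N_proposals)
                         if pred_mask[i, j] == 1 and obj_prob[i, j] > conf_thresh]

print(len(sample_idx_all_classes))  # 40 = num_class * N_proposals
print(len(sample_idx_pred_class))   # 5  = N_proposals
```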

ashawkey commented 3 years ago

@yinyunie Looking forward to your reply. This would be helpful for people following up on your work.

yinyunie commented 3 years ago

Hi, thanks for your comments.

For the mAP calculation, we followed the eval code from VoteNet. The philosophy here is to assign each predicted proposal box its corresponding mesh, so the number of meshes equals the number of box proposals. In VoteNet, they also enter the same box proposal for every class during evaluation, since they need to calculate an AP score for each class. Could you please check: https://github.com/facebookresearch/votenet/blob/2f6d6d36ff98d96901182e935afe48ccee82d566/eval.py#L41

and their evaluation arguments:

```
python eval.py --dataset scannet --checkpoint_path log_scannet/checkpoint.tar --dump_dir eval_scannet --num_point 40000 --cluster_sampling seed_fps --use_3d_nms --use_cls_nms --per_class_proposal
```
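For what it's worth, here is a minimal toy sketch of how I understand that per-class-proposal scheme (the scoring, i.e. objectness times the per-class semantic probability, follows my reading of votenet's `parse_predictions`; all shapes and numbers below are illustrative): every surviving proposal is entered for every class, but wrong-class duplicates carry low scores, so they mostly rank below the true positives when AP is computed per class.

```python
import numpy as np

# Hypothetical toy inputs (shapes follow the votenet-style conventions used above).
num_class, N_proposals, conf_thresh = 8, 4, 0.05
i = 0  # batch index
rng = np.random.default_rng(1)

obj_prob = rng.uniform(0.5, 1.0, size=(1, N_proposals))     # objectness per proposal
sem_cls_probs = rng.random((1, N_proposals, num_class))
sem_cls_probs /= sem_cls_probs.sum(axis=2, keepdims=True)   # normalize to probabilities
pred_mask = np.ones((1, N_proposals), dtype=int)

# Per-class proposals: proposal j is added to class ii with confidence
# sem_cls_probs[i, j, ii] * obj_prob[i, j].
per_class_entries = {
    ii: [(j, float(sem_cls_probs[i, j, ii] * obj_prob[i, j]))
         for j in range(N_proposals)
         if pred_mask[i, j] == 1 and obj_prob[i, j] > conf_thresh]
    for ii in range(num_class)
}

for ii, entries in per_class_entries.items():
    print(ii, entries)  # each class sees all proposals, with class-weighted scores
```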

Hope this helps you.

Best regards, Yinyu

ashawkey commented 3 years ago

Hi, so this is the intended behaviour. Thanks a lot!