DerrickXuNu / OpenCOOD

[ICRA 2022] An opensource framework for cooperative detection. Official implementation for OPV2V.
https://mobility-lab.seas.ucla.edu/opv2v/

Inappropriate late fusion range? and Rethinking the AP metric without global sorting #104

Closed yifanlu0227 closed 1 year ago

yifanlu0227 commented 1 year ago

Hi Runsheng, first and foremost, I'd like to express my gratitude for your framework again! I hope you don't mind me pointing out a few parts of the code that may be problematic. I believe discussing them together would greatly benefit our research on collaborative perception.

Inappropriate late fusion range?

In your config file, the detection range for the single-agent model is [±70.4 m, ±40 m]. This is reasonable for training, but it differs from intermediate fusion at evaluation time, which uses a detection range of [±140.8 m, ±40 m].

In late fusion evaluation, the ground truth covers [±140.8 m, ±40 m], but the detections are limited to [±70.4 m, ±40 m]. This wipes out potential detection boxes at long range (between 70 m and 140 m), which may be unfair.
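To make concrete what the evaluation range does, here is a minimal, generic sketch (not OpenCOOD's actual helper; the function and variable names are only illustrative) of how box centers get cropped to a range before matching. Widening the late fusion range just means changing the limit from ±70.4 m to ±140.8 m in x:

```python
import numpy as np

def mask_boxes_outside_range(box_centers, limit_range):
    """Keep boxes whose (x, y) center lies inside limit_range = [x_min, y_min, x_max, y_max].

    box_centers: (N, 2) array of box centers in the ego frame (meters).
    Returns a boolean mask of shape (N,).
    """
    x_min, y_min, x_max, y_max = limit_range
    x, y = box_centers[:, 0], box_centers[:, 1]
    return (x >= x_min) & (x <= x_max) & (y >= y_min) & (y <= y_max)

# Late fusion evaluation range as currently configured vs. the widened one.
late_fusion_range = [-70.4, -40.0, 70.4, 40.0]     # drops boxes beyond 70.4 m
intermediate_range = [-140.8, -40.0, 140.8, 40.0]  # matches intermediate fusion

centers = np.array([[65.0, 10.0], [120.0, -5.0]])  # second box lies beyond 70.4 m
print(mask_boxes_outside_range(centers, late_fusion_range))   # [ True False]
print(mask_boxes_outside_range(centers, intermediate_range))  # [ True  True]
```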

I changed the detection range in late fusion evaluation to [±140.8 m, ±40 m], and you can see the difference. Some distant boxes can now be detected, but it also introduces more false positives. Below I list the AP without global sorting, as your code currently computes it.

(image attached)

Rethinking the AP metric without global sorting

You mentioned in issue #101 that "since all methods using opencood all use locally sorted evaluation, it should be a fair comparison." However, I would argue that this assumption is taken for granted and may not hold in practice.

Actually, after globally sorting all detections, the false positive samples, which usually come with low confidence scores, are ranked at the bottom. That is to say, with the correct AP calculation, the tolerance for false positives is higher than when no global sorting is performed. This is especially evident for late fusion, but it affects all methods.
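To make the difference concrete, here is a minimal, self-contained sketch (not OpenCOOD's eval_utils code; the `frames` structure and function names are only for illustration) of the two ways of turning per-frame detections into an AP. "Local sorting" here means sorting only within each frame and concatenating the per-frame TP/FP sequences in frame order, while "global sorting" pools all detections and sorts them once by confidence:

```python
import numpy as np

def voc_ap(precision, recall):
    """All-point interpolated AP (VOC2010-style)."""
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

def accumulate(tp_flags, num_gt):
    """Turn an ordered TP/FP sequence into precision/recall curves."""
    tp = np.asarray(tp_flags, dtype=float)
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-9)
    return precision, recall

def ap_local_sort(frames):
    """Sort detections only inside each frame, then concatenate the per-frame
    TP/FP sequences in frame order (no re-sorting across frames)."""
    tp_flags, num_gt = [], 0
    for f in frames:
        order = np.argsort(-np.asarray(f["scores"]))
        tp_flags.extend(np.asarray(f["tp"])[order].tolist())
        num_gt += f["num_gt"]
    return voc_ap(*accumulate(tp_flags, num_gt))

def ap_global_sort(frames):
    """Pool detections from all frames and sort them once by confidence,
    so low-confidence false positives sink to the bottom of the ranking."""
    scores = np.concatenate([f["scores"] for f in frames])
    tp_flags = np.concatenate([f["tp"] for f in frames])
    order = np.argsort(-scores)
    num_gt = sum(f["num_gt"] for f in frames)
    return voc_ap(*accumulate(tp_flags[order], num_gt))

# Toy example: frame 1 contains one low-confidence false positive.
frames = [
    {"scores": np.array([0.9, 0.2]), "tp": np.array([1, 0]), "num_gt": 2},
    {"scores": np.array([0.8, 0.7]), "tp": np.array([1, 1]), "num_gt": 2},
]
print(ap_local_sort(frames))   # 0.625
print(ap_global_sort(frames))  # 0.75
```

In the toy example the low-confidence false positive is interleaved with high-confidence true positives under local sorting, but ranked last under global sorting, which is why the globally sorted AP is higher.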

I want to show some results with and without global sorting in the AP calculation; all models are downloaded from the README.

AP50

| Method | Late Fusion [±70.4 m, ±40 m] | Late Fusion [±140.8 m, ±40 m] | AttFusion [±140.8 m, ±40 m] |
| --- | --- | --- | --- |
| AP w/o global sorting | 0.858 | 0.824 | 0.905 |
| AP w/ global sorting | 0.905 | 0.925 | 0.936 |
| Increase | 0.047 | 0.101 | 0.031 |

AP70

| Method | Late Fusion [±70.4 m, ±40 m] | Late Fusion [±140.8 m, ±40 m] | AttFusion [±140.8 m, ±40 m] |
| --- | --- | --- | --- |
| AP w/o global sorting | 0.781 | 0.736 | 0.815 |
| AP w/ global sorting | 0.855 | 0.867 | 0.881 |
| Increase | 0.074 | 0.131 | 0.066 |

Late fusion tends to predict more false positives, but with low confidence. Under the correct AP calculation, it shows very good performance, very close to AttFusion.

I believe the current AP calculation method in the repository may be masking some issues. The best practice would be to use the correct AP calculation with global sorting. I look forward to discussing this with you. OpenCOOD is a pioneer in collaborative perception, and more and more researchers are using it. Even if adjusting the reported values in the repo entails significant work, it may still be worth notifying users about this potential issue. What do you think?

DerrickXuNu commented 1 year ago

Hi Yifan,

Thanks for raising this issue and discussing it with me! Here are my responses to your two questions.

1. The reason I set the evaluation range shorter for late fusion was that I found making it consistent with intermediate fusion decreases its AP, as your table above also shows. But if we use global sorting, then yes, increasing the range is reasonable.
2. Instead of replacing the original table, I believe we should add an additional table showing the results with global sorting, and researchers can choose whichever they want as long as their comparison methods are consistent. In real-world deployment, too many false positives lead to unsafe planning, so the old metric still has practical meaning.

If point 2) makes sense to you, could you make a PR to add the global-sorting evaluation function? Note that it shouldn't overwrite the old one; users should be able to pass a flag such as --eval_global_sort to inference.py to choose the evaluation method they want. I will add the table using the new function.
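For illustration only, a sketch of how such a flag might be wired into opencood/tools/inference.py. The flag name follows the proposal above, and the evaluation call shown in the trailing comment is indicative rather than the exact signature merged in the PR:

```python
import argparse

def test_parser():
    parser = argparse.ArgumentParser(description="OpenCOOD inference")
    parser.add_argument("--model_dir", type=str, required=True,
                        help="Path to the trained model folder")
    parser.add_argument("--fusion_method", type=str, default="late",
                        help="late, early or intermediate")
    parser.add_argument("--eval_global_sort", action="store_true",
                        help="Sort all detections across frames by confidence "
                             "before computing AP, instead of per-frame sorting")
    return parser.parse_args()

# Later, when the final metric is computed, the flag would be forwarded to the
# evaluation routine, e.g. (illustrative call, not the confirmed API):
#   eval_utils.eval_final_results(result_stat, opt.model_dir,
#                                 global_sort_detections=opt.eval_global_sort)
```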

Again, thanks for proposing these suggestions to the repo, and I believe it is the researchers like you that make the whole field better and better.

yifanlu0227 commented 1 year ago

Fixed in PR #105. A global-sort flag for the AP calculation has been added to inference.py.