XuyangBai / TransFusion

[PyTorch] Official implementation of CVPR2022 paper "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers". https://arxiv.org/abs/2203.11496
Apache License 2.0
619 stars 76 forks

Some questions about an unfair comparison #48

Closed: AndyYuan96 closed this issue 2 years ago

AndyYuan96 commented 2 years ago

Hi Xuyang, I reproduced the results of the TransFusion-L pillar model. It does improve over CenterPoint-pillar, so I wondered where the improvement comes from and ran some experiments. When I train CenterPoint-pillar with some of the training settings used by TransFusion (10 sweeps, valid_mask=false in the dataset config, and the GT-aug fade strategy), CenterPoint's performance also improves. Interestingly, CenterPoint's NDS is then more than 1 point higher than TransFusion-L-pillar's, while its mAP is only 0.1 point lower. Note that I ran each experiment only once. So it is an interesting question whether the transformer is really better.
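For readers unfamiliar with the fade strategy mentioned above, here is a minimal sketch. It assumes that "fade" means switching off ground-truth sampling augmentation for the final epochs of training. The step names, the 20-epoch schedule, and the 5-epoch fade window are hypothetical placeholders, not values taken from this thread or from either repository's config.

```python
def use_gt_sampling(epoch: int, total_epochs: int, fade_epochs: int) -> bool:
    """Return True while GT-sampling augmentation should stay enabled.

    Epochs are 0-indexed; the last `fade_epochs` epochs train without it.
    """
    return epoch < total_epochs - fade_epochs


def pipeline_for_epoch(epoch: int, total_epochs: int = 20, fade_epochs: int = 5) -> list:
    """Build a list of (hypothetical) pipeline step names for one epoch."""
    steps = ["load_points", "load_sweeps", "load_annotations"]
    if use_gt_sampling(epoch, total_epochs, fade_epochs):
        # Pasted ground-truth objects are only added before the fade window.
        steps.append("gt_sampling")
    steps += ["global_rot_scale_trans", "random_flip", "point_shuffle"]
    return steps


if __name__ == "__main__":
    for epoch in (0, 14, 15, 19):
        print(epoch, "gt_sampling" in pipeline_for_epoch(epoch))
```

The point of the sketch is that the strategy is independent of the detection head, which is why it can be applied to CenterPoint as well as to TransFusion.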

XuyangBai commented 2 years ago

Hi @AndyYuan96, thanks for your interest in our work. The fade strategy indeed has a large effect on the final performance by reducing many false positives (FP), and I have never hidden this fact. Similar to your experiment, we tried TransFusionL_VoxelNet without the fade strategy and found that it achieves 61.4 mAP & 68.1 NDS on the nuScenes benchmark; you can check it here. This already surpasses CenterPoint (58.1 mAP & 67.1 NDS) under similar settings.
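Since both mAP and NDS are quoted in this thread, a short sketch of how NDS is composed may help explain why the two metrics can move in different directions. It follows the nuScenes definition, in which NDS is a weighted sum of mAP and five true-positive error terms (translation, scale, orientation, velocity, attribute), each clipped to 1. The numbers in the example are made up for illustration and are not results from any model discussed here.

```python
def nds(mean_ap: float, tp_errors: list) -> float:
    """nuScenes detection score from mAP and the five mean TP errors.

    NDS = (5 * mAP + sum(1 - min(1, err) for each TP error)) / 10
    """
    if len(tp_errors) != 5:
        raise ValueError("expected five TP error terms")
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


if __name__ == "__main__":
    # Illustrative values only: model B has higher mAP but larger TP errors.
    model_a = nds(0.60, [0.3] * 5)
    model_b = nds(0.61, [0.4] * 5)
    print(round(model_a, 3), round(model_b, 3))
```

In this made-up case the model with the slightly higher mAP ends up with the lower NDS, because its true-positive errors are larger.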

Personally, I think the Transformer is good because: 1) the transformer-based detection head with bipartite matching loss alleviates the need for NMS post-processing and makes the detection system end-to-end and fast; 2) the transformer architecture is naturally compatible with our soft-association idea, where multi-modal fusion is conducted with an attention mechanism that dynamically learns what and where to fuse; 3) attention provides the chance to leverage long-range dependencies and a large receptive field, which could be helpful for detection or other downstream tasks.
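To illustrate point 1, here is a minimal sketch of one-to-one bipartite matching between predictions and ground-truth boxes. It is a brute-force search that is only practical for tiny inputs and is not the TransFusion implementation; real detection heads typically use the Hungarian algorithm. The cost matrix is an invented example, and the function assumes a non-empty matrix with at least as many predictions as ground truths.

```python
from itertools import permutations


def bipartite_match(cost):
    """Find the minimum-cost one-to-one assignment by exhaustive search.

    cost[i][j] is the cost of assigning prediction i to ground truth j.
    Returns ({gt_index: pred_index}, total_cost).
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best_preds, best_total = None, float("inf")
    # Try every way of picking a distinct prediction for each ground truth.
    for preds in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(preds))
        if total < best_total:
            best_preds, best_total = preds, total
    return {g: p for g, p in enumerate(best_preds)}, best_total


if __name__ == "__main__":
    # Three predictions, two ground truths (made-up costs).
    cost = [
        [0.9, 0.1],
        [0.2, 0.8],
        [0.3, 0.25],
    ]
    assignment, total = bipartite_match(cost)
    print(assignment, round(total, 3))
```

Each ground truth receives exactly one prediction, and the leftover prediction is treated as "no object" during training. Because the head is trained to produce one prediction per object, duplicate suppression with NMS becomes less necessary at inference time.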

AndyYuan96 commented 2 years ago

I get it. I just wanted to share the results for the pillar backbone. I will also try the voxel backbone to see the result.