autonomousvision / transfuser

[PAMI'23] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving; [CVPR'21] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
MIT License

Relationship between image resolution and backbone. #195

Closed xanhug closed 10 months ago

xanhug commented 10 months ago

Thanks for your amazing work! I noticed that in 'Multi-Modal Fusion Transformer for End-to-End Autonomous Driving,' ResNet34 and ResNet18 were used for the image and LiDAR branches, respectively, and the number of feature channels before flattening was 512. However, in 'TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving,' RegNetY-3.2GF was used as the backbone, and the number of feature channels before flattening was 1512. Is this change possibly motivated by the fact that the latter takes images concatenated from three different perspectives as input to the image branch, and therefore needs more feature channels to capture the additional information?

Kait0 commented 10 months ago

No, the 1512 channels are a result of using RegNetY-3.2GF, which always has 1512 channels in its final stage regardless of the input size of the image. RegNets are, for the most part, well-tuned ResNets, which means hyperparameters such as the number of channels differ between the two architectures.

xanhug commented 10 months ago

> No, the 1512 channels are a result of using RegNetY-3.2GF which always has 1512 channels at the end regardless of the input size of the image. RegNets are well tuned ResNets for the most part, which means hyperparameters like number of channels differ between the architecture.

Thanks for your reply.