autonomousvision / transfuser

[PAMI'23] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving; [CVPR'21] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
MIT License
1.11k stars 185 forks

Hi, some questions about the model! #35

Closed. roger-cv closed this issue 2 years ago

roger-cv commented 2 years ago

Hi, thanks for this nice work. I am currently trying to adapt this model to my own work. Why is the "average" operation required here? After it, the shape of the tensor changes from 64x64 to 8x8. However, according to the description "fusion at (B,64,64,64)", the tensor should still be 64x64. (Attached screenshot: QQ截图20211202152824)

ap229997 commented 2 years ago

The fusion can also be done at 64x64 resolution but that would be too computationally expensive since a transformer is used (quadratic complexity due to attention), so I reduced the size to 8x8 at each resolution of the intermediate feature maps.
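To put rough numbers on the cost argument above, here is a minimal sketch that counts pairwise attention scores per head. It assumes two modalities (image and LiDAR) whose spatial locations each become one token, and that all tokens attend to each other; `attention_score_entries` is a hypothetical helper for illustration, not a function from the repository.

```python
def attention_score_entries(height, width, num_modalities=2):
    """Count tokens and pairwise attention scores for one fused feature grid.

    Assumption: every spatial location of every modality is one token and
    full self-attention is applied over all of them.
    """
    tokens = num_modalities * height * width
    return tokens, tokens * tokens


full_tokens, full_scores = attention_score_entries(64, 64)
pooled_tokens, pooled_scores = attention_score_entries(8, 8)

print(full_tokens, full_scores)      # 8192 67108864
print(pooled_tokens, pooled_scores)  # 128 16384
print(full_scores // pooled_scores)  # 4096
```

Under these assumptions, pooling to 8x8 shrinks the attention score matrix by a factor of 4096 per head and per layer.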

roger-cv commented 2 years ago

Thanks for your quick reply. So, if I understand correctly, the input feature map to the transformer at each layer is downsampled to 8x8?
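For reference, the downsampling being discussed can be sketched as plain average pooling over non-overlapping windows. This is a pure-Python stand-in on a single 64x64 channel, assuming the input size is divisible by the output size; `average_pool_2d` is a hypothetical helper, and the model itself would apply a pooling layer to batched multi-channel tensors.

```python
def average_pool_2d(grid, out_h, out_w):
    """Average non-overlapping windows of a 2D list down to out_h x out_w.

    Assumption: the input height and width are divisible by the output size.
    """
    in_h, in_w = len(grid), len(grid[0])
    if in_h % out_h or in_w % out_w:
        raise ValueError("input size must be divisible by output size")
    win_h, win_w = in_h // out_h, in_w // out_w
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect all values of the (i, j) window and take their mean.
            window = [
                grid[i * win_h + di][j * win_w + dj]
                for di in range(win_h)
                for dj in range(win_w)
            ]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# One 64x64 channel filled with a simple ramp, standing in for a feature map.
feature_map = [[float(r * 64 + c) for c in range(64)] for r in range(64)]
pooled = average_pool_2d(feature_map, 8, 8)
print(len(pooled), len(pooled[0]))  # 8 8
```

Each output cell summarizes an 8x8 patch of the input, so the transformer sees 64 tokens per feature map instead of 4096.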

ap229997 commented 2 years ago

That's correct. There are now several transformer variants that address the quadratic complexity of attention (e.g. Linformer), so it may be possible to use the transformer without downsampling.
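As a rough illustration of why such variants help: Linformer projects keys and values along the sequence dimension to a fixed length k, so the score matrix has n x k entries instead of n x n. The sketch below only compares those counts. The token count assumes two modalities at 64x64 resolution, and `k = 256` is a hypothetical choice, not a value from the repository or the thread.

```python
def score_entries_full(n):
    """Entries in the score matrix for full self-attention over n tokens."""
    return n * n


def score_entries_projected(n, k):
    """Entries when keys/values are projected to k positions (Linformer-style)."""
    return n * k


n = 2 * 64 * 64  # assumption: image + LiDAR tokens at 64x64, no pooling
k = 256          # hypothetical projected length

print(score_entries_full(n))          # 67108864
print(score_entries_projected(n, k))  # 2097152
print(score_entries_full(n) // score_entries_projected(n, k))  # 32
```

The saving grows linearly with n for a fixed k, which is what would make fusion at full resolution more plausible, though whether accuracy holds up would need to be tested.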

roger-cv commented 2 years ago

> That's correct. There are now several transformer variants that address the quadratic complexity of attention (e.g. Linformer), so it may be possible to use the transformer without downsampling.

OK. Another interesting question: can this transformer-based fusion be replaced with other transformers, such as Swin or PVT? I ask because I noticed that this transformer is based on GPT, which was designed for NLP.

ap229997 commented 2 years ago

I agree, architecture design can be improved quite a bit.

roger-cv commented 2 years ago

OK, nice work. Thanks for your reply.

Kin-Zhang commented 2 years ago

> I agree, architecture design can be improved quite a bit.

But it may require more resources to train...