autonomousvision / transfuser

[PAMI'23] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving; [CVPR'21] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
MIT License
1.1k stars 185 forks

Cross-Modal Attention Statistics #236

MCUBE-2023 closed 3 weeks ago

MCUBE-2023 commented 4 weeks ago

Hi,

Table 4 (Cross-Modal Attention Statistics) in your article reports the % of tokens for which at least 1 of the top-5 attended tokens belongs to the other modality, for each head of the transformers T1, T2, T3, T4. My question: is this % computed for each frame and then averaged over the total number of frames? Or does the % in Table 4 come from a single sample frame?

ap229997 commented 3 weeks ago

The % is computed across all the frames; more details here.
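
For readers who want to reproduce a statistic of this kind, here is a minimal NumPy sketch of one way to pool it across frames. It is not the repository's evaluation code. The function name `cross_modal_percentage`, the parameter `n_image_tokens`, the `(frames, heads, tokens, tokens)` layout, and the assumption that image tokens come first in the sequence followed by LiDAR tokens are all hypothetical choices made for illustration.

```python
import numpy as np


def cross_modal_percentage(attn, n_image_tokens, k=5):
    """Percent of query tokens with at least one top-k key in the other modality.

    attn: array of shape (frames, heads, tokens, tokens); each row holds the
          attention weights of one query token over all key tokens.
    n_image_tokens: number of leading tokens assumed to be image tokens; the
          remaining tokens are assumed to be LiDAR tokens (illustrative layout).
    Returns one percentage per head, pooled over all frames and tokens.
    """
    attn = np.asarray(attn)
    n_tokens = attn.shape[-1]

    # Modality label per token: 0 = image, 1 = LiDAR (assumed ordering).
    modality = (np.arange(n_tokens) >= n_image_tokens).astype(int)

    # Indices of the k most-attended key tokens for every query token.
    topk = np.argsort(attn, axis=-1)[..., -k:]            # (F, H, T, k)

    # Compare the modality of each top-k key with the query's own modality.
    key_modality = modality[topk]                          # (F, H, T, k)
    query_modality = modality[None, None, :, None]         # (1, 1, T, 1)
    crosses = (key_modality != query_modality).any(axis=-1)  # (F, H, T)

    # Pool over frames (axis 0) and query tokens (axis 2), keep heads.
    return 100.0 * crosses.mean(axis=(0, 2))
```

When every frame has the same number of tokens, pooling over all frames like this gives the same number as computing the % per frame and then averaging the per-frame values.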

MCUBE-2023 commented 2 weeks ago

@ap229997 Thank you so much!