TencentARC / ViT-Lens

[CVPR 2024] ViT-Lens: Towards Omni-modal Representations
https://ailab-cvc.github.io/seed/vitlens/

Why use the cross-attention instead of only self-attention when implementing perceiver layers? #2

Closed: yifliu3 closed this issue 10 months ago

yifliu3 commented 10 months ago

Hi.

I read the code and found that you implement the perceiver as 4 cross-attention layers, each followed by 4 self-attention layers. I'm curious: why not just use 16 or fewer self-attention layers?

StanLei52 commented 10 months ago

The rationale for using cross-attention layers is to reduce the length of the input to the pretrained ViT (e.g., from 512 to 196 tokens for ViT-B/16), which lowers the computational cost while maintaining good performance. Using 4 perceiver blocks yields slightly better performance than other configurations, and we reduce the number of parameters by sharing parameters among the perceiver blocks.

For 3D shapes, we haven't run experiments using only self-attention layers as the perceiver, mainly because the input sequence to the ViT is long (512 point patches) and the computation would become more expensive when scaling to a larger ViT. However, I believe using self-attention layers should work fine and is certainly worth trying.
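For readers skimming the thread, here is a minimal sketch (not the repo's actual code) of the idea being discussed: a perceiver-style block where a short set of learned latent tokens cross-attends to the long modality sequence (e.g., 512 point patches), then refines itself with self-attention, so only the shorter latent sequence is passed to the pretrained ViT. All layer counts, dimensions, and names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PerceiverBlock(nn.Module):
    """One block: cross-attention (latents -> inputs) followed by self-attention over latents."""
    def __init__(self, dim=768, num_heads=12, num_self_layers=4):
        super().__init__()
        # Cross-attention: the short latent sequence queries the long input sequence.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        # Self-attention layers refine the latents among themselves.
        self.self_layers = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, num_heads, dim_feedforward=4 * dim,
                                       batch_first=True, norm_first=True)
            for _ in range(num_self_layers)
        ])

    def forward(self, latents, inputs):
        # latents: (B, 196, dim); inputs: (B, 512, dim)
        attn_out, _ = self.cross_attn(self.norm_q(latents),
                                      self.norm_kv(inputs),
                                      self.norm_kv(inputs))
        latents = latents + attn_out  # residual over the latent stream
        for layer in self.self_layers:
            latents = layer(latents)
        return latents  # (B, 196, dim): shorter sequence to feed the ViT


class PerceiverResampler(nn.Module):
    """Stacks several perceiver blocks; parameters can optionally be shared across blocks."""
    def __init__(self, dim=768, num_latents=196, num_blocks=4, share_params=True):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        if share_params:
            block = PerceiverBlock(dim)
            self.blocks = nn.ModuleList([block] * num_blocks)  # same module reused
        else:
            self.blocks = nn.ModuleList([PerceiverBlock(dim) for _ in range(num_blocks)])

    def forward(self, inputs):
        latents = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        for block in self.blocks:
            latents = block(latents, inputs)
        return latents


# Usage sketch: 512 point-patch embeddings -> 196 latent tokens for a ViT-B/16-style encoder.
resampler = PerceiverResampler()
point_tokens = torch.randn(2, 512, 768)
vit_input = resampler(point_tokens)  # shape: (2, 196, 768)
```

The cross-attention step is what makes the subsequent ViT cost independent of the raw modality sequence length; an all-self-attention alternative would instead keep the full 512-token sequence throughout.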

yifliu3 commented 10 months ago

Got it. Thanks a lot for your response.