yifliu3 closed this issue 10 months ago
The rationale for using cross-attention layers is to reduce the length of the input to the pretrained ViT (e.g., from 512 to 196 for ViT-B/16), lowering the computational complexity while maintaining good performance. Using 4 perceiver blocks yields slightly better performance than other configurations, and we can reduce the number of parameters by sharing parameters among the perceiver blocks.
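As a rough back-of-envelope sketch of the savings (the dimensions below are illustrative assumptions, not values from the repo): self-attention over a length-n sequence costs on the order of n² · d per layer, so shrinking the ViT input from 512 point patches to 196 latent tokens cuts the quadratic term by roughly (512/196)² ≈ 6.8×.

```python
# Back-of-envelope cost comparison (illustrative only, not repo numbers):
# one self-attention layer does QK^T (n^2 * d MACs) plus the
# attention-weighted sum of values (another n^2 * d MACs).

def attn_cost(seq_len: int, dim: int = 768) -> int:
    """Approximate multiply-accumulates for one self-attention layer."""
    return 2 * seq_len * seq_len * dim

full = attn_cost(512)     # attending directly over 512 point patches
reduced = attn_cost(196)  # attending over 196 latent tokens (ViT-B/16)

print(f"512-token layer : {full:,} MACs")
print(f"196-token layer : {reduced:,} MACs")
print(f"ratio           : {full / reduced:.2f}x")  # -> 6.82x
```

The cross-attention itself still touches all 512 patches, but only linearly (196 queries × 512 keys), which is why the overall cost drops.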
For 3D shapes, we haven't conducted experiments using self-attention layers as the perceiver, mainly because the input sequence to the ViT is long (512 point patches) and the computation becomes more expensive when scaling to a larger ViT. However, I believe using self-attention layers should work fine and is certainly worth trying.
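To make the parameter-sharing point above concrete, here is a hypothetical count (d_model=768 and bias-free Q/K/V/output projections are my assumptions for illustration, not values read from the repo): four perceiver blocks that share weights cost the same as one block, versus 4× that when each block has its own weights.

```python
# Hypothetical parameter-count sketch; sizes are assumptions, not repo values.

def attn_params(dim: int = 768) -> int:
    """Q, K, V and output projection weights of one attention layer."""
    return 4 * dim * dim

def perceiver_params(n_blocks: int, self_per_block: int,
                     shared: bool, dim: int = 768) -> int:
    """Each block = 1 cross-attention + `self_per_block` self-attention
    layers; with sharing, every block reuses the same weights."""
    per_block = (1 + self_per_block) * attn_params(dim)
    return per_block if shared else n_blocks * per_block

unshared = perceiver_params(4, 4, shared=False)
shared = perceiver_params(4, 4, shared=True)
print(f"4 independent blocks : {unshared:,} attention params")
print(f"4 shared blocks      : {shared:,} attention params")
```

This ignores MLPs and norms, but it shows why sharing across the 4 blocks keeps the perceiver lightweight relative to stacking 16 independent self-attention layers.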
Got it. Thanks a lot for your response.
Hi.
I read the code and found that you implement the perceiver as 4 cross-attention layers with 4 self-attention layers each, and I'm curious why you don't just use 16 or fewer self-attention layers?