Open Divadi opened 2 years ago
Thank you for open-sourcing your work. I was wondering, did you perform an ablation on the impact SMCA has on the network? Apologies if I missed it in the paper.

Also, I found that you make extensive use of positional encodings learned from coordinates, with no (I think) use of sin/cos encodings. Did you ever try the latter? Were the former much better?

Hi @Divadi, the impact of SMCA (more precisely, the second transformer decoder layer) is ablated in Table 7. I didn't provide a performance comparison between spatially modulated cross-attention and traditional cross-attention, but in my experience the latter converges much more slowly and reaches a weaker final performance.

I have only tried learned positional encodings, but sin/cos or Fourier positional encodings are also worth trying.
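For concreteness, here is a minimal sketch of the two positional-encoding alternatives discussed above: a fixed sin/cos encoding computed from continuous coordinates, and a learned encoding produced by a small MLP over the raw coordinates. The function and module names here are illustrative, not the ones used in this repository, and the exact formulation (normalization, temperature, MLP depth) is an assumption.

```python
import math

import torch
import torch.nn as nn


def sincos_pos_encoding(coords: torch.Tensor, num_feats: int,
                        temperature: float = 10000.0) -> torch.Tensor:
    """Fixed sin/cos encoding of continuous coordinates.

    coords: (N, D) tensor of coordinates normalized to [0, 1].
    Returns a (N, D * num_feats) encoding.
    """
    # Frequencies spaced geometrically, as in the usual transformer encoding.
    dim_t = torch.arange(num_feats // 2, dtype=torch.float32)
    dim_t = temperature ** (2 * dim_t / num_feats)
    # (N, D, num_feats // 2): each coordinate scaled by each frequency.
    pos = coords.unsqueeze(-1) * 2 * math.pi / dim_t
    # Interleave sin and cos, then flatten the coordinate axis.
    enc = torch.cat([pos.sin(), pos.cos()], dim=-1)  # (N, D, num_feats)
    return enc.flatten(1)


class LearnedPosEncoding(nn.Module):
    """Learned encoding: a small MLP mapping raw coordinates to d_model."""

    def __init__(self, coord_dim: int, d_model: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(coord_dim, d_model),
            nn.ReLU(inplace=True),
            nn.Linear(d_model, d_model),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.mlp(coords)


# Example: encode 100 query positions in 2-D to a 256-dim embedding.
coords = torch.rand(100, 2)
fixed = sincos_pos_encoding(coords, num_feats=128)  # (100, 256), no parameters
learned = LearnedPosEncoding(coord_dim=2, d_model=256)(coords)  # (100, 256)
```

The fixed variant has no parameters and extrapolates smoothly to unseen coordinates, while the learned MLP can adapt its frequency content to the data, which may explain the difference in convergence behavior.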