Does the visn-lang-attention share the same weights as the lang-visn-attention in the cross layers? I wonder whether the performance could be better if they used different weights. Did you try it?
Yes, I tried it. The results with and without sharing are almost the same; sharing is slightly better (~0.5%) on downstream tasks, so I share the weights to save parameters.
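For anyone curious what "sharing" means concretely, here is a minimal PyTorch-style sketch, assuming a standard multi-head attention module. The class name `SharedCrossAttention` and the shapes are illustrative only, not the actual classes in this repo: the same attention module (and thus the same projection weights) is applied in both directions of the cross layer.

```python
import torch
import torch.nn as nn

class SharedCrossAttention(nn.Module):
    """Illustrative cross layer: one attention module reused for both directions."""
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        # A single attention module; reusing it halves the cross-attention parameters.
        self.cross_att = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, lang_feats, visn_feats):
        # Language attends to vision (lang as query, visn as key/value) ...
        lang_out, _ = self.cross_att(lang_feats, visn_feats, visn_feats)
        # ... and vision attends to language with the very same weights.
        visn_out, _ = self.cross_att(visn_feats, lang_feats, lang_feats)
        return lang_out, visn_out

# Toy usage: batch of 2, 20 language tokens, 36 visual regions, hidden size 768.
lang = torch.randn(2, 20, 768)
visn = torch.randn(2, 36, 768)
lang_out, visn_out = SharedCrossAttention()(lang, visn)
print(lang_out.shape, visn_out.shape)  # torch.Size([2, 20, 768]) torch.Size([2, 36, 768])
```

The non-sharing variant would simply instantiate two separate attention modules, one per direction, at roughly twice the parameter cost for this sub-layer.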
Sorry for the bother.