I want to implement cross-attention, for example with image features as Q and text features as K and V, but the feature_map computed by dwc (from V) then no longer matches the shape of x. Do you have any insights on this?

LeapLabTHU / FLatten-Transformer

Official repository of FLatten Transformer (ICCV2023)


Issue #19

Closed: JinchaoChen112 closed this issue 8 months ago

tian-qing001 commented 8 months ago

Hi @JinchaoChen112, thank you very much for your attention to our work. The purpose of DWC is to maintain feature diversity. For cross-attention, you can change the FLatten formula from $$O=\phi_p(Q){\phi_p(K)}^TV+{\rm DWC}(V)$$ to $$O=\phi_p(Q){\phi_p(K)}^TV+{\rm DWC}(Q).$$ Since $O$ has one row per query token, ${\rm DWC}(Q)$ always has the same shape as the attention output, whereas ${\rm DWC}(V)$ has one row per key/value token. This adjustment retains feature diversity while avoiding the size mismatch.
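For anyone landing here later, a minimal PyTorch sketch of what the suggestion above could look like. This is not the repository's code: the class name `FocusedLinearCrossAttention`, its arguments, the single-head layout, and the row normalization term are all assumptions made for illustration (the formula above omits normalization). The only point it tries to show is that the depthwise convolution runs on Q reshaped to the image grid, so its output has N tokens regardless of the text length M.

```python
import torch
import torch.nn as nn


class FocusedLinearCrossAttention(nn.Module):
    """Hypothetical single-head sketch: linear cross-attention with DWC applied to Q."""

    def __init__(self, dim, kv_dim, focusing_factor=3, kernel_size=5):
        super().__init__()
        self.q = nn.Linear(dim, dim)          # queries from image tokens
        self.kv = nn.Linear(kv_dim, 2 * dim)  # keys and values from text tokens
        self.proj = nn.Linear(dim, dim)
        self.focusing_factor = focusing_factor
        # Depthwise convolution: groups == channels, padding keeps H x W unchanged.
        self.dwc = nn.Conv2d(dim, dim, kernel_size,
                             padding=kernel_size // 2, groups=dim)

    def focus(self, x, eps=1e-6):
        # phi_p(x): ReLU, element-wise power p, rescaled to the original norm.
        x = torch.relu(x) + eps
        norm = x.norm(dim=-1, keepdim=True)
        x = x ** self.focusing_factor
        return x / x.norm(dim=-1, keepdim=True) * norm

    def forward(self, img, txt, H, W):
        B, N, C = img.shape                    # N must equal H * W
        q = self.q(img)                        # (B, N, C)
        k, v = self.kv(txt).chunk(2, dim=-1)   # (B, M, C) each
        q_f, k_f = self.focus(q), self.focus(k)

        kv = k_f.transpose(1, 2) @ v           # (B, C, C), linear in M
        # Row normalization (an assumption here; not part of the formula above).
        k_sum = k_f.sum(dim=1, keepdim=True).transpose(1, 2)  # (B, C, 1)
        z = 1.0 / (q_f @ k_sum + 1e-6)         # (B, N, 1)
        out = (q_f @ kv) * z                   # (B, N, C)

        # DWC(Q): reshape the N query tokens to the image grid, convolve, flatten back.
        q_map = q.transpose(1, 2).reshape(B, C, H, W)
        out = out + self.dwc(q_map).flatten(2).transpose(1, 2)  # (B, N, C)
        return self.proj(out)


attn = FocusedLinearCrossAttention(dim=32, kv_dim=16)
img = torch.randn(2, 8 * 8, 32)   # 64 image tokens
txt = torch.randn(2, 5, 16)       # 5 text tokens
out = attn(img, txt, H=8, W=8)    # shape (2, 64, 32)
```

With DWC(V) the same step would need the 5 text tokens to be reshaped to an H x W grid, which is where the mismatch in the question comes from.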