lartpang opened this issue 1 year ago
Hi @lartpang, thank you for your insights. SwiftFormer and MobileViT-v2 do compute the interactions in a somewhat similar way, and we have already shown that in the attention comparison figure. However, there are two major differences:
(1) We build on additive attention, where a learnable vector (`self.w_g`) learns where to attend. There are no learnable weights inside the linear attention of MobileViT-v2.
(2) We eliminate the need for a third interaction (called the "KV interaction" in the paper). In MobileViT-v2, the attention weights (the "context vector") are shared with the input through a third branch, V. In our case, we replace this interaction with a linear transformation and a skip connection with the Q matrix. The skip connection shares the global context weights with the input Q instead of requiring a third branch; a minimal tensor sketch follows below.
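To make point (2) concrete, here is a minimal tensor sketch of the two choices. Shapes, names, and the omission of the projection layers are illustrative simplifications, not the exact repo code:

```python
import torch
import torch.nn.functional as F

B, N, d = 2, 196, 256
Q = F.normalize(torch.randn(B, N, d), dim=-1)  # query tokens
K = F.normalize(torch.randn(B, N, d), dim=-1)  # key tokens
w_g = torch.randn(d, 1)                        # learnable attention vector

# Pool the queries into one global context vector G using the learned scores.
A = F.normalize(Q @ w_g * d ** -0.5, dim=1)    # (B, N, 1) per-token scores
G = torch.sum(A * Q, dim=1, keepdim=True)      # (B, 1, d) global context

# MobileViT-v2 (schematic): share G through a third branch V,
#   out = proj(relu(V) * G)
# SwiftFormer: the skip connection with Q replaces that third branch:
out = G * K + Q                                # (B, N, d)
```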
To summarize, there is a common factor between them, which we already show in the attention comparison figure, but there are two major differences.
I hope it is clear now.
Best regards, Abdelrahman.
Although the concept of a "value" appears neither in the paper's description nor in the code implementation, the computation is actually very similar to the interaction in MobileViT-v2.
As shown below, I have commented and organized the author's code.
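Below is my commented reconstruction of the `EfficientAdditiveAttention` module, lightly paraphrased from the SwiftFormer repository (default dimensions are illustrative, and I use a broadcast in place of the repo's `einops.repeat`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EfficientAdditiveAttention(nn.Module):
    """SwiftFormer's efficient additive attention (commented sketch).

    Input:  x of shape (B, N, in_dims)
    Output: tensor of shape (B, N, token_dim)
    """

    def __init__(self, in_dims=512, token_dim=256, num_heads=2):
        super().__init__()
        d = token_dim * num_heads
        # Only two projections, "query" and "key". There is no explicit V branch.
        self.to_query = nn.Linear(in_dims, d)
        self.to_key = nn.Linear(in_dims, d)
        # Learnable attention vector: scores each query token via a dot product.
        self.w_g = nn.Parameter(torch.randn(d, 1))
        self.scale_factor = token_dim ** -0.5
        self.proj = nn.Linear(d, d)
        self.final = nn.Linear(d, token_dim)

    def forward(self, x):
        query = F.normalize(self.to_query(x), dim=-1)  # (B, N, d)
        key = F.normalize(self.to_key(x), dim=-1)      # (B, N, d)

        # Learned per-token scores: (B, N, d) @ (d, 1) -> (B, N, 1).
        A = F.normalize(query @ self.w_g * self.scale_factor, dim=1)

        # Global query vector: score-weighted sum over the tokens -> (B, d).
        G = torch.sum(A * query, dim=1)

        # Broadcast G onto every key token. Note that "key" is consumed the way
        # a value matrix usually is: modulated by the pooled attention weights.
        out = self.proj(G.unsqueeze(1) * key) + query  # skip connection with Q

        return self.final(out)  # (B, N, token_dim)
```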
As we can see, this implicitly folds the Q-K interaction into Q's own transformation: the global vector pooled from Q modulates the "key", so the "key" in the code behaves more like a "value".
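For comparison, here is a minimal sketch of MobileViT-v2's separable self-attention as I read the paper (class and variable names are mine): the pooled context vector plays the role of `G` above, and the explicit V branch receives it the way SwiftFormer's "key" does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SeparableSelfAttention(nn.Module):
    """MobileViT-v2-style separable self-attention (minimal sketch).

    Input:  x of shape (B, N, dim)
    Output: tensor of shape (B, N, dim)
    """

    def __init__(self, dim=256):
        super().__init__()
        # Three branches: I (token scores), K, and an explicit V.
        self.to_i = nn.Linear(dim, 1)   # no learnable gate like SwiftFormer's w_g
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # Context scores over the tokens: (B, N, 1).
        scores = F.softmax(self.to_i(x), dim=1)
        # Context vector: score-weighted sum of the keys -> (B, 1, dim).
        context = torch.sum(scores * self.to_k(x), dim=1, keepdim=True)
        # The global context is shared with the input through the V branch,
        # not through a skip connection with Q.
        return self.proj(F.relu(self.to_v(x)) * context)
```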