Closed: wfz666 closed this issue 1 week ago
The module used by the `self.self_attn` variable is a general-purpose implementation of the attention calculation, which can do self-attention or cross-attention depending on the q/k/v inputs. That's also where the W_q@x calculation is done, for example.

So in this case, since they use `tgt2` for all three inputs (q, k, and v), it corresponds to doing self-attention. They use the same module to do cross-attention by setting the q input to something different from k and v.
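To make that concrete, here is a minimal sketch assuming the module behaves like torch.nn.MultiheadAttention (a common choice for `self.self_attn` in DETR-style code; the shapes below are made up for illustration). The W_q/W_k/W_v projections live inside the module, and the same class covers both self-attention and cross-attention depending on what you pass as query, key, and value:

```python
import torch
import torch.nn as nn

# Sketch of a general-purpose attention module: the W_q / W_k / W_v
# projections are applied inside the module, so the caller only decides
# which tensors play the roles of query, key and value.
embed_dim, num_heads = 256, 8
attn = nn.MultiheadAttention(embed_dim, num_heads)  # expects (seq, batch, embed)

tgt2 = torch.randn(100, 2, embed_dim)    # decoder-side features (hypothetical shapes)
memory = torch.randn(500, 2, embed_dim)  # e.g. encoder output (hypothetical)

# Self-attention: the same tensor is passed as query, key and value;
# internally the module still computes W_q@tgt2, W_k@tgt2 and W_v@tgt2.
self_out, _ = attn(query=tgt2, key=tgt2, value=tgt2)

# Cross-attention: only the query comes from tgt2; key and value come from memory.
cross_out, _ = attn(query=tgt2, key=memory, value=memory)
```

In practice a decoder layer usually holds two separate attention instances (one for self-attention, one for cross-attention) so each learns its own projection weights; the point is that a single class handles both patterns.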
Thanks for your answer, but I still don't quite understand. Why are q, k, and v assigned directly here instead of being computed by matrix multiplication? In particular, why is v also set by direct assignment rather than by something like q@k? Is my understanding correct that this is just initialization, and backpropagation modifies the values later? I don't know much about ViT, so this may be a naive question, sorry.
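For reference, a short sketch of where the learnable matrices actually live, again assuming `self.self_attn` is a torch.nn.MultiheadAttention: the q/k/v arguments are just input activations, while W_q/W_k/W_v are parameters stored inside the module, and those parameters are what backpropagation updates.

```python
import torch
import torch.nn as nn

# The q/k/v arguments are activations; the learnable W_q/W_k/W_v live inside
# the module as parameters and are what backpropagation updates.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)

# Packed q/k/v projection weights, shape (3*embed_dim, embed_dim).
print(attn.in_proj_weight.shape)          # torch.Size([768, 256])
print(attn.in_proj_weight.requires_grad)  # True

x = torch.randn(10, 2, 256)               # hypothetical (seq, batch, embed) input
out, _ = attn(x, x, x)                     # q = k = v = x; projections happen inside
out.sum().backward()
print(attn.in_proj_weight.grad is not None)  # True: gradients reach W_q/W_k/W_v
```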
Ah, I get what you mean, thanks!
Original issue (title: Self-Attention), opened by wfz666:

def _forward_sa(self, tgt, query_pos):

I am confused about this code for the self-attention mechanism. From the formula, q and k are W_q@x and W_k@x, and v is generated from q and k, so why is it assigned directly here?
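For reference, in the standard scaled dot-product attention used in Transformers, v is not derived from q and k; all three are independent linear projections of the same input, and those projection matrices are the learned weights held inside the attention module:

```latex
% q, k and v are all projections of the same input x; v is NOT computed
% from q and k. W_q, W_k, W_v are the learnable parameters that the
% attention module holds internally.
\[
q = W_q x, \qquad k = W_k x, \qquad v = W_v x
\]
\[
\operatorname{Attention}(q, k, v) = \operatorname{softmax}\!\left(\frac{q k^{\top}}{\sqrt{d_k}}\right) v
\]
```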