facebookresearch / sam2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.

Ask about attention #403

Closed · wfz666 closed this 1 week ago

wfz666 commented 2 weeks ago

```python
def _forward_sa(self, tgt, query_pos):
    # Self-Attention
    tgt2 = self.norm1(tgt)
    q = k = tgt2 + query_pos if self.pos_enc_at_attn else tgt2
    tgt2 = self.self_attn(q, k, v=tgt2)
    tgt = tgt + self.dropout1(tgt2)
    return tgt
```

I am confused by this self-attention code. From the formula, q and k should be W_q@x and W_k@x, and I thought v was generated from q and k, so why are they assigned directly here?

heyoeyo commented 2 weeks ago

The module referenced by the self.self_attn variable is a general-purpose implementation of the attention calculation, which can do either self-attention or cross-attention depending on the q/k/v inputs. The W_q@x calculation, for example, happens inside that module.

So in this case, since they use tgt2 for all three inputs (q, k & v), it corresponds to doing self-attention. They use the same module to do cross-attention by setting the q input to something different from k & v.
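
To make this concrete, here is a minimal single-head sketch of such a general-purpose attention module in plain PyTorch (an illustration only, not the actual SAM 2 implementation; the class and variable names here are made up). The W_q/W_k/W_v multiplications live inside the module as learned nn.Linear layers, and the caller only chooses which tensors to feed in:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Minimal single-head attention (illustrative sketch, not SAM 2's code)."""

    def __init__(self, dim):
        super().__init__()
        # W_q, W_k, W_v and the output projection are learned parameters
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, q, k, v):
        # The matrix calculations happen here, inside the module;
        # the caller only supplies the inputs to these projections.
        q = self.q_proj(q)  # W_q @ q_input
        k = self.k_proj(k)  # W_k @ k_input
        v = self.v_proj(v)  # W_v @ v_input
        attn = F.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)
        return self.out_proj(attn @ v)

attn = Attention(dim=256)
x = torch.randn(2, 100, 256)    # e.g. the normalized tokens tgt2
mem = torch.randn(2, 50, 256)   # some other feature source

self_out = attn(x, x, x)        # self-attention: all three inputs are x
cross_out = attn(x, mem, mem)   # cross-attention: q differs from k & v
```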

wfz666 commented 1 week ago

> The module referenced by the self.self_attn variable is a general-purpose implementation of the attention calculation, which can do either self-attention or cross-attention depending on the q/k/v inputs. The W_q@x calculation, for example, happens inside that module.
>
> So in this case, since they use tgt2 for all three inputs (q, k & v), it corresponds to doing self-attention. They use the same module to do cross-attention by setting the q input to something different from k & v.

Thanks for your answer, but I still don't quite understand. Why are q, k, and v assigned directly here instead of being computed through matrix multiplication? In particular, why is v also set by direct assignment rather than by something like q@k? Is my understanding correct that this is just the initialization, and backpropagation modifies it later? I don't know much about ViT, so this may be a silly question, sorry.
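
For context, the q/k/v projections inside such an attention module are ordinary learned parameters: they are initialized randomly and then updated by backpropagation like any other layer, and v is never computed from q and k. A quick way to inspect this, assuming a standard PyTorch module such as torch.nn.MultiheadAttention (which may differ from the module SAM 2 actually uses):

```python
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)

# The q/k/v projection matrices are ordinary trainable parameters,
# updated by the optimizer during backpropagation:
for name, p in attn.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)
# in_proj_weight (768, 256) True   <- W_q, W_k, W_v stacked together
# in_proj_bias (768,) True
# out_proj.weight (256, 256) True
# out_proj.bias (256,) True
```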

wfz666 commented 1 week ago

Ah, I get what you mean, thanks!