facebookresearch / sam2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.

Ask about attention #403

Closed · wfz666 closed this 1 week ago

wfz666 commented 2 weeks ago

```python
def _forward_sa(self, tgt, query_pos):
    # Self-Attention
    tgt2 = self.norm1(tgt)
    q = k = tgt2 + query_pos if self.pos_enc_at_attn else tgt2
    tgt2 = self.self_attn(q, k, v=tgt2)
    tgt = tgt + self.dropout1(tgt2)
    return tgt
```

I am confused by this self-attention code. From the formula, q and k should be W_q@x and W_k@x, and I thought v was generated from q and k, so why are they assigned directly here?

heyoeyo commented 2 weeks ago

The module referenced by the self.self_attn variable is a general-purpose implementation of the attention calculation, which can do either self-attention or cross-attention depending on the q/k/v inputs. The W_q@x calculation, for example, happens inside that module.

So in this case, since they use tgt2 for all three inputs (q, k & v), it corresponds to doing self-attention. They use the same module to do cross-attention by setting the q input to something different from k & v.
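
To make this concrete, here is a minimal single-head sketch of such a general-purpose attention module in plain PyTorch (an illustration only, not the actual SAM 2 implementation; the class and variable names here are made up). The W_q/W_k/W_v multiplications live inside the module as learned nn.Linear layers, and the caller only chooses which tensors to feed in:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Minimal single-head attention (illustrative sketch, not SAM 2's code)."""

    def __init__(self, dim):
        super().__init__()
        # W_q, W_k, W_v and the output projection are learned parameters
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, q, k, v):
        # The matrix calculations happen here, inside the module;
        # the caller only supplies the inputs to these projections.
        q = self.q_proj(q)  # W_q @ q_input
        k = self.k_proj(k)  # W_k @ k_input
        v = self.v_proj(v)  # W_v @ v_input
        attn = F.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)
        return self.out_proj(attn @ v)

attn = Attention(dim=256)
x = torch.randn(2, 100, 256)    # e.g. the normalized tokens tgt2
mem = torch.randn(2, 50, 256)   # some other feature source

self_out = attn(x, x, x)        # self-attention: all three inputs are x
cross_out = attn(x, mem, mem)   # cross-attention: q differs from k & v
```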

wfz666 commented 1 week ago

> The module referenced by the self.self_attn variable is a general-purpose implementation of the attention calculation, which can do either self-attention or cross-attention depending on the q/k/v inputs. The W_q@x calculation, for example, happens inside that module.
>
> So in this case, since they use tgt2 for all three inputs (q, k & v), it corresponds to doing self-attention. They use the same module to do cross-attention by setting the q input to something different from k & v.

Thanks for your answer, but I still don't quite understand. Why are q, k, and v assigned directly here instead of being computed through matrix multiplication? In particular, why is v also set by direct assignment rather than by something like q@k? Is my understanding correct that this is just the initialization, and backpropagation modifies it later? I don't know much about ViT, so this may be a silly question, sorry.
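
For context, the q/k/v projections inside such an attention module are ordinary learned parameters: they are initialized randomly and then updated by backpropagation like any other layer, and v is never computed from q and k. A quick way to inspect this, assuming a standard PyTorch module such as torch.nn.MultiheadAttention (which may differ from the module SAM 2 actually uses):

```python
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)

# The q/k/v projection matrices are ordinary trainable parameters,
# updated by the optimizer during backpropagation:
for name, p in attn.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)
# in_proj_weight (768, 256) True   <- W_q, W_k, W_v stacked together
# in_proj_bias (768,) True
# out_proj.weight (256, 256) True
# out_proj.bias (256,) True
```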

wfz666 commented 1 week ago

Ah, I get what you mean, thanks!