Closed tamohannes closed 1 year ago
I would modify some of the dimension names to make them clearer. Something like:

- B: batch size (I wouldn't add this)
- C: context sequence length
- E: embedding dimension
- H: hidden dimension
- Q: query sequence length

I would also make the drawings about cross-attention, using a different Q and C, and then just mention that self-attention uses the same input for both query and context.
Also, there are two small errors in the equation: the operation inside the softmax is $Q \cdot K^T$, and its dimension should be $B \times Q \times C$. Then comes $\mathrm{softmax}(\cdot) \cdot V$, and you end up with $B \times Q \times H$. A minimal sketch of this dimension flow is given below.
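As a hedged illustration of the shapes described above, here is a minimal cross-attention sketch in PyTorch. The dimension names (B, Q, C, E, H) follow the convention suggested in the first comment; the concrete sizes, projection matrices, and the standard $1/\sqrt{H}$ scaling are my own assumptions for the example, not part of the PR.

```python
import torch

# Assumed dimension names (reviewer's suggestion):
# B: batch size, Q: query sequence length, C: context sequence length,
# E: embedding dimension, H: hidden dimension. Sizes are arbitrary.
B, Q, C, E, H = 2, 5, 7, 16, 32

query_input = torch.randn(B, Q, E)    # query sequence
context_input = torch.randn(B, C, E)  # context sequence; equals query_input in self-attention

# Hypothetical projection matrices that form the Q, K, and V matrices
W_q = torch.randn(E, H)
W_k = torch.randn(E, H)
W_v = torch.randn(E, H)

Q_mat = query_input @ W_q    # B x Q x H
K_mat = context_input @ W_k  # B x C x H
V_mat = context_input @ W_v  # B x C x H

# Inside the softmax: Q . K^T -> B x Q x C
# (the 1/sqrt(H) scaling is a standard convention, not mentioned in the comment)
scores = Q_mat @ K_mat.transpose(-2, -1) / (H ** 0.5)
weights = torch.softmax(scores, dim=-1)  # B x Q x C

# softmax(.) . V -> B x Q x H
output = weights @ V_mat
print(output.shape)  # torch.Size([2, 5, 32])
```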
Added 2 figures to illustrate the formation of the K, Q, and V matrices and to showcase the attention operation. Explained the key, query, and value terms with the analogy of a retrieval mechanism.