LxMLS / lxmls-guide

Lisbon Machine Learning Summer School Lab Guide

[feat] Break down attention mechanism. Add figures #147

Closed tamohannes closed 1 year ago

tamohannes commented 1 year ago

Added 2 figures to illustrate how the K, Q, and V matrices are formed and to showcase the attention operation. Explained the key, query, and value terms using the analogy of a retrieval mechanism.

israfelsr commented 1 year ago

I would modify some of the names of the dimensions to make them clearer. Something like:

- B: batch size (I wouldn't add this)
- C: context sequence length
- E: embedding dimension
- H: hidden dimension
- Q: query sequence length

I would also make the drawings about cross-attention, using a different Q and C, and then just mention that self-attention uses the same input for both query and context.

Also, there are two small errors in the equation: the operation inside the softmax is $Q \cdot K^T$, whose dimension should be $B \times Q \times C$. Then it is $\mathrm{softmax}(\cdot) \cdot V$, and you end up with $B \times Q \times H$.
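
For concreteness, here is a minimal NumPy sketch of the corrected shapes (the sizes and weight matrices are illustrative, not taken from the figures; the dimension names follow the B, Q, C, E, H convention suggested above):

```python
import numpy as np

# Illustrative sizes: B = batch, Q = query length, C = context length,
# E = embedding dim, H = hidden dim.
B, Q, C, E, H = 2, 5, 7, 16, 8

rng = np.random.default_rng(0)
query_input = rng.normal(size=(B, Q, E))    # queries come from one sequence
context_input = rng.normal(size=(B, C, E))  # keys/values from another (cross-attention)

W_q = rng.normal(size=(E, H))
W_k = rng.normal(size=(E, H))
W_v = rng.normal(size=(E, H))

Qm = query_input @ W_q      # B x Q x H
Km = context_input @ W_k    # B x C x H
Vm = context_input @ W_v    # B x C x H

scores = Qm @ Km.transpose(0, 2, 1) / np.sqrt(H)      # B x Q x C
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)              # softmax over the context axis
output = weights @ Vm                                  # B x Q x H

print(scores.shape, output.shape)  # (2, 5, 7) (2, 5, 8)
# Self-attention is the special case context_input = query_input (so Q == C).
```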