apple / ml-cvnets

CVNets: A library for training computer vision networks
https://apple.github.io/ml-cvnets

# Operations for Self-Attention Layer #17

Closed mmaaz60 closed 2 years ago

mmaaz60 commented 2 years ago

Hi @sacmehta, thanks for the great work!

https://github.com/apple/ml-cvnets/blob/d38a116fe134a8cd5db18670764fdaafd39a5d4f/cvnets/layers/multi_head_attention.py#L125

```python
# number of operations in QK^T
m_qk = (seq_len * in_channels * in_channels)
```

As per the code above, the MAdds for QK^T are calculated as L×C×C, where L and C are the sequence length and the number of channels respectively. But the QK^T product actually produces an L×L Gram matrix: we need to compute L×L elements, and each element is a dot product over C channels, costing C operations. So shouldn't the MAdds be L×L×C instead?

Thanks, please correct me if I am wrong.
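The counting argument above can be sketched as a quick comparison. This is a minimal illustration, not code from the library; the function names and the example sizes are my own:

```python
def madds_qkt_reported(seq_len: int, in_channels: int) -> int:
    # formula in the snippet above: L * C * C
    return seq_len * in_channels * in_channels

def madds_qkt_corrected(seq_len: int, in_channels: int) -> int:
    # Q @ K^T yields an L x L matrix; each entry is a length-C dot product,
    # so the count is L * L * C
    return seq_len * seq_len * in_channels

# hypothetical sizes for illustration only
print(madds_qkt_reported(256, 64))   # 1048576
print(madds_qkt_corrected(256, 64))  # 4194304
```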

sacmehta commented 2 years ago

Hi @mmaaz60 , Thanks for noting it. We have fixed it.

Note that the FLOPs with the fixed equation are slightly fewer than the reported number, because in_channels is greater than seq_len for MobileViT.
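To see why the corrected count comes out lower when in_channels exceeds seq_len, here is a quick check with hypothetical sizes (the numbers are illustrative, not MobileViT's actual configuration):

```python
seq_len, in_channels = 64, 96  # assumed sizes with C > L, as in MobileViT

original = seq_len * in_channels * in_channels   # L * C * C (old formula)
corrected = seq_len * seq_len * in_channels      # L * L * C (fixed formula)

# With C > L, the L * L * C count is smaller than L * C * C
print(original, corrected, corrected < original)  # 589824 393216 True
```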