Closed · deepzlk closed this issue 3 years ago
```python
attn = (q @ k.transpose(-2, -1)) * self.scale
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
```
Although there is no explicit concatenation in the code, the lines above do implement multi-head self-attention: `q`, `k`, and `v` already carry a separate head dimension (shape `(B, num_heads, N, C // num_heads)`), so the batched matmuls attend over all heads in parallel, and the final `transpose(1, 2).reshape(B, N, C)` merges the heads back together, which is equivalent to concatenating the per-head outputs.
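Here is a minimal standalone sketch (shapes `B`, `H`, `N`, `head_dim` are illustrative, not taken from the repo) showing why the `transpose` + `reshape` gives the same result as an explicit `torch.cat` over the head outputs:

```python
import torch

# Illustrative sizes: B batches, H heads, N tokens, head_dim per-head channels.
B, H, N, head_dim = 2, 8, 4, 16
C = H * head_dim

q = torch.randn(B, H, N, head_dim)
k = torch.randn(B, H, N, head_dim)
v = torch.randn(B, H, N, head_dim)

scale = head_dim ** -0.5

# The head dimension H acts as an extra batch dimension,
# so all H heads are attended to in parallel.
attn = (q @ k.transpose(-2, -1)) * scale          # (B, H, N, N)
attn = attn.softmax(dim=-1)
out = attn @ v                                    # (B, H, N, head_dim)

# Merging heads via transpose + reshape ...
merged = out.transpose(1, 2).reshape(B, N, C)     # (B, N, C)

# ... matches an explicit concatenation of the per-head outputs.
explicit = torch.cat([out[:, h] for h in range(H)], dim=-1)  # (B, N, C)
print(torch.allclose(merged, explicit))           # True
```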
It seems that the multi-head attention does not actually implement `num_heads=8`?