BerenLuthien opened this issue 5 years ago
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

"In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model / h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality."
----from Attention Is All You Need
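
For context, the same section of the paper (not quoted above) defines each head as attention over learned linear projections, which is where the reduced per-head dimension d_k = d_v = d_model / h comes from:

```latex
\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}\right),
\quad
W_i^{Q} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k},\;
W_i^{K} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k},\;
W_i^{V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}
```

So each head works on a learned d_k-dimensional projection of the full d_model-dimensional vectors, and concatenating the h heads gives back h * d_k = d_model dimensions before the final W^O.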
https://github.com/codertimo/BERT-pytorch/blob/d10dc4f9d5a6f2ca74380f62039526eb7277c671/bert_pytorch/model/attention/multi_head.py#L15
It looks like self.d_k = d_model // h, i.e. the embedding size 768 divided by the number of heads 12 gives 64.
Why are the 768-dimensional [q, k, v] projected down to 64-dimensional embeddings?
Reference: http://nlp.seas.harvard.edu/2018/04/03/attention.html (I added some comments on the tensor shapes):
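
To make the shapes concrete, here is a minimal sketch (not the exact code from the repo or the Harvard post; the class name MultiHeadSketch is made up for illustration), assuming BERT-base sizes d_model = 768 and h = 12. Each 768-dim input is projected and split into 12 heads of 64 dims, and concatenating the heads gives 12 * 64 = 768 again, so the full dimensionality is partitioned across heads rather than discarded:

```python
import torch
import torch.nn as nn

class MultiHeadSketch(nn.Module):
    def __init__(self, d_model=768, h=12):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h       # d_k = 768 // 12 = 64
        self.w_q = nn.Linear(d_model, d_model)   # one 768 -> 768 projection per input;
        self.w_k = nn.Linear(d_model, d_model)   # the result is then split into
        self.w_v = nn.Linear(d_model, d_model)   # 12 heads of 64 dims each
        self.w_o = nn.Linear(d_model, d_model)   # W^O applied after the concat

    def forward(self, x):
        b, t, _ = x.shape                        # (batch, seq_len, 768)

        def split(y):                            # (b, t, 768) -> (b, 12, t, 64)
            return y.view(b, t, self.h, self.d_k).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))

        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # (b, 12, t, t)
        attn = scores.softmax(dim=-1)
        heads = attn @ v                                      # (b, 12, t, 64)

        # Concat(head_1, ..., head_12): 12 * 64 = 768, back to d_model
        concat = heads.transpose(1, 2).contiguous().view(b, t, self.h * self.d_k)
        return self.w_o(concat)                               # (b, t, 768)

x = torch.randn(2, 5, 768)
print(MultiHeadSketch()(x).shape)   # torch.Size([2, 5, 768])
```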