Closed PanXiebit closed 6 years ago
em.. maybe I see the difference between your implementation and the paper.
https://github.com/brightmart/text_classification/blob/a01c5ab3cb11e53beb966bc362302f0679f47d94/a07_Transformer/a2_multi_head_attention.py#L44
The projection in your code has d_model units, while in the paper it is d_k, so you split them into heads later, right?
i think so.
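To make the equivalence concrete, here is a minimal numpy sketch (toy dimensions, not taken from the repo) showing that one combined projection to d_model followed by a split into h heads gives exactly the same result as h separate d_k-wide projections, as long as d_model = h * d_k:

```python
import numpy as np

# Assumed toy dimensions (illustrative only): d_model = h * d_k
h, d_k = 4, 8
d_model = h * d_k
np.random.seed(0)

x = np.random.randn(10, d_model)          # (seq_len, d_model)
W_q = np.random.randn(d_model, d_model)   # one combined query projection

# Code style: project once to d_model, then split into h heads of width d_k.
q_combined = x @ W_q                      # (10, d_model)
q_split = q_combined.reshape(10, h, d_k)  # (10, h, d_k)

# Paper style: h separate matrices W_i^Q of shape (d_model, d_k).
# Slicing the combined matrix column-wise recovers each W_i^Q.
q_per_head = np.stack(
    [x @ W_q[:, i * d_k:(i + 1) * d_k] for i in range(h)], axis=1
)                                         # (10, h, d_k)

print(np.allclose(q_split, q_per_head))   # → True
```

So splitting after a single d_model-wide projection is just a more efficient way of computing the paper's h separate d_k-wide projections in one matrix multiply.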
https://github.com/brightmart/text_classification/blob/a01c5ab3cb11e53beb966bc362302f0679f47d94/a07_Transformer/a2_multi_head_attention.py#L68
Excuse me, in the paper "Attention Is All You Need", each single head is scaled dot-product attention, right? And the formulation of one head in the paper is:
$$head_{i} = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
with $W_i^Q \in \mathbb{R}^{d_{model}\times d_k}$, and so on.
So I think there is no need to split d_model across the heads, and I thought every head works on d_model. In the source code of tensor2tensor from Google, I haven't seen the split in the scaled dot-product attention implementation. Maybe it is just difficult for me to read; I am a newcomer to NLP.
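For reference, here is a minimal numpy sketch of the paper's scaled dot-product attention for a single head, assuming Q, K, V have already been projected down to width d_k (toy dimensions, not from the repo or tensor2tensor):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in the paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_q, seq_k)
    # Numerically stable row-wise softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # (seq_q, d_v)

# One toy head with d_k = d_v = 8 (illustrative values).
np.random.seed(0)
Q = np.random.randn(5, 8)
K = np.random.randn(5, 8)
V = np.random.randn(5, 8)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 8) — each head outputs d_v, not d_model
```

Note the attention itself is dimension-agnostic: the split (or not) happens in the projections before this function, which is why implementations can differ here while computing the same thing.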
Hoping for your reply, thank you!