Closed PanXiebit closed 6 years ago
em.. maybe I see the difference between your implementation and the paper.
https://github.com/brightmart/text_classification/blob/a01c5ab3cb11e53beb966bc362302f0679f47d94/a07_Transformer/a2_multi_head_attention.py#L44
The projection in your code has d_model units, while in the paper it is d_k, so you split them into heads later, right?
i think so.
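To make the equivalence concrete, here is a minimal numpy sketch (toy dimensions, not taken from the repo) showing that one combined projection to d_model followed by a split into h heads gives exactly the same result as h separate d_k-wide projections, as long as d_model = h * d_k:

```python
import numpy as np

# Assumed toy dimensions (illustrative only): d_model = h * d_k
h, d_k = 4, 8
d_model = h * d_k
np.random.seed(0)

x = np.random.randn(10, d_model)          # (seq_len, d_model)
W_q = np.random.randn(d_model, d_model)   # one combined query projection

# Code style: project once to d_model, then split into h heads of width d_k.
q_combined = x @ W_q                      # (10, d_model)
q_split = q_combined.reshape(10, h, d_k)  # (10, h, d_k)

# Paper style: h separate matrices W_i^Q of shape (d_model, d_k).
# Slicing the combined matrix column-wise recovers each W_i^Q.
q_per_head = np.stack(
    [x @ W_q[:, i * d_k:(i + 1) * d_k] for i in range(h)], axis=1
)                                         # (10, h, d_k)

print(np.allclose(q_split, q_per_head))   # → True
```

So splitting after a single d_model-wide projection is just a more efficient way of computing the paper's h separate d_k-wide projections in one matrix multiply.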
https://github.com/brightmart/text_classification/blob/a01c5ab3cb11e53beb966bc362302f0679f47d94/a07_Transformer/a2_multi_head_attention.py#L68
Excuse me, in the paper "Attention Is All You Need", each single head is scaled dot-product attention, right? And the formulation of one head in the paper is:
$$head_{i} = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
with $W_i^Q \in \mathbb{R}^{d_{model}\times d_k}$, and so on.
So I think there is no need to split d_model across the heads, and I thought every head works on d_model. In the source code of tensor2tensor from Google, I haven't seen the split in the scaled dot-product attention implementation. Maybe it is just difficult for me to read; I am a newcomer to NLP.
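For reference, here is a minimal numpy sketch of the paper's scaled dot-product attention for a single head, assuming Q, K, V have already been projected down to width d_k (toy dimensions, not from the repo or tensor2tensor):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in the paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_q, seq_k)
    # Numerically stable row-wise softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # (seq_q, d_v)

# One toy head with d_k = d_v = 8 (illustrative values).
np.random.seed(0)
Q = np.random.randn(5, 8)
K = np.random.randn(5, 8)
V = np.random.randn(5, 8)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 8) — each head outputs d_v, not d_model
```

Note the attention itself is dimension-agnostic: the split (or not) happens in the projections before this function, which is why implementations can differ here while computing the same thing.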
Hoping for your reply, thank you!