What is the purpose of the scaling in scaled dot-product attention?

eubinecto / k4ji_ai

Four Kims, one Jin, and one Lim have gotten together to study artificial intelligence.


What is the purpose of the scaling in scaled dot-product attention? #41

Open eubinecto opened 4 years ago

eubinecto commented 4 years ago

3.2.1 scaled dot product attention

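The paper does explain why the square root is applied to d_k in the formula (to mitigate vanishing gradients), but it does not seem to explain well enough why such a scalar term is needed in the first place.

I wrote up some notes on this earlier, during a paper study group and again during the Sesame street paper study. Let's collect them here.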
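As a rough sketch of where the scalar term comes from: the formula in section 3.2.1 is written out below, followed by the variance argument from the paper's footnote. The assumption (the footnote's, not something shown to hold for trained models) is that the components of a query q and a key k are independent random variables with mean 0 and variance 1.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V

% Assumption: q_i, k_i independent, E[q_i] = E[k_i] = 0, Var(q_i) = Var(k_i) = 1.
q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \qquad
\mathbb{E}[q \cdot k] = 0, \qquad
\mathrm{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = d_k

\mathrm{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1
```

Under that assumption the standard deviation of the raw logits grows like \sqrt{d_k}, so dividing by \sqrt{d_k} (rather than by d_k) is what keeps the softmax inputs at unit variance regardless of dimension.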

teang1995 commented 4 years ago

The basic attention mechanism is simply a dot product between the query and the key. The size of the dot product tends to grow with the dimensionality of the query and key vectors though, so the Transformer rescales the dot product to prevent it from exploding into huge values.
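The growth described above can be checked with a small simulation. This is only a sketch under an assumption that is not in the thread: query and key components are drawn independently from a standard normal distribution. The function names (`sample_dot_products`, `std_raw_and_scaled`) are made up for illustration.

```python
import math
import random
import statistics


def dot(q, k):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(q, k))


def sample_dot_products(d_k, n_samples=2000, seed=0):
    """Draw random (q, k) pairs and return their raw dot products.

    Assumption: every component is an independent N(0, 1) sample.
    """
    rng = random.Random(seed)
    raw = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(dot(q, k))
    return raw


def std_raw_and_scaled(d_k, n_samples=2000, seed=0):
    """Return (std of q.k, std of q.k / sqrt(d_k)) over the samples."""
    raw = sample_dot_products(d_k, n_samples, seed)
    scaled = [x / math.sqrt(d_k) for x in raw]
    return statistics.pstdev(raw), statistics.pstdev(scaled)


for d_k in (4, 64, 256):
    raw_std, scaled_std = std_raw_and_scaled(d_k)
    print(f"d_k={d_k:4d}  std(q.k)={raw_std:6.2f}  "
          f"std(q.k/sqrt(d_k))={scaled_std:5.2f}")
```

The raw standard deviation should come out close to sqrt(d_k), while the scaled one should stay close to 1 for every dimension, which is the rescaling the quoted passage refers to.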

-reference : https://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/#:~:text=Scaled%20Dot%20Product%20Attention&text=are%20the%20matrices%20of%20keys,the%20query%20and%20the%20key.&text=%23%20scale%20the%20dot%20products%20by,why%20we%20do%20this!)

eubinecto commented 2 years ago

https://www.notion.so/2dd4ff87173244af8d2cd3242f6976f6#1e44ada3416a44339206a4cc8ea9b9c1

I wrote this up at the link above. Later, let's gather everything in this one place and organize it.

eubinecto commented 2 years ago

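Oh... I assumed nobody would care, so I was going to migrate the whole k4ji_ai repo into this repo.

It turns out that keeping a record can inspire and help someone.

In that case... I don't think the remaining issues need to be transferred here.

Even if it looks incomplete, even if it looks lacking: leave the records of the past in the past, and write future records looking toward the future. No need to try to change the past! :)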