Name of The Authors
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Year of Publication
2017
Summary
This paper is the first to build a sequence-to-sequence model entirely on attention, dispensing with recurrence and convolution. It sets new state-of-the-art BLEU scores on the WMT 2014 English-to-German (by over 2 BLEU) and English-to-French (by 0.7 BLEU) translation tasks. Because it removes computation that is inherently sequential (proportional to sequence length in RNNs, and to the number of layers needed to connect distant positions in CNNs), the model can be trained much faster through parallelization.
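The core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. Below is a minimal NumPy sketch (function and variable names are mine, not from the paper); note that every position attends to every other in a single matrix product, which is exactly where the parallelism over the sequence comes from.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, Eq. (1) of the paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (..., seq_q, seq_k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (..., seq_q, d_v)

# Toy self-attention: 4 tokens, model dimension 8
x = np.random.randn(4, 8)
out = scaled_dot_product_attention(x, x, x)  # all tokens attend to all tokens at once
print(out.shape)  # (4, 8)
```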
Contributions of The Paper
Sets new SOTA BLEU scores for WMT 2014 English-to-German and English-to-French translation
The proposed model is far more parallelizable, and therefore much faster to train
The proposed model handles long-range dependencies far better than RNNs or CNNs, since any two positions are connected by a constant number of operations
They performed extensive ablations over model hyperparameters to show that —
using 8 attention heads is a sweet spot; both more and fewer heads hurt translation quality (a minimal multi-head sketch follows below)
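To illustrate what the h = 8 heads look like, here is a hedged sketch of multi-head attention built on the `scaled_dot_product_attention` function above: d_model is split into h heads of dimension d_k = d_model / h (512 / 8 = 64 in the base model), each head attends independently, and the results are concatenated and projected. The weight shapes are my own simplification — the paper uses separate per-head projection matrices, which is equivalent to the fused (d_model, d_model) matrices used here.

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, h=8):
    """Multi-head attention sketch. Weight matrices are (d_model, d_model),
    fusing the paper's per-head projections into one matrix each."""
    seq_len, d_model = x.shape
    d_k = d_model // h                                   # 512 // 8 = 64 in the base model
    def split_heads(W):
        # Project, then reshape to (h, seq_len, d_k) so heads attend independently
        return (x @ W).reshape(seq_len, h, d_k).transpose(1, 0, 2)
    Q, K, V = split_heads(W_q), split_heads(W_k), split_heads(W_v)
    heads = scaled_dot_product_attention(Q, K, V)        # batched over the head axis
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                  # final output projection

# Toy usage
d_model, seq_len = 512, 10
rng = np.random.default_rng(0)
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
x = rng.standard_normal((seq_len, d_model))
print(multi_head_attention(x, W_q, W_k, W_v, W_o).shape)  # (10, 512)
```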
Publisher
Advances in Neural Information Processing Systems 30 (NIPS 2017)
Link to The Paper
https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
Comments
It would feel disrespectful to add any comment to a paper of this stature.