TL;DR
This study revisits the self-attention mechanism of the Transformer. Standard self-attention uses query-key dot products to capture interactions between tokens. Here, the authors instead compute each token's attention weights independently of the other tokens, or treat the attention weights as trainable parameters that do not depend on the input at all, and show that performance remains competitive with dot-product attention.
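The two variants described above could look roughly like the sketch below. This is my own minimal PyTorch illustration, not the authors' code: the module names `DenseSynthesizerAttention` and `RandomSynthesizerAttention`, the two-layer MLP, and the `max_len` parameter are assumptions. The first module predicts a token's attention row from that token alone; the second learns the attention matrix as a parameter shared across all inputs. Neither uses a query-key dot product.

```python
# Minimal sketch of per-token and parameterized attention (assumed names/sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseSynthesizerAttention(nn.Module):
    """Each token's attention row is predicted from that token alone."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        # Two-layer MLP mapping a token representation to max_len attention logits.
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, max_len),
        )
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        logits = self.proj(x)[:, :, :seq_len]   # (batch, seq_len, seq_len)
        attn = F.softmax(logits, dim=-1)         # no dot product between tokens
        return attn @ self.value(x)


class RandomSynthesizerAttention(nn.Module):
    """The attention matrix itself is a trainable parameter, independent of the input."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.logits = nn.Parameter(torch.randn(max_len, max_len) * 0.02)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        attn = F.softmax(self.logits[:seq_len, :seq_len], dim=-1)
        return attn @ self.value(x)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)                        # (batch, seq_len, d_model)
    print(DenseSynthesizerAttention(64, 128)(x).shape)   # torch.Size([2, 16, 64])
    print(RandomSynthesizerAttention(64, 128)(x).shape)  # torch.Size([2, 16, 64])
```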
Why it matters:
Paper URL
https://arxiv.org/abs/2005.00743
Submission Dates (yyyy/mm/dd)
Authors and institutions
Methods
Results
Comments