Hi @gyr66! Happy to know you found the repository useful.
About the attention mechanism, it's tricky, but let's take it one step at a time.
1º: The attention mechanism should (theoretically) be applied to the sequence itself (`hiddens`), and that is exactly what happens here: the attention weights are applied to the sequence. The sequence in this case is the GRU output of the projected feature maps.
2º: I said it's tricky because I found in my own training that using attention makes convergence slower, so it takes many more epochs to reach the same results, just like you saw, but I also got better final results when using it.
3º: You probably get faster results when using only the attention output itself because that works just like not using attention at all: from an optimization point of view it becomes a plain Linear layer.
This Attention layer changes the weights of the linear layer in such a way that the model pays attention to specific things in the sequence (the GRU output). That's why attention is always applied to some context. In this repository I applied multiplicative attention, but additive attention is also quite common to see.
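For reference, here is a minimal sketch of what I mean, assuming a PyTorch model; the class name, the shapes, and the way the scores are computed are illustrative assumptions, not this repository's exact code:

```python
import torch
import torch.nn as nn

class MultiplicativeAttention(nn.Module):
    # Illustrative sketch, not the repository's actual layer.
    def __init__(self, hidden_size: int):
        super().__init__()
        # Scores every feature of every timestep of the sequence.
        self.score = nn.Linear(hidden_size, hidden_size)

    def forward(self, hiddens: torch.Tensor) -> torch.Tensor:
        # hiddens: (batch, seq_len, hidden_size), the GRU output.
        # Softmax over the time dimension turns the scores into weights.
        attention = torch.softmax(self.score(hiddens), dim=1)
        # Multiplicative attention: the weights re-scale the sequence
        # itself, so the sequence content is preserved.
        return hiddens * attention

# Usage with made-up shapes:
gru_out = torch.randn(8, 40, 256)          # (batch, timesteps, hidden)
x = MultiplicativeAttention(256)(gru_out)  # same shape as gru_out
```

Returning `attention` on its own at the end of `forward` would discard `hiddens` entirely, which is the degenerate case from point 3º.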
So in summary:
- The attention weights must multiply the sequence (`x = hiddens * attention`); the attention output on its own carries no sequence content.
- Expect slower convergence with attention, so it needs more epochs, but it gives better final results.
- `x = attention` trains faster only because it degenerates into a plain linear layer.
Thanks for your detailed explanation! I am not familiar with multiplicative attention; that is probably why I was confused. Thanks a lot again!
Hi! I am very interested in this project and I have learned a lot from it. While browsing the code, I was confused by the line `x = hiddens * attention`. Should it be `x = attention` instead? I tried it on a captcha dataset and found that when using `x = hiddens * attention`, the accuracy is 1% after 20 epochs, while when using `x = attention`, the accuracy is about 77% after 20 epochs. Thank you!