Hi @gyr66! Happy to know you found the repository useful.
About the attention mechanism, it's tricky, but let's take it one step at a time.
1º: The attention mechanism should (theoretically) be applied to the sequence itself (`hiddens`), and that is exactly what happens here: the attention weights are applied to the sequence. The sequence in this case is the GRU output of the projected feature maps.
2º: I said it's tricky because I found in my own training that using attention makes convergence slower, so it takes many more epochs to reach the same results, just like you saw, but I also got better final results when using it.
3º: You probably get faster results when using only the attention output itself because that works just like not using attention at all: from an optimization point of view it becomes a plain Linear layer.
This Attention layer changes the weights of the linear layer in such a way that the model pays attention to specific things in the sequence (the GRU output). That's why attention is always applied to some context. In this repository I applied multiplicative attention, but additive attention is also quite common to see.
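For reference, here is a minimal sketch of what I mean, assuming a PyTorch model; the class name, the shapes, and the way the scores are computed are illustrative assumptions, not this repository's exact code:

```python
import torch
import torch.nn as nn

class MultiplicativeAttention(nn.Module):
    # Illustrative sketch, not the repository's actual layer.
    def __init__(self, hidden_size: int):
        super().__init__()
        # Scores every feature of every timestep of the sequence.
        self.score = nn.Linear(hidden_size, hidden_size)

    def forward(self, hiddens: torch.Tensor) -> torch.Tensor:
        # hiddens: (batch, seq_len, hidden_size), the GRU output.
        # Softmax over the time dimension turns the scores into weights.
        attention = torch.softmax(self.score(hiddens), dim=1)
        # Multiplicative attention: the weights re-scale the sequence
        # itself, so the sequence content is preserved.
        return hiddens * attention

# Usage with made-up shapes:
gru_out = torch.randn(8, 40, 256)          # (batch, timesteps, hidden)
x = MultiplicativeAttention(256)(gru_out)  # same shape as gru_out
```

Returning `attention` on its own at the end of `forward` would discard `hiddens` entirely, which is the degenerate case from point 3º.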
So in summary:
- The attention weights must multiply the sequence (`x = hiddens * attention`); the attention output on its own carries no sequence content.
- Expect slower convergence with attention, so it needs more epochs, but it gives better final results.
- `x = attention` trains faster only because it degenerates into a plain linear layer.
Thanks for your detailed explanation! I am not familiar with multiplicative attention; that is probably why I was confused. Thanks a lot again!
Hi! I am very interested in this project and I have learned a lot from it. While browsing the code, I was confused by the line `x = hiddens * attention`. Should it be `x = attention` instead? I tried it on a captcha dataset and found that when using `x = hiddens * attention`, the accuracy is 1% after 20 epochs, while when using `x = attention`, the accuracy is about 77% after 20 epochs. Thank you!