It is a trick to stabilize the training of attention modules. It has little influence on the final performance.
Thanks for your reply.
Hi~ Thanks for your excellent work! I'm confused about an operation in the attention weight calculation.
In the implementation of the attention, there is a small modification which I could not find in the paper.
The code is:
Does this procedure come from a previous study that I haven't read? Does it improve performance?
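
For reference, a minimal sketch of the kind of stabilization trick usually meant here, assuming it is the common max-subtraction before softmax (the function name `stable_softmax_attn` and the exact form are illustrative, not the repository's actual code):

```python
import torch

def stable_softmax_attn(attn_logits: torch.Tensor) -> torch.Tensor:
    # Softmax is shift-invariant, so subtracting the per-row maximum
    # leaves the result mathematically unchanged, but it keeps exp()
    # from overflowing when logits grow large, which stabilizes
    # training of the attention module.
    attn_logits = attn_logits - attn_logits.max(dim=-1, keepdim=True)[0]
    return attn_logits.softmax(dim=-1)

# Example: scaled dot-product attention weights for random q/k.
q = torch.randn(2, 8, 16, 64)  # (batch, heads, queries, head_dim)
k = torch.randn(2, 8, 16, 64)
attn = stable_softmax_attn(q @ k.transpose(-2, -1) / 64 ** 0.5)
```

Because the subtraction cancels inside the softmax, such a trick changes only numerical behavior, not the learned function, which would be consistent with the reply that it has little influence on final performance.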