cmsflash / efficient-attention

An implementation of the efficient attention module.
https://arxiv.org/abs/1812.01243

About Normalization #9

Closed VoyageWang closed 1 year ago

VoyageWang commented 1 year ago

After running the code you provided, I compared it with the traditional self-attention mechanism and found the results quite different: the traditional attention is truly normalized, but the efficient one is not. Can you offer some help?

cmsflash commented 1 year ago

Hi Voyage, what do you mean by the results are quite different? If you try to run the same input through an efficient attention module and a dot-product attention module with softmax normalization, you should expect different outputs as they are not mathematically equivalent. However, we conducted extensive experiments to show that the difference does not impact performance in most cases.
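For concreteness, here is a minimal single-head sketch (illustrative only, not the repo's module; the shapes and the 1/sqrt(d) scaling are my assumptions) showing that the two mechanisms produce different outputs on the same input:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 64, 32                      # token count and channel count (illustrative)
q, k, v = (torch.randn(n, d) for _ in range(3))

# Dot-product attention: row-wise softmax over the n x n score matrix.
dot_out = F.softmax(q @ k.t() / d ** 0.5, dim=-1) @ v

# Efficient attention (softmax variant): softmax q over channels and k over
# positions, then aggregate k^T v first, giving a d x d context matrix.
eff_out = F.softmax(q, dim=-1) @ (F.softmax(k, dim=0).t() @ v)

# The element-wise difference is nonzero: the modules are not equivalent,
# they are only empirically comparable in downstream accuracy.
print((dot_out - eff_out).abs().max())
```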

VoyageWang commented 1 year ago

Thanks for your reply! I think you have perfectly answered my question. I did a small experiment to compare how efficient the standalone module is, and it was about 40 times faster than dot-product attention on the same input. Unfortunately, another problem arose: after I used efficient attention as the basis of my transformer encoder, it ran slower than traditional self-attention, so it does not seem to accelerate my algorithm as expected. Could you help me solve this?
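For reference, a standalone timing comparison might look roughly like the sketch below (purely illustrative: the functions are simplified single-head versions, the shapes are my assumptions, and wall-clock ratios depend heavily on hardware and input sizes):

```python
import time
import torch
import torch.nn.functional as F

def dot_product_attention(q, k, v):
    # n x n attention map, row-wise softmax
    return F.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1) @ v

def efficient_attention(q, k, v):
    # d x d context matrix instead of an n x n attention map
    return F.softmax(q, dim=-1) @ (F.softmax(k, dim=0).t() @ v)

n, d = 8192, 64                    # many tokens, few channels: favors EA
q, k, v = (torch.randn(n, d) for _ in range(3))

for name, fn in [("dot-product", dot_product_attention), ("efficient", efficient_attention)]:
    start = time.perf_counter()
    for _ in range(10):
        fn(q, k, v)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```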

cmsflash commented 1 year ago

To put it simply, the efficiency is not a free lunch. With n tokens, c total channels, and h heads, the complexity of efficient attention (EA) is O(nc^2/h), while that of dot-product attention (DA) is O(n^2c). Therefore, when the token count (resolution) and the head count are low but the channel count is high, EA will be more costly than DA. You should review your setup to see whether this explains your issue.

If so, you can try increasing the number of attention heads, increasing the number of tokens (resolution), or decreasing the number of channels, if appropriate.
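As a rough illustration of this trade-off, here is a back-of-the-envelope cost model under the assumptions above (a sketch of the asymptotic counts, not timings from the repo):

```python
def efficient_attention_cost(n: int, c: int, h: int) -> float:
    """O(n * c^2 / h): each of the h heads builds a (c/h) x (c/h) context matrix."""
    return n * c * c / h

def dot_product_attention_cost(n: int, c: int, h: int) -> float:
    """O(n^2 * c): each of the h heads builds an n x n map over c/h channels."""
    return n * n * c

# Low resolution, many channels, few heads favors DA; raising h or n flips it.
for n, c, h in [(64, 1024, 2), (64, 1024, 32), (4096, 256, 8)]:
    ea = efficient_attention_cost(n, c, h)
    da = dot_product_attention_cost(n, c, h)
    print(f"n={n:5d} c={c:4d} h={h:2d}: EA~{ea:.2e} DA~{da:.2e} -> "
          f"{'EA cheaper' if ea < da else 'DA cheaper'}")
```

Under this model the crossover sits roughly at c ≈ n·h, which is why increasing the head count or the resolution (or reducing the channel count) moves the balance back toward efficient attention.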