cmsflash / efficient-attention

An implementation of the efficient attention module.
https://arxiv.org/abs/1812.01243
MIT License
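For context, here is a minimal single-head sketch of the two attention formulations discussed in the issue below, using plain (n, d_k)-shaped tensors rather than the module's (batch, channels, h*w) layout (illustrative only, not code from the repo):

```python
import torch

def dot_product_attention(q, k, v):
    # Standard scaled dot-product attention: forms an n x n score matrix.
    # q, k: (n, d_k); v: (n, d_v)
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def efficient_attention(q, k, v):
    # Efficient attention (arXiv:1812.01243): softmax q and k separately,
    # then aggregate k^T v first, so no n x n matrix is ever formed.
    q = torch.softmax(q, dim=-1)       # each query normalized over its d_k channels
    k = torch.softmax(k, dim=-2)       # each key channel normalized over n positions
    context = k.transpose(-2, -1) @ v  # (d_k, d_v) global context
    return q @ context
```

The two are not numerically identical: efficient attention replaces the single softmax over q @ k^T with two separate softmaxes, which is what makes the reassociation (and linear memory in n) possible.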

How to approximate scaled dot-product attention with efficient attention? #5

Closed: wangyue7777 closed this issue 2 years ago

wangyue7777 commented 2 years ago

Hi,

Could you tell me how to use efficient attention so that it approximates scaled dot-product attention?

I notice that you apply softmax to both the query and the key. So is it right to set temperature = d_k ** 0.25 and apply key = f.softmax(keys[:, i * head_key_channels: (i + 1) * head_key_channels, :] / temperature, dim=2) and query = f.softmax(queries[:, i * head_key_channels: (i + 1) * head_key_channels, :] / temperature, dim=1) in your code to make it similar to scaled dot-product attention? Since both the query and the key are divided by d_k ** 0.25 before their softmaxes, their product is effectively scaled by d_k ** 0.5, matching the sqrt(d_k) scaling of dot-product attention.
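In code, something like the following (a standalone sketch assuming the tensors are already reshaped to (batch, channels, n) as in your forward pass; the function name and signature are just for illustration):

```python
import torch
import torch.nn.functional as f

def tempered_efficient_attention(queries, keys, values,
                                 head_count, head_key_channels,
                                 head_value_channels):
    """Per-head efficient attention with the proposed temperature.

    queries, keys: (batch, key_channels, n); values: (batch, value_channels, n).
    """
    temperature = head_key_channels ** 0.25  # d_k ** 0.25 per head
    attended_values = []
    for i in range(head_count):
        # Keys: softmax over the n positions (dim=2); queries: softmax over
        # the d_k channels (dim=1); both divided by the temperature first.
        key = f.softmax(
            keys[:, i * head_key_channels:(i + 1) * head_key_channels, :]
            / temperature, dim=2)
        query = f.softmax(
            queries[:, i * head_key_channels:(i + 1) * head_key_channels, :]
            / temperature, dim=1)
        value = values[:, i * head_value_channels:(i + 1) * head_value_channels, :]
        # Aggregate a (d_k, d_v) context first, then distribute it to queries.
        context = key @ value.transpose(1, 2)
        attended_values.append(context.transpose(1, 2) @ query)
    return torch.cat(attended_values, dim=1)  # (batch, value_channels, n)
```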

Thank you!

cmsflash commented 2 years ago

Hi Yue, I believe your implementation is correct. Please don't hesitate to reach out to me if you encounter further problems.

wangyue7777 commented 2 years ago

Hi,

Thank you for your help!