Hello, thanks for your code!

In your X-Transformer code, you construct the weighted sum as `value2 = torch.matmul(alpha_spatial, value2)`, and then multiply the result with `alpha_channel` (from the squeeze-excitation branch) and with `value1`. However, in your paper you first multiply `value1` and `value2` element-wise, and only then take the weighted sum of that product with the attention weights. Could you please clarify this difference?