JDAI-CV / image-captioning

Implementation of 'X-Linear Attention Networks for Image Captioning' [CVPR 2020]
268 stars · 52 forks

question on SCAtt #10

Closed — homelifes closed this issue 4 years ago

homelifes commented 4 years ago

Hello, thanks for your code.

In your code for the X-Transformer, you construct the weighted sum as `value2 = torch.matmul(alpha_spatial, value2)`, and then multiply the result by `alpha_channel` (from squeeze-excitation) and by `value1`. However, in the paper you first multiply `value1` and `value2` element-wise, and then take the weighted sum of that product with the attention weights. Could you please clarify?

Panda-Peter commented 4 years ago

Both implementations are equivalent: since `value1` (and `alpha_channel`) do not depend on the spatial index, the element-wise multiplication distributes through the weighted sum over spatial positions.
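A minimal sketch of why the two orderings agree, using NumPy in place of PyTorch and hypothetical shapes (the actual repo tensors may be shaped differently): `alpha_spatial` holds spatial attention weights, `value2` the per-position values, and `value1` a tensor that is constant across the spatial dimension.

```python
import numpy as np

# Hypothetical shapes for illustration (not the repo's exact code):
#   alpha_spatial: (B, 1, N)  spatial attention weights
#   value1:        (B, 1, D)  values independent of the spatial index
#   value2:        (B, N, D)  per-position values
rng = np.random.default_rng(0)
B, N, D = 2, 5, 8
alpha_spatial = rng.random((B, 1, N))
value1 = rng.random((B, 1, D))
value2 = rng.random((B, N, D))

# Paper's order: element-wise product of value1 and value2 first,
# then the attention-weighted sum over spatial positions.
paper_order = np.matmul(alpha_spatial, value1 * value2)  # value1 broadcasts over N

# Code's order: weighted sum over value2 first, then multiply by value1.
code_order = value1 * np.matmul(alpha_spatial, value2)

# Because value1 is constant over the summed (spatial) axis, it factors
# out of the sum, so the two results are identical up to float rounding.
assert np.allclose(paper_order, code_order)
```

The same argument covers `alpha_channel`: any factor that does not vary with the spatial index can be applied before or after the weighted sum without changing the result.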