JDAI-CV / image-captioning

Implementation of 'X-Linear Attention Networks for Image Captioning' [CVPR 2020]
268 stars · 52 forks

question on SCAtt #10

Closed — homelifes closed this issue 4 years ago

homelifes commented 4 years ago

Hello, thanks for your code.

In your code for the X-Transformer, you construct the weighted sum as `value2 = torch.matmul(alpha_spatial, value2)`, and then multiply the result by `alpha_channel` (from squeeze-excitation) and by `value1`. However, in the paper you first multiply `value1` and `value2` element-wise, and then take the weighted sum of that product with the attention weights. Could you please clarify?

Panda-Peter commented 4 years ago

Both implementations are equivalent: since `value1` (and `alpha_channel`) do not depend on the spatial index, the element-wise multiplication distributes through the weighted sum over spatial positions.
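A minimal sketch of why the two orderings agree, using NumPy in place of PyTorch and hypothetical shapes (the actual repo tensors may be shaped differently): `alpha_spatial` holds spatial attention weights, `value2` the per-position values, and `value1` a tensor that is constant across the spatial dimension.

```python
import numpy as np

# Hypothetical shapes for illustration (not the repo's exact code):
#   alpha_spatial: (B, 1, N)  spatial attention weights
#   value1:        (B, 1, D)  values independent of the spatial index
#   value2:        (B, N, D)  per-position values
rng = np.random.default_rng(0)
B, N, D = 2, 5, 8
alpha_spatial = rng.random((B, 1, N))
value1 = rng.random((B, 1, D))
value2 = rng.random((B, N, D))

# Paper's order: element-wise product of value1 and value2 first,
# then the attention-weighted sum over spatial positions.
paper_order = np.matmul(alpha_spatial, value1 * value2)  # value1 broadcasts over N

# Code's order: weighted sum over value2 first, then multiply by value1.
code_order = value1 * np.matmul(alpha_spatial, value2)

# Because value1 is constant over the summed (spatial) axis, it factors
# out of the sum, so the two results are identical up to float rounding.
assert np.allclose(paper_order, code_order)
```

The same argument covers `alpha_channel`: any factor that does not vary with the spatial index can be applied before or after the weighted sum without changing the result.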