Cyanogenoid / pytorch-vqa

Strong baseline for visual question answering

concat or sum? #19

Closed ozancaglayan closed 5 years ago

ozancaglayan commented 5 years ago

Hello,

Thanks for the implementation. The paper does not detail how the LSTM encoding and the image feature maps are fused; it only provides Figure 2, which says "the concatenated image features and the final state of the LSTM are then used to compute multiple attention distributions over image features". They also draw a Concat box in the diagram that receives the tiled LSTM state and the spatial feature maps as input.

I was experimenting with this idea for another task, where I did the fusion by concatenating along the channel dimension. Looking at your code, however, I see that after tiling the q vector you simply do self.relu(v + q). Did you have some insight about this step, maybe from discussions with the authors?

Thanks!

Cyanogenoid commented 5 years ago

Concatenating two vectors v and q and then applying a linear projection W is the same as projecting v and q individually with W_v and W_q (the two halves of W) and adding the results. Does this picture make it clearer? [diagram: W [v; q] = W_v v + W_q q]
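A minimal sketch of this equivalence (the layer, batch size, and 512/1024 dimensions here are illustrative, not the repo's exact modules): splitting the weight matrix of a linear layer applied to the concatenation into its v half and q half and summing the two projections gives the same output.

```python
# Sketch: projecting the concatenation [v; q] with one weight matrix W
# is the same as projecting v and q separately with W_v and W_q and adding.
import torch
import torch.nn as nn

torch.manual_seed(0)
v = torch.randn(4, 512)   # example image feature vectors
q = torch.randn(4, 512)   # example tiled question features

lin = nn.Linear(1024, 512, bias=False)            # W applied to the concat
W_v, W_q = lin.weight[:, :512], lin.weight[:, 512:]  # split W into its two halves

out_concat = lin(torch.cat([v, q], dim=1))        # W [v; q]
out_sum = v @ W_v.t() + q @ W_q.t()               # W_v v + W_q q

print(torch.allclose(out_concat, out_sum, atol=1e-6))  # True
```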

ozancaglayan commented 5 years ago

Yes, I know about this; in fact I thought about it before posting. So the v (-> 512) and q (-> 512) transformations that you apply before x_conv are the ones that replace the concat + conv, then? I was just not sure about the nonlinearities and how the conv interacts with this concat-vs-sum equivalence in this case.

Cyanogenoid commented 5 years ago

Correct. In this case it is exactly equivalent: there are no nonlinearities affecting this step, and the conv is a 1x1 convolution.
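The same check for the spatial case, as a sketch (module names and the 2048/1024/512 channel sizes below are assumptions for illustration, not necessarily the repo's values): a 1x1 conv is a per-location linear map, so concatenating the tiled question vector with the image features along the channel dimension and applying one 1x1 conv matches applying two separate 1x1 projections and adding; the ReLU applied after either form gives identical results since it acts elementwise on the same pre-activation.

```python
# Sketch: concat along channels + single 1x1 conv  ==  two 1x1 projections + sum,
# with the ReLU applied afterwards in both cases.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
v = torch.randn(2, 2048, 14, 14)      # example spatial image feature maps
q = torch.randn(2, 1024)              # example final LSTM state

conv = nn.Conv2d(2048 + 1024, 512, kernel_size=1, bias=False)
W_v = conv.weight[:, :2048]           # slice of the kernel that sees image channels
W_q = conv.weight[:, 2048:]           # slice of the kernel that sees question channels

q_tiled = q[:, :, None, None].expand(-1, -1, 14, 14)  # tile q over spatial positions

out_concat = F.relu(conv(torch.cat([v, q_tiled], dim=1)))
out_sum = F.relu(F.conv2d(v, W_v) + F.conv2d(q_tiled, W_q))

print(torch.allclose(out_concat, out_sum, atol=1e-4))  # True
```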

ozancaglayan commented 5 years ago

thanks!