Closed ozancaglayan closed 5 years ago
Concatenating two vectors v and q followed by a linear projection W is the same as projecting v and q individually with W_v and W_q, then adding them. Does this picture make it clearer?
Yes, I know about this; in fact I thought about it before posting. So the v (-> 512) and q (-> 512) transformations that you apply before the x_conv are the ones that replace the concat+conv, then?
I was just not sure whether the non-linearities, or the interplay with the conv, would break this equivalence of concat vs. sum in this case.
Correct. In this case it is exactly equivalent: there are no nonlinearities in between, and the conv is a 1x1 convolution, i.e. just a linear projection applied at each spatial position.
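A small NumPy sketch of the identity being discussed (the dimensions 2048 and 512 are hypothetical placeholders, not taken from the repo): concatenating `v` and `q` and projecting with one matrix `W` gives the same result as splitting `W` into `W_v` and `W_q`, projecting each input separately, and summing. Since a 1x1 convolution applies the same linear map at every spatial position, the identity holds per-pixel as well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: image feature v (2048-d), question vector q (512-d),
# both mapped to a 512-d joint space.
d_v, d_q, d_out = 2048, 512, 512
v = rng.standard_normal(d_v)
q = rng.standard_normal(d_q)

# Option 1: concat then a single linear projection W (the "Concat" box).
W = rng.standard_normal((d_out, d_v + d_q))
concat_out = W @ np.concatenate([v, q])

# Option 2: split W column-wise into W_v and W_q, project separately, sum.
W_v, W_q = W[:, :d_v], W[:, d_v:]
sum_out = W_v @ v + W_q @ q

# The two are identical up to floating-point error.
assert np.allclose(concat_out, sum_out)
```

Any nonlinearity applied *before* the projection would not change this; only a nonlinearity sitting between the concat and the projection could break it, and there is none here.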
thanks!
Hello,
Thanks for the implementation. The paper does not detail how the LSTM encoding and the feature maps are fused; it only provides Figure 2, which says "the concatenated image features and the final state of the LSTM are then used to compute multiple attention distributions over image features". They also draw a `Concat` box in the diagram that receives the tiled LSTM state and the spatial feature maps as input.

I was experimenting with this idea for another task, where I did the fusion by concatenating along the channel dimension. But looking at your code, I see that after tiling the `q` vector you simply do `self.relu(v + q)`. Did you have some insight about this step, maybe some discussions with the authors?

Thanks!