target-based image representation & image-based target representation

bajixing commented 2 years ago

Hello, I have a question. The output H^'_V obtained from Equation (1) in the paper is described as the 'target-based image representation'. However, in this Cross Modal Transformer module, Q comes from the image while K and V come from the text. Shouldn't the correct description be 'image-related text representation' or 'image-based text representation'? The same applies to Equation (5).

jefferyYu commented 2 years ago

Hi there,

This is a good question.

First, because we concatenate the textual context and the target together as the input text, we regard the text-side representation as the contextualized target representation.

Second, the name of H^'_V depends on how you look at it, and I think both readings are fine. Our reasoning is as follows: although Q comes from the image representation while K and V come from the contextualized target representation, the output of the Cross Modal Transformer module has the same size as the image representation. Each position of the output is a weighted sum of the contextualized target representation, which can be seen as using the contextualized target representation to represent each object in the image. Therefore, we name it the target-based image representation. Similarly, in Equation 5, we name the output the image-based target representation.

Hope it clarifies your concern.
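
For readers following along, the shape argument above can be checked with a small dependency-free sketch. It assumes plain single-head scaled dot-product attention with no learned projections, residual connections, or layer normalization, so it is a simplification and not the paper's actual module. The function name `cross_attention` and the toy inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (simplified, no projections).

    Returns one output vector per query, plus the attention weights.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        w = softmax(scores)
        # Output is a weighted sum of the value vectors.
        out = [
            sum(w_j * v[c] for w_j, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy inputs: 3 image regions, 2 text tokens, hidden size 2.
image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[2.0, 0.0], [0.0, 4.0]]

# Equation (1) direction: Q from the image, K and V from the text.
h_v, attn = cross_attention(image, text, text)

# Equation (5) direction: Q from the text, K and V from the image.
h_t, _ = cross_attention(text, image, image)

print(len(h_v), len(h_t))
```

In the first call there is one output row per image region, and each row is built only from text vectors. That is consistent with both readings in this thread: the output is indexed by image positions, while its content is drawn from the text side.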

bajixing commented 2 years ago

Thank you for clearing up my confusion, and have a nice day!
