Closed nishantvishwamitra closed 4 years ago
In the MultimodalBertEncoder, the concatenation order of the text and image embeddings does not match the comment:
encoder_input = torch.cat([img_embed_out, txt_embed_out], 1) # Bx(TEXT+IMG)xHID
Good catch, the comment should read (IMG+TEXT). We concatenate the image embeddings first because the text has variable length, while the number of image embeddings is fixed.
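To make the layout concrete, here is a minimal sketch of why putting the image embeddings first gives a stable layout along the sequence dimension. It uses plain Python lists as stand-ins for tensors rather than the repository's actual code; `NUM_IMAGE_EMBEDS`, `concat_image_first`, and `text_span` are hypothetical names chosen for illustration, and the value 3 is an assumption.

```python
# Plain-Python illustration of the sequence layout; lists stand in for tensors.
NUM_IMAGE_EMBEDS = 3  # assumed fixed number of image embeddings per sample


def concat_image_first(img_tokens, txt_tokens):
    """Mimic concatenation along the sequence dimension for one sample:
    image tokens first, then text tokens."""
    return list(img_tokens) + list(txt_tokens)


def text_span(sequence_length, num_image_embeds=NUM_IMAGE_EMBEDS):
    """Return the (start, end) positions of the text tokens in the joined sequence."""
    return (num_image_embeds, sequence_length)


images = ["img0", "img1", "img2"]
short = concat_image_first(images, ["hello"])
long = concat_image_first(images, ["a", "longer", "caption"])

# Image tokens always occupy positions [0, NUM_IMAGE_EMBEDS);
# only the end of the text span moves with the text length.
print(text_span(len(short)))  # (3, 4)
print(text_span(len(long)))   # (3, 6)
```

With this ordering the text always starts at the same offset, regardless of how long it is, which matches the (IMG+TEXT) reading of the shape comment.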