facebookresearch / mmbt

Supervised Multimodal Bitransformers for Classifying Images and Text

Order of text and image embeddings concatenation does not match the comment #2

Closed nishantvishwamitra closed 4 years ago

nishantvishwamitra commented 4 years ago

In the MultimodalBertEncoder, the order in which the text and image embeddings are concatenated does not match the comment:

encoder_input = torch.cat([img_embed_out, txt_embed_out], 1) # Bx(TEXT+IMG)xHID

suvrat96 commented 4 years ago

Good catch, the comment should be (IMG+TEXT). We concatenate the image embeddings first because the text is of variable length, while the number of image embeddings is fixed.
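
For anyone reading this later, here is a minimal sketch of what the image-first order gives you. It is not the repository's code: the sizes `BATCH`, `NUM_IMG`, `HID` and the helper `concat_img_first` are hypothetical names chosen for illustration, and only the `torch.cat` call mirrors the line quoted above (with the corrected shape comment).

```python
import torch

# Hypothetical sizes for illustration; none of these values come from the repository.
BATCH, NUM_IMG, HID = 2, 3, 8


def concat_img_first(img_embed_out, txt_embed_out):
    # Same call as quoted in the issue, with the corrected shape comment.
    return torch.cat([img_embed_out, txt_embed_out], 1)  # Bx(IMG+TEXT)xHID


img_embed_out = torch.randn(BATCH, NUM_IMG, HID)

shapes = []
for txt_len in (5, 11):  # the text length varies from batch to batch
    txt_embed_out = torch.randn(BATCH, txt_len, HID)
    encoder_input = concat_img_first(img_embed_out, txt_embed_out)
    # The image embeddings always occupy positions 0..NUM_IMG-1,
    # regardless of how long the text is.
    assert torch.equal(encoder_input[:, :NUM_IMG], img_embed_out)
    assert torch.equal(encoder_input[:, NUM_IMG:], txt_embed_out)
    shapes.append(tuple(encoder_input.shape))

print(shapes)
```

With the fixed-size block first, the image slice can be taken with a constant index, while the text positions simply follow it and grow with the text length.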