facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

Question about Image Text Alignment #676

Closed g-luo closed 1 year ago

g-luo commented 4 years ago

Hi, I was wondering what the structure of the vector provided to the image_text_alignment field should be (https://github.com/facebookresearch/mmf/blob/7ce17a58e7b61b1bc2fc7384c1974e60967bd9fa/mmf/modules/embeddings.py#L369)?

From my understanding, it's a vector of dimension num_bounding_boxes in which each entry is the index of the corresponding token (output by the BERT tokenizer). I was wondering if you could clarify the following:

More context: this feature seems to be inspired by the Visual Commonsense Reasoning dataset, from the VisualBERT paper:

The dataset also provides alignments between words and bounding regions that are referenced to in the text, which we utilize by using the same position embeddings for matched words and regions.
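To make the structure concrete, here is a minimal sketch (not MMF code) of what such an alignment tensor could look like and how matched regions could reuse token position information. The padding value of -1 and the averaging of aligned token positions are assumptions based on the masking logic in mmf/modules/embeddings.py; alignment_number (the maximum number of aligned tokens per box) is a hypothetical name.

```python
# Hypothetical sketch: build an image_text_alignment row per bounding box,
# padded with -1, then derive a position id for each region by averaging
# the positions of its aligned tokens (an assumption, not MMF's exact code).

def build_alignment(box_to_token_spans, alignment_number):
    """box_to_token_spans: one list of aligned token indices per box."""
    alignment = []
    for spans in box_to_token_spans:
        padded = spans[:alignment_number] + [-1] * (alignment_number - len(spans))
        alignment.append(padded)
    return alignment

def region_position_ids(alignment):
    """Average the positions of aligned tokens for each box (0 if none)."""
    ids = []
    for row in alignment:
        valid = [t for t in row if t != -1]
        ids.append(sum(valid) / len(valid) if valid else 0)
    return ids

# Two boxes: box 0 aligned to tokens 3 and 4, box 1 aligned to token 7.
alignment = build_alignment([[3, 4], [7]], alignment_number=3)
# alignment == [[3, 4, -1], [7, -1, -1]]
print(region_position_ids(alignment))  # [3.5, 7.0]
```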

apsdehal commented 4 years ago

Hi, apologies for the late reply. We don't currently use this in MMF as VCR is not implemented.

You can find more details of how it is originally created in VisualBERT in the original repo at https://github.com/uclanlp/visualbert/blob/master/dataloaders/vcr.py#L357.

> does VisualBERT not use information like object labels output by the detection model (and would adding this alignment therefore boost performance)?

It is hard to say whether this would help. VisualBERT doesn't use masked region modeling (MRM), but ViLBERT does; MRM tries to predict object labels for masked regions. We haven't seen strong empirical evidence that MRM is a good pretraining task. Image-text alignment (predicting whether an image matches the corresponding text) is a good pretraining task for retrieval tasks, though.
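The image-text matching objective mentioned above can be sketched as a binary classifier on a pooled multimodal representation: a minimal, self-contained example, assuming a logistic head and binary cross-entropy (the function names and toy vectors here are illustrative, not MMF's implementation).

```python
# Minimal sketch of the image-text matching (ITM) pretraining objective:
# a binary classifier predicts whether the caption matches the image.
import math

def itm_score(pooled, weights, bias):
    """sigmoid(w . pooled + b): probability that image and text match."""
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def itm_loss(score, is_match):
    """Binary cross-entropy; negatives typically come from pairing an
    image with a randomly sampled caption during pretraining."""
    return -math.log(score) if is_match else -math.log(1.0 - score)

pooled = [0.2, -0.5, 1.0]   # stand-in for a [CLS]-style pooled vector
score = itm_score(pooled, [0.5, 0.1, 0.3], bias=0.0)
print(round(score, 3))
```

The loss pushes the score toward 1 for true pairs and 0 for mismatched ones, which is why this objective transfers well to retrieval tasks.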