facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

Question about Image Text Alignment #676

Closed g-luo closed 1 year ago

g-luo commented 4 years ago

Hi, I was wondering what the structure of the vector provided to the image_text_alignment field should be (https://github.com/facebookresearch/mmf/blob/7ce17a58e7b61b1bc2fc7384c1974e60967bd9fa/mmf/modules/embeddings.py#L369)?

From my understanding, it's a vector of dimension num_bounding_boxes in which each entry is the index of the corresponding token (output by the BERT tokenizer). I was wondering if you could clarify the following:

More context: this feature seems to be inspired by the Visual Commonsense Reasoning dataset, from the VisualBERT paper:

The dataset also provides alignments between words and bounding regions that are referenced to in the text, which we utilize by using the same position embeddings for matched words and regions.
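To make the structure concrete, here is a minimal sketch (not MMF code) of what such an alignment tensor could look like and how matched regions could reuse token position information. The padding value of -1 and the averaging of aligned token positions are assumptions based on the masking logic in mmf/modules/embeddings.py; alignment_number (the maximum number of aligned tokens per box) is a hypothetical name.

```python
# Hypothetical sketch: build an image_text_alignment row per bounding box,
# padded with -1, then derive a position id for each region by averaging
# the positions of its aligned tokens (an assumption, not MMF's exact code).

def build_alignment(box_to_token_spans, alignment_number):
    """box_to_token_spans: one list of aligned token indices per box."""
    alignment = []
    for spans in box_to_token_spans:
        padded = spans[:alignment_number] + [-1] * (alignment_number - len(spans))
        alignment.append(padded)
    return alignment

def region_position_ids(alignment):
    """Average the positions of aligned tokens for each box (0 if none)."""
    ids = []
    for row in alignment:
        valid = [t for t in row if t != -1]
        ids.append(sum(valid) / len(valid) if valid else 0)
    return ids

# Two boxes: box 0 aligned to tokens 3 and 4, box 1 aligned to token 7.
alignment = build_alignment([[3, 4], [7]], alignment_number=3)
# alignment == [[3, 4, -1], [7, -1, -1]]
print(region_position_ids(alignment))  # [3.5, 7.0]
```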

apsdehal commented 4 years ago

Hi, apologies for the late reply. We don't currently use this in MMF as VCR is not implemented.

You can find more details of how it is originally created in VisualBERT in the original repo at https://github.com/uclanlp/visualbert/blob/master/dataloaders/vcr.py#L357.

> does VisualBERT not use information like object labels output by the detection model (and would adding this alignment therefore boost performance)?

It is hard to say whether this would help. VisualBERT doesn't use masked region modeling (MRM), but ViLBERT does; MRM tries to predict object labels for masked regions. We haven't seen strong empirical evidence that MRM is a good pretraining task. Image-text alignment (predicting whether an image matches the corresponding text) is a good pretraining task for retrieval tasks, though.
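The image-text matching objective mentioned above can be sketched as a binary classifier on a pooled multimodal representation: a minimal, self-contained example, assuming a logistic head and binary cross-entropy (the function names and toy vectors here are illustrative, not MMF's implementation).

```python
# Minimal sketch of the image-text matching (ITM) pretraining objective:
# a binary classifier predicts whether the caption matches the image.
import math

def itm_score(pooled, weights, bias):
    """sigmoid(w . pooled + b): probability that image and text match."""
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def itm_loss(score, is_match):
    """Binary cross-entropy; negatives typically come from pairing an
    image with a randomly sampled caption during pretraining."""
    return -math.log(score) if is_match else -math.log(1.0 - score)

pooled = [0.2, -0.5, 1.0]   # stand-in for a [CLS]-style pooled vector
score = itm_score(pooled, [0.5, 0.1, 0.3], bias=0.0)
print(round(score, 3))
```

The loss pushes the score toward 1 for true pairs and 0 for mismatched ones, which is why this objective transfers well to retrieval tasks.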