g-luo closed this issue 1 year ago
Hi, apologies for the late reply. We don't currently use this in MMF as VCR is not implemented.
You can find more details on how it was originally created for VisualBERT in the original repo at https://github.com/uclanlp/visualbert/blob/master/dataloaders/vcr.py#L357.
Does VisualBERT not use information like the object labels output by the detection model (and would adding this alignment therefore boost performance)?
It is hard to say whether this would affect performance. VisualBERT doesn't use masked region modeling (MRM), but ViLBERT does: it tries to predict object labels for masked regions. We haven't seen strong empirical evidence that MRM is a good pretraining task. Image-text alignment (whether an image matches its corresponding text) is a good pretraining task for retrieval tasks, though.
Hi, I was wondering what the structure of the vector provided to the image_text_alignment field should be (https://github.com/facebookresearch/mmf/blob/7ce17a58e7b61b1bc2fc7384c1974e60967bd9fa/mmf/modules/embeddings.py#L369)?
From my understanding, it's a vector with one entry per bounding box, where each entry holds the index of the corresponding token (as output by the BERT tokenizer). I was wondering if you could clarify the following:
More context: this feature seems to be inspired by the Visual Commonsense Reasoning dataset, from the VisualBERT paper:
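
To make the question concrete, here is a minimal sketch of the vector as understood above: one entry per bounding box, holding the index of the token that mentions that box. This is only an illustration of that reading, not MMF's or VisualBERT's confirmed format. The helper `build_alignment`, the `-1` padding value, the `[CLS]` token at position 0, and the choice to point at the first sub-token of a word are all assumptions, and the WordPiece splits are hand-written rather than produced by a real tokenizer.

```python
from typing import Dict, List


def build_alignment(
    tokens_per_word: List[List[str]],
    box_to_word: Dict[int, int],
    num_boxes: int,
    pad: int = -1,  # assumed padding value for boxes with no mention
) -> List[int]:
    """Return one token index per bounding box (hypothetical helper)."""
    # Position 0 is assumed to be taken by [CLS], so the text starts at 1.
    word_start = []
    pos = 1
    for pieces in tokens_per_word:
        word_start.append(pos)  # index of the word's first sub-token
        pos += len(pieces)
    return [
        word_start[box_to_word[box]] if box in box_to_word else pad
        for box in range(num_boxes)
    ]


# Hand-written sub-token splits for "the umbrella is red".
tokens_per_word = [["the"], ["um", "##bre", "##lla"], ["is"], ["red"]]

# Box 0 is mentioned by word 1 ("umbrella"), box 2 by word 3 ("red");
# box 1 has no mention in the text.
alignment = build_alignment(tokens_per_word, {0: 1, 2: 3}, num_boxes=3)
print(alignment)  # [2, -1, 6]
```

If a box can be mentioned by several tokens, each entry would need to be a list of indices rather than a single integer, which is one of the points that needs confirming.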