facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

Multimodal Alignment loss for Vilbert/VisualBERT #466

Closed vedanuj closed 1 year ago

vedanuj commented 4 years ago

🚀 Feature

Multimodal Alignment or Sentence Image Prediction loss for ViLBERT/VisualBERT

Motivation

The ViLBERT model uses two pretraining losses. The current MMF implementation includes the masked multimodal modeling loss but not the multimodal alignment loss. VisualBERT uses a similar objective, which its authors call the sentence-image prediction loss. The task is to add this multimodal alignment loss to the ViLBERT and VisualBERT models.
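For reference, a minimal sketch of what the alignment (image-text matching) objective looks like in PyTorch. This is an illustration based on the original ViLBERT/VisualBERT papers, not MMF's actual head API; the class name and the `pooled_output` input are assumptions:

```python
import torch
import torch.nn as nn


class AlignmentHead(nn.Module):
    """Binary classifier over the pooled multimodal representation.

    Predicts whether the image and text in a pair actually belong together
    (label 1) or were mismatched by negative sampling (label 0).
    """

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, pooled_output: torch.Tensor, is_matched: torch.Tensor):
        # pooled_output: (batch, hidden_size); is_matched: (batch,) long tensor of 0/1
        logits = self.classifier(pooled_output)
        loss = self.loss_fn(logits, is_matched)
        return logits, loss
```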

Pitch

To reproduce the ViLBERT/VisualBERT results, the multimodal alignment loss needs to be added. It will also be important for extending these models to downstream retrieval tasks.

Additional context

The task involves adding this loss to the models, making any required dataset-side changes, and testing that the implementation works as expected.
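The dataset-side change is typically negative sampling: with some probability, the aligned caption is swapped for a caption from another image, and the label records whether the pair matches. A rough sketch under that assumption (the function and its arguments are illustrative, not MMF's actual dataset code):

```python
import random


def sample_caption(index: int, captions: list, neg_prob: float = 0.5):
    """Return a (caption, is_matched) pair for the image at `index`.

    With probability `neg_prob`, the aligned caption is replaced by a
    caption from a different image, producing a negative example for the
    alignment / sentence-image prediction objective.
    """
    if random.random() < neg_prob:
        # Pick a caption belonging to any other image as a mismatched negative.
        neg_index = random.choice([i for i in range(len(captions)) if i != index])
        return captions[neg_index], 0
    return captions[index], 1
```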

Priyanshiguptaaa commented 3 years ago

Hi @vedanuj @ytsheng, I would like to give this a try if the feature still needs work. If so, is there any other information or any resources, apart from this thread, that I should know about?

Thanks!

apsdehal commented 3 years ago

@Priyanshiguptaaa Thanks for working on this. The PR #961 added it as a general pretraining head but didn't add it to VisualBERT and ViLBERT, so that task is still open. We would ideally want a config option in VisualBERT/ViLBERT to enable ITM loss. Please try to reuse the transformer head that was implemented in #961.
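A rough sketch of how a config-gated ITM head might be wired into the model's forward pass. The flag name `use_itm_head` and the wrapper class are hypothetical, and the actual change should reuse the transformer head added in #961 rather than define a new one:

```python
import torch
import torch.nn as nn


class VisualBERTWithITM(nn.Module):
    """Illustrative wrapper: attach an ITM head only when enabled in the config."""

    def __init__(self, encoder: nn.Module, config):
        super().__init__()
        self.encoder = encoder
        # `use_itm_head` is a hypothetical config flag, for illustration only.
        self.itm_head = (
            nn.Linear(config.hidden_size, 2) if config.use_itm_head else None
        )

    def forward(self, sample, is_matched=None):
        # The encoder is assumed to return a dict containing a pooled output.
        outputs = self.encoder(**sample)
        losses = {}
        if self.itm_head is not None and is_matched is not None:
            logits = self.itm_head(outputs["pooled_output"])
            losses["itm_loss"] = nn.functional.cross_entropy(logits, is_matched)
        return outputs, losses
```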

pragyasrivastava0805 commented 3 years ago

Hi @apsdehal, I think I can do this. I will have to contribute this as part of an initial screening for a research intern position. Hope you understand.

Farhan-jafri commented 3 years ago

Hi, I want to work on this issue. This is my first contribution, so could you please guide me?