Closed: vedanuj closed this issue 1 year ago
Hi @vedanuj @ytsheng, I would like to give this a try if the feature still needs work. If so, is there any other information, or are there resources beyond this thread, that I should know about?
Thanks!
@Priyanshiguptaaa Thanks for working on this. The PR #961 added it as a general pretraining head but didn't add it to VisualBERT and ViLBERT, so that task is still open. We would ideally want a config option in VisualBERT/ViLBERT to enable ITM loss. Please try to reuse the transformer head that was implemented in #961.
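For illustration, here is a minimal sketch of what a config-gated ITM head could look like on the model side. The names (`ITMHead`, `PretrainingHeads`, the `use_itm_loss` flag) are illustrative assumptions, not the existing MMF API; the actual change should reuse the transformer head from #961 rather than define a new one.

```python
import torch
from torch import nn


class ITMHead(nn.Module):
    """Binary classifier over the pooled multimodal representation:
    does this image-text pair match (1) or not (0)?"""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, pooled_output: torch.Tensor) -> torch.Tensor:
        return self.classifier(pooled_output)


class PretrainingHeads(nn.Module):
    """Only builds and applies the ITM head when the (hypothetical)
    config flag ``use_itm_loss`` is enabled."""

    def __init__(self, hidden_size: int, use_itm_loss: bool = False):
        super().__init__()
        self.use_itm_loss = use_itm_loss
        self.itm_head = ITMHead(hidden_size) if use_itm_loss else None

    def forward(self, pooled_output, itm_labels=None):
        losses = {}
        if self.use_itm_loss and itm_labels is not None:
            logits = self.itm_head(pooled_output)
            losses["itm_loss"] = nn.functional.cross_entropy(logits, itm_labels)
        return losses
```

The head's loss would then be summed with the existing masked multimodal modeling loss during pretraining, and skipped entirely when the flag is off.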
Hi @apsdehal, I think I can do this. I will be contributing this as part of an initial screening for a research intern position. Hope you understand.
Hi, I want to work on this issue. This is my first contribution; could you please guide me?
🚀 Feature
Multimodal Alignment or Sentence Image Prediction loss for ViLBERT/VisualBERT
Motivation
The ViLBERT model uses two pretraining losses. The current MMF implementation uses the masked multimodal modeling loss but not the multimodal alignment loss. The VisualBERT model uses a similar loss, which it calls the sentence-image prediction loss. The task is to add this multimodal alignment loss to the ViLBERT and VisualBERT models.
Pitch
To reproduce the ViLBERT/VisualBERT results, the multimodal alignment loss should be added. It will also be important for extending these models to retrieval downstream tasks.
Additional context
The task involves adding this loss to the models, making any required dataset-side changes, and testing that the implementation works as expected.
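As a rough sketch of the dataset-side change (assuming captioned image-text pairs): with some probability, the dataset swaps in a caption from a different image and emits a 0/1 alignment label for the model's ITM head. The function name, sample keys, and `replace_prob` below are illustrative assumptions, not existing MMF code.

```python
import random


def add_itm_label(sample, all_captions, replace_prob=0.5):
    """Return (caption, label): label 1 = aligned pair, 0 = mismatched pair."""
    if random.random() < replace_prob and len(set(all_captions)) > 1:
        # Swap in a caption from a different example to build a negative pair.
        negative = random.choice(all_captions)
        while negative == sample["caption"]:
            negative = random.choice(all_captions)
        return negative, 0
    return sample["caption"], 1
```

The label produced here would be passed to the model as the ITM target, and the loss computed only when the config option discussed above is enabled.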