❓ Questions and Help
I would like to use the multimodal alignment task in VisualBERT, ViLBERT and MMBT.
According to this issue 1, this still needs to be implemented. However, something similar appears to have been provided already here 2. Could I use 2 as a reference for implementing 1?
There is also an `image_text_alignment` tensor in the model definitions of MMBT and VisualBERT. What is it used for?
It would be very helpful if someone could explain what needs to be done in order to use the multimodal alignment task with the three models.