facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

What does Unimodal Pretraining for VisualBERT mean? #670

Closed ritwickchaudhry closed 4 years ago

ritwickchaudhry commented 4 years ago

❓ Questions and Help

We're trying to extend the VisualBERT model to include extra features for each object for the Hateful Memes Challenge. I don't understand what exactly the unimodal pre-training (config file path: projects/hateful_memes/configs/visual_bert/direct.yaml) for VisualBERT means.

In particular: (i) What parameters are included in the pre-trained weights? (I want to know if I can still use the pre-trained weights for the textual part and train the image part.) (ii) What was the pre-training task for the unimodal pre-training?

vedanuj commented 4 years ago

Unimodal pretraining means the base model is initialized with BERT pretrained weights. For this, we initialize the weights from the bert-base-uncased model from the Huggingface transformers library.
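For reference, a minimal sketch of what this initialization looks like with Huggingface transformers (illustrative only; MMF wires this up internally through its own configs):

```python
from transformers import BertConfig, BertModel

# The default BertConfig matches the bert-base-uncased architecture
# (768 hidden units, 12 layers, 12 attention heads).
config = BertConfig()
model = BertModel(config)  # randomly initialized weights

# "Unimodal pretraining" means loading pretrained weights instead,
# roughly equivalent to:
# model = BertModel.from_pretrained("bert-base-uncased")
print(config.hidden_size, config.num_hidden_layers)  # 768 12
```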

ritwickchaudhry commented 4 years ago

So does that mean that the pre-trained weights are only used for the textual part? In other words, if we replace the visual object embeddings with some other embeddings, it should work fine?

vedanuj commented 4 years ago

So does that mean that the pre-trained weights are only used for the textual part?

Not exactly. The whole transformer model is initialized with BERT pretrained weights, so both the image and text inputs benefit from them, depending on the model architecture. Without BERT weight initialization, these models do not perform well when trained from scratch.
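To illustrate why both modalities use the BERT weights: in VisualBERT-style models, the text token embeddings and the visual region embeddings are concatenated along the sequence axis and passed through one shared transformer, so the BERT-initialized layers attend over both. A hedged sketch with made-up shapes (not MMF's actual code):

```python
import torch

hidden = 768  # bert-base hidden size

# Illustrative inputs: 20 text tokens and 36 detected visual regions,
# both already projected to the transformer's hidden size.
text_emb = torch.randn(1, 20, hidden)
visual_emb = torch.randn(1, 36, hidden)

# One combined sequence goes through the single BERT-initialized
# transformer, so its weights act on both modalities.
sequence = torch.cat([text_emb, visual_emb], dim=1)
print(sequence.shape)  # torch.Size([1, 56, 768])
```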

if we replace the visual object embeddings with some other embeddings, it should work fine?

You can replace the visual embeddings with other embeddings and it should work. What type of embeddings are you trying to use?

ritwickchaudhry commented 4 years ago

We are trying to concatenate the visual embeddings from a detector with embeddings from a concept graph like ConceptNet. So the size of the embeddings would change too.

vedanuj commented 4 years ago

Yes, that should work fine. Let us know if you face any issues.
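One way the size change discussed above can be handled is a linear projection that maps the concatenated detector + concept-graph features back to the transformer's hidden size, leaving the BERT-initialized layers untouched. The dimensions below are illustrative assumptions (2048-d detector features, 300-d ConceptNet vectors), not values from MMF:

```python
import torch
import torch.nn as nn

visual_dim, concept_dim, hidden = 2048, 300, 768

# Hypothetical projection from the fused embedding size down to the
# transformer hidden size.
proj = nn.Linear(visual_dim + concept_dim, hidden)

regions = torch.randn(1, 36, visual_dim)    # detector features
concepts = torch.randn(1, 36, concept_dim)  # concept-graph vectors

fused = torch.cat([regions, concepts], dim=-1)  # (1, 36, 2348)
visual_emb = proj(fused)                        # (1, 36, 768)
print(visual_emb.shape)
```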

ritwickchaudhry commented 4 years ago

Certainly will do. Thanks a lot for your help!