Unimodal pretraining means the base model is initialized with pretrained BERT weights. For this, we initialize the weights from the bert-base-uncased model in the Hugging Face Transformers library.
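For reference, a minimal sketch of what that initialization amounts to, assuming the Hugging Face transformers API (the `visual_bert` attribute in the comment is hypothetical, only to show where the weights would be copied):

```python
# Minimal sketch: load the text-only (unimodal) pretrained BERT weights.
from transformers import BertModel, BertTokenizer

bert = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A VisualBERT-style model then copies the matching weights into its own
# transformer before multimodal training, e.g. (hypothetical attribute name):
# visual_bert.bert.load_state_dict(bert.state_dict(), strict=False)
```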
So does that mean that the pre-trained weights are only used for the textual part? In other words, if we replace the visual object embeddings with some other embeddings, it should work fine?
So does that mean that the pre-trained weights are only used for the textual part?
Not exactly. The whole transformer model is initialized with pretrained BERT weights, so both the image and the text inputs go through those weights, depending on the model architecture. Without this BERT initialization, these models do not perform well when trained from scratch.
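To make that concrete, here is a rough sketch (not MMF's actual code; names and dimensions are illustrative) of how both modalities end up going through the same BERT-initialized transformer:

```python
# Rough sketch: visual features are projected into BERT's hidden size,
# concatenated with the text token embeddings, and the combined sequence
# is run through the BERT-initialized encoder.
import torch
import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
hidden_size = bert.config.hidden_size        # 768 for bert-base-uncased

# Any visual (or other) embedding can be plugged in here, as long as it is
# projected to the transformer's hidden size.
visual_proj = nn.Linear(2048, hidden_size)   # 2048 = assumed detector feature dim

def encode(input_ids, visual_features):
    text_embeds = bert.embeddings(input_ids=input_ids)         # (B, T, 768)
    visual_embeds = visual_proj(visual_features)                # (B, N, 768)
    sequence = torch.cat([text_embeds, visual_embeds], dim=1)   # (B, T+N, 768)
    return bert.encoder(sequence).last_hidden_state
```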
if we replace the visual object embeddings with some other embeddings, it should work fine?
You can replace the visual embeddings with other embeddings and it should work. What type of embeddings are you trying to use?
We are trying to concatenate the visual embeddings from a detector with embeddings from a concept graph like ConceptNet. So the size of the embeddings would change too.
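Roughly something like the sketch below (just an illustration; the dimensions and layer names are assumptions on our side):

```python
# Sketch: concatenate detector features with ConceptNet embeddings per object,
# then project the larger vector into the transformer's hidden size.
import torch
import torch.nn as nn

DETECTOR_DIM = 2048   # e.g. region features from a Faster R-CNN detector (assumed)
CONCEPT_DIM = 300     # e.g. ConceptNet Numberbatch embeddings (assumed)
HIDDEN_SIZE = 768     # bert-base-uncased hidden size

visual_proj = nn.Linear(DETECTOR_DIM + CONCEPT_DIM, HIDDEN_SIZE)

def embed_objects(detector_feats, concept_embeds):
    # detector_feats: (batch, num_objects, DETECTOR_DIM)
    # concept_embeds: (batch, num_objects, CONCEPT_DIM)
    combined = torch.cat([detector_feats, concept_embeds], dim=-1)
    return visual_proj(combined)              # (batch, num_objects, HIDDEN_SIZE)
```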
Yes that should be fine and should work. Let us know if you face any issues.
Certainly will do. Thanks a lot for your help!
❓ Questions and Help
We're trying to extend the VisualBERT model to include extra features for each object for the Hateful Memes Challenge. I'm unable to understand what exactly the unimodal pre-training (config file path: projects/hateful_memes/configs/visual_bert/direct.yaml) for VisualBERT means. In particular: (i) What parameters are included in the pre-trained weights? (I want to know whether I can still use the pre-trained weights for the textual part and train the image part.) (ii) What was the pre-training task for the unimodal pre-training?