facebookresearch / vilbert-multi-task

Multi Task Vision and Language
MIT License

Transfer learning with vilbert #19

Open johntiger1 opened 4 years ago

johntiger1 commented 4 years ago

From my understanding, we get visuo-linguistic embeddings using VilBert (and LXMERT and VL-Bert, for that matter). Is it possible to simply use these as a layer/feature-extractor backbone for visual/linguistic tasks? For instance, if we wanted to add a linear classifier (or an LSTM) on top of the VilBert embeddings, are there any pretrained weights provided?

Thanks

johntiger1 commented 4 years ago

Also, I would like to note that you provide https://github.com/facebookresearch/vilbert-multi-task/blob/master/vilbert_tasks.yml, which should satisfy most use cases. But say we just want the weights themselves, given that we want to do arbitrary additional development on top of them.

leyuan commented 4 years ago

I would like to know too! To my understanding, we should be able to extract the output just before the downstream task heads?

btw, hi @johntiger1, nice to see you here :)

vedanuj commented 4 years ago

If you check the code in vilbert/vilbert.py, you can directly use the output of the self.bert layer without any classifier head. You can load the weights of the self.bert part into another model that has a different head, such as an LSTM.

https://github.com/facebookresearch/vilbert-multi-task/blob/14dc9423177455b85af28c0ffa92b1b775ebff96/vilbert/vilbert.py#L1652
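
For illustration, here is a minimal sketch (not from the repo) of that idea: keep only the self.bert encoder, load its pretrained weights from a downloaded checkpoint, and put a fresh linear head on top. The config path, the checkpoint filename, the bi_hidden_size field, and the element-wise-product fusion of the two pooled outputs are assumptions to check against your local copy of vilbert/vilbert.py and the checkpoint you actually use.

import torch
import torch.nn as nn

from vilbert.vilbert import BertConfig, BertModel


class VilbertWithLinearHead(nn.Module):
    # Reuses the pretrained joint encoder (self.bert) and adds a small task head,
    # mirroring how the task models in vilbert.py keep the encoder under self.bert.
    def __init__(self, config, num_labels):
        super().__init__()
        self.bert = BertModel(config)  # same attribute name as in the repo's task models
        # bi_hidden_size is assumed to be the fused vision-language hidden size in the config
        self.classifier = nn.Linear(config.bi_hidden_size, num_labels)

    def forward(self, *inputs, **kwargs):
        # self.bert returns: sequence_output_t, sequence_output_v,
        # pooled_output_t, pooled_output_v, all_attention_mask
        _, _, pooled_t, pooled_v, _ = self.bert(*inputs, **kwargs)
        # Element-wise product of the two pooled vectors is one simple fusion choice
        return self.classifier(pooled_t * pooled_v)


config = BertConfig.from_json_file("config/bert_base_6layer_6conect.json")
model = VilbertWithLinearHead(config, num_labels=2)

# Copy only the "bert." weights from a pretrained multi-task checkpoint
# ("pretrained_model.bin" is a placeholder for whatever checkpoint you downloaded).
state_dict = torch.load("pretrained_model.bin", map_location="cpu")
bert_only = {k: v for k, v in state_dict.items() if k.startswith("bert.")}
model.load_state_dict(bert_only, strict=False)

From there, the new head (or an LSTM in its place) can be trained on your own task, with the encoder either frozen or fine-tuned jointly.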

engmubarak48 commented 4 years ago

@johntiger1 That is a good question. I was also checking their demo notebook and couldn't see a way to do this. I only saw the FeatureExtractor class, which extracts features from the image, but not the joint image-text representation. Have you managed to make any progress on this? Thanks.

enaserianhanzaei commented 3 years ago

@johntiger1 @leyuan I guess you have already figured it out, but as vedanuj mentioned, if you check the line linked above in vilbert.py, you can see that it returns the embeddings for all the image features and text features:

sequence_output_t, sequence_output_v, pooled_output_t, pooled_output_v, all_attention_mask = self.bert(...

As mentioned in the paper, the first item in the sequence of image features is the holistic image representation (and likewise for the text). Therefore, you can extract the embedding of the image, which is now learnt jointly with the text, via

sequence_output_v[:, 0]

You can also access the individual layers by setting output_all_encoded_layers=True, and then get the image embedding from a specific layer by:

sequence_output_v[layerno][:, 0]
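
Putting the pieces above together, here is a rough sketch; the input tensor names and the exact argument list of self.bert are placeholders to check against vilbert/vilbert.py, and the inputs themselves would come from the repo's tokenizer and FeatureExtractor.

# Assumed placeholder inputs: text token ids, region features, box coordinates,
# and the corresponding attention masks.
sequence_output_t, sequence_output_v, pooled_output_t, pooled_output_v, _ = model.bert(
    input_txt,
    input_imgs,
    image_loc,
    token_type_ids=token_type_ids,
    attention_mask=attention_mask,
    image_attention_mask=image_attention_mask,
    output_all_encoded_layers=False,
)

image_embedding = sequence_output_v[:, 0]  # holistic image representation (first IMG position)
text_embedding = sequence_output_t[:, 0]   # holistic text representation ([CLS] position)

# With output_all_encoded_layers=True the sequence outputs become lists of
# per-layer tensors, so a specific layer's image embedding is:
#     sequence_output_v[layerno][:, 0]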

enaserianhanzaei commented 3 years ago

@engmubarak48 @leyuan @johntiger1

Hi guys,

I wrote a step-by-step tutorial on how to set up the environment and how to train and test this model. I also added a section on extracting the visiolinguistic embeddings from image-text data: https://naserian-elahe.medium.com/vilbert-a-model-for-learning-joint-representations-of-image-content-and-natural-language-47f56a313a79
I would very much appreciate any comments or suggestions.