**johntiger1** opened this issue 4 years ago
Also, I'd like to note that you have https://github.com/facebookresearch/vilbert-multi-task/blob/master/vilbert_tasks.yml, which should satisfy most use cases. But suppose we just want the weights themselves, given that we want to do arbitrary additional development.
I would like to know too! To my understanding, we should be able to extract the output prior to the downstream tasks?
btw, hi @johntiger1, nice to see you here :)
If you check the code in `vilbert/vilbert.py`, you can directly use the output of the `self.bert` layer without any classifier head. You can load the weights of the `self.bert` part into another model that has a different head, such as an LSTM.
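A minimal sketch of that weight-transfer idea, using tiny stand-in modules rather than the real ViLBERT classes (`TinyBackbone`, `PretrainedModel`, and `CustomModel` are illustrative names, not from this repo): keep only the `bert.*` entries of the pretrained state dict and load them non-strictly into a new model whose head differs.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in for the pretrained `self.bert` backbone."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 4)
    def forward(self, x):
        return self.proj(x)

class PretrainedModel(nn.Module):
    """Stand-in for the full pretrained model with a task head."""
    def __init__(self):
        super().__init__()
        self.bert = TinyBackbone()
        self.cls = nn.Linear(4, 2)   # task-specific head we want to discard

class CustomModel(nn.Module):
    """New model: same backbone, but an LSTM head instead."""
    def __init__(self):
        super().__init__()
        self.bert = TinyBackbone()
        self.lstm = nn.LSTM(4, 3, batch_first=True)

pretrained = PretrainedModel()
custom = CustomModel()

# Keep only the backbone weights and load them non-strictly,
# so the missing head parameters are simply left at their init values.
bert_state = {k: v for k, v in pretrained.state_dict().items()
              if k.startswith("bert.")}
custom.load_state_dict(bert_state, strict=False)

same = torch.equal(custom.bert.proj.weight, pretrained.bert.proj.weight)
print(same)  # True: the backbone weights were copied; the LSTM head stays random
```

With the real checkpoint you would filter its state dict the same way and load it into your own module that exposes a `bert` submodule of matching shape.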
@johntiger1 That is a good question. I was also checking their demo notebook and couldn't see a way to do this. I only saw the FeatureExtractor class, which extracts features from the image, but not their combination. Have you managed to work on this? Thanks
@johntiger1 @leyuan I guess you have already figured it out, but as vedanuj mentioned, if you check line 652 in `vilbert.py`, you can see that it returns the embeddings for all the image features and text features:

```python
sequence_output_t, sequence_output_v, pooled_output_t, pooled_output_v, all_attention_mask = self.bert(...
```
As mentioned in the paper, the first item in the sequence of image features is the holistic image representation (the same holds for the text). Therefore, you can extract the embedding of the image, which is now learnt jointly with the text, via:

```python
sequence_output_v[:, 0]
```
You can also access the different layers by setting `output_all_encoded_layers=True`, and then get the image embedding from a specific layer by:

```python
sequence_output_v[layerno][:, 0]
```
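A toy illustration of that indexing with random tensors in place of the real model outputs (the batch size, region count, and hidden size below are assumptions for demonstration only):

```python
import torch

# Illustrative shapes: (batch, num_image_regions, hidden_dim)
batch, num_regions, hidden = 2, 37, 1024
sequence_output_v = torch.randn(batch, num_regions, hidden)

# Position 0 holds the holistic image representation,
# so indexing it yields one vector per example in the batch.
img_embedding = sequence_output_v[:, 0]
print(img_embedding.shape)  # torch.Size([2, 1024])

# With output_all_encoded_layers=True the model returns a list of
# per-layer outputs; index the layer first, then position 0.
all_layers = [torch.randn(batch, num_regions, hidden) for _ in range(6)]
layerno = -1  # e.g. the last layer
img_embedding_layer = all_layers[layerno][:, 0]
print(img_embedding_layer.shape)  # torch.Size([2, 1024])
```

The same slicing works on `sequence_output_t` if you want the holistic text representation instead.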
@engmubarak48 @leyuan @johntiger1
Hi guys,
I wrote a step-by-step tutorial on how to set up the environment, train and test this model. I also added a section on extracting the visiolinguistic embeddings from the image-text data. https://naserian-elahe.medium.com/vilbert-a-model-for-learning-joint-representations-of-image-content-and-natural-language-47f56a313a79 I very much appreciate any comments or suggestions
From my understanding, we get visuo-linguistic embeddings using VilBert (and LXMERT and VL-Bert for that matter too). Is it possible to simply use these as a layer/feature extractor backbone for visual/linguistic tasks? For instance, if we wanted to add a linear classifier (or LSTM) on top of the VilBert embeddings, are there any provided pretrained weights?
Thanks