facebookresearch / vilbert-multi-task

Multi Task Vision and Language
MIT License
800 stars 180 forks

How to use VilBert pretrained for Caption-Based Image Retrieval #54

Open JoanFM opened 4 years ago

JoanFM commented 4 years ago

I would like to know if the pre-trained model given by this link (https://dl.fbaipublicfiles.com/vilbert-multi-task/pretrained_model.bin) can be used for Caption-Based Image Retrieval.

My first guess is that I can load the model using (not sure if the configuration file is the proper one):

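# Assumes the repository root is on PYTHONPATH so the vilbert package is importable
from vilbert.vilbert import BertConfig, VILBertForVLTasks
config = BertConfig.from_json_file('config/bert_base_6layer_6conect.json')
model = VILBertForVLTasks.from_pretrained('pretrained_model.bin', config=config)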

Afterwards, while digging through the code, I saw that running the inner BERT model should give the sequence outputs for both the text and the image.

I have several questions:

I hope I made myself clear

Thank you very much

rom1504 commented 3 years ago

Hi, I'm wondering whether you tried it and have some insights to share? I'm interested in the same thing.

JoanFM commented 3 years ago

Hey @rom1504, I did not get any feedback, so I did not proceed with this paper. I found another paper for Caption-Based Image Retrieval, https://github.com/fartashf/vsepp, with a really nice and easy implementation. (The results claimed in that paper are not as strong, but it is good for a first implementation of such a system.)
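
For anyone comparing the two approaches: VSE++ embeds images and captions into a shared vector space and ranks images by their cosine similarity to the caption embedding. Below is a minimal sketch of that ranking step only, assuming the embeddings have already been computed by some model. The function names `cosine` and `rank_images` and the toy vectors are hypothetical and are not taken from either repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_images(caption_emb, image_embs):
    """Return image indices sorted from most to least similar to the caption."""
    scores = [cosine(caption_emb, img) for img in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)

# Toy example with made-up 3-d embeddings (illustrative values only).
caption = [1.0, 0.0, 0.0]
images = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_images(caption, images)
print(ranking)
```

In a real system the vectors would come from the trained image and text encoders, and the similarity would usually be computed as one batched matrix product rather than in a Python loop.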

enaserianhanzaei commented 3 years ago

@JoanFM @rom1504

Hi guys,

I wrote a step-by-step tutorial on how to set up the environment and how to train and test this model. I also added a section on extracting the visiolinguistic embeddings from the image-text data: https://naserian-elahe.medium.com/vilbert-a-model-for-learning-joint-representations-of-image-content-and-natural-language-47f56a313a79 I would very much appreciate any comments or suggestions.

shivangibithel commented 2 years ago

@enaserianhanzaei I followed your tutorial for Image Retrieval, but I am getting very low final recall values. Do you have any idea what could have gone wrong here?
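
When recall looks unexpectedly low, one thing that may be worth ruling out is the metric computation itself, for example image ids in the ranked lists that do not line up with the ground-truth ids. The following is a minimal sketch of recall@K for caption-based retrieval, assuming one correct image per caption. The function name `recall_at_k` and the toy data are hypothetical; this is not the repository's evaluation code.

```python
def recall_at_k(rankings, ground_truth, k):
    """Fraction of captions whose correct image appears in the top-k results.

    rankings[i] is the list of image ids retrieved for caption i, best first.
    ground_truth[i] is the id of the single correct image for caption i.
    """
    hits = sum(1 for ranked, gt in zip(rankings, ground_truth) if gt in ranked[:k])
    return hits / len(ground_truth)

# Toy data: three captions whose correct image is ranked 1st, 2nd and 3rd.
rankings = [[3, 1, 2], [2, 3, 1], [1, 2, 3]]
ground_truth = [3, 3, 3]

r_at_1 = recall_at_k(rankings, ground_truth, 1)
r_at_2 = recall_at_k(rankings, ground_truth, 2)
r_at_3 = recall_at_k(rankings, ground_truth, 3)
print(r_at_1, r_at_2, r_at_3)
```

Running a check like this on a handful of hand-verified caption-image pairs can help separate a metric or indexing problem from a genuinely weak model.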