How to use VilBert pretrained for Caption-Based Image Retrieval

JoanFM commented 4 years ago

I would like to know if the pre-trained model given by this link (https://dl.fbaipublicfiles.com/vilbert-multi-task/pretrained_model.bin) can be used for Caption-Based Image Retrieval.

My first guess is that I can load the model using (not sure if the configuration file is the proper one):

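# Import path assumed from the repository's evaluation scripts
from vilbert.vilbert import BertConfig, VILBertForVLTasks
config = BertConfig.from_json_file('config/bert_base_6layer_6conect.json')
model = VILBertForVLTasks.from_pretrained('pretrained_model.bin', config=config)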

From digging through the code, it looks like running the inner BERT model returns the sequence outputs for both the text and the image.

I have several questions:

1. As explained on page 4 of the paper, how can I extract the outputs hIMG and hCLS from these sequences? Can I assume they are the first element of each corresponding sequence?
2. Since training aims to predict whether these two representations are aligned, can we expect the image embedding (hIMG) and the text embedding (hCLS) to have a large cosine similarity (or some relation under another distance metric)?
3. Would the model fail if either the text or the image is not provided? I would like to use it to extract one feature or the other without providing both inputs.
4. Does the model expect a complete input image and handle object detection internally, or does it expect the meaningful regions to be extracted as a preprocessing step? If detection is expected to happen inside the model, what is the image_loc parameter supposed to be?
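
For context on the cosine-similarity question: the ViLBERT paper describes the alignment prediction as a linear layer applied to the element-wise product of hIMG and hCLS, which is not the same operation as cosine similarity. Below is a minimal pure-Python sketch contrasting the two. The toy vectors, the weights `w`, the bias `b`, and the function names are all hypothetical and chosen for illustration; none of them come from the repository or the checkpoint.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_logit(h_img, h_cls, w, b):
    """Linear layer applied to the element-wise product of the two vectors."""
    joint = [x * y for x, y in zip(h_img, h_cls)]
    return sum(wi * ji for wi, ji in zip(w, joint)) + b

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy 4-d stand-ins for hIMG and hCLS (hypothetical values).
h_img = [1.0, 0.0, 2.0, -1.0]
h_cls = [0.5, 1.0, 1.0, 1.0]

# Hypothetical learned weights and bias of the alignment head.
w = [1.0, 1.0, -1.0, 2.0]
b = 0.1

cos = cosine_similarity(h_img, h_cls)
logit = alignment_logit(h_img, h_cls, w, b)
prob = sigmoid(logit)

print(f"cosine similarity: {cos:.3f}")
print(f"alignment logit:   {logit:.3f}")
print(f"alignment prob:    {prob:.3f}")
```

With these made-up numbers the cosine similarity is positive while the alignment probability is below 0.5, which illustrates why a high alignment score under a learned head does not have to coincide with a large cosine similarity between the raw vectors.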

I hope I made myself clear

Thank you very much

rom1504 commented 3 years ago

Hi, I'm wondering whether you tried it and have any insights to share? I'm interested in the same thing.

JoanFM commented 3 years ago

Hey @rom1504, I did not get any feedback, so I did not proceed with this paper. I found another paper for Caption-Based Image Retrieval, https://github.com/fartashf/vsepp, with a really nice and easy implementation. (The results claimed in the paper are not as strong, but it is good for a first implementation of such a system.)

enaserianhanzaei commented 3 years ago

@JoanFM @rom1504

Hi guys,

I wrote a step-by-step tutorial on how to set up the environment and how to train and test this model. I also added a section on extracting the visiolinguistic embeddings from image-text data: https://naserian-elahe.medium.com/vilbert-a-model-for-learning-joint-representations-of-image-content-and-natural-language-47f56a313a79 I would very much appreciate any comments or suggestions.

shivangibithel commented 2 years ago

@enaserianhanzaei I followed your tutorial for Image Retrieval, but I am getting very low final recall values. Do you have any idea what might be going wrong here?
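
For reference when debugging recall numbers, here is a minimal pure-Python sketch of how recall@K is commonly computed for caption-based image retrieval. It is not the repository's evaluation code. It assumes a hypothetical score matrix where `scores[i][j]` is the similarity of caption `i` to image `j`, and that the correct image for caption `i` has index `i`; if the caption-to-image pairing in the real data is different, the metric has to follow that pairing instead.

```python
def recall_at_k(scores, k):
    """Fraction of captions whose ground-truth image is among the top-k results.

    Assumption: scores[i][j] is the similarity of caption i to image j,
    and the ground-truth image for caption i is image i.
    """
    hits = 0
    for i, row in enumerate(scores):
        # Image indices sorted from highest to lowest score for this caption.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(scores)

# Toy 3x3 score matrix (made-up values).
scores = [
    [0.9, 0.1, 0.3],  # caption 0: correct image ranked first
    [0.2, 0.4, 0.8],  # caption 1: correct image ranked second
    [0.7, 0.2, 0.6],  # caption 2: correct image ranked second
]

r1 = recall_at_k(scores, 1)
r2 = recall_at_k(scores, 2)
print(f"R@1 = {r1:.3f}, R@2 = {r2:.3f}")
```

Running a check like this on a handful of pairs whose correct match is known by eye can help separate a scoring problem from a mismatch between caption indices and image indices.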

facebookresearch / vilbert-multi-task
