fawazsammani / nlxgpt

NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks, CVPR 2022 (Oral)

Pre-training step #6

Closed · dschaehi closed this issue 2 years ago

dschaehi commented 2 years ago

Hi @fawazsammani,

Since the repo provides pretrained models but not a script for pretraining, I am wondering which splits to use when pretraining on the four datasets mentioned in the paper (i.e., COCO Captions, Flickr30k, Visual Genome and image paragraph captioning). I think this is not well described in the paper. Would I need to split the datasets for pre-training, or can I pre-train the model on the entire datasets without splitting?

fawazsammani commented 2 years ago

@dschaehi I will provide the splits and pretrain script tonight. You cannot simply pretrain the model on the whole datasets, because the VQA-X test set is taken from COCO images (and possibly Visual Genome, roughly 50% of which is COCO), and the e-SNLI-VE test set is taken from Flickr30k. Both COCO and Flickr30k are used for pretraining.

Pretrained VL-models always exclude these test images from the pretraining dataset, because the finetuning uses the same dataset, just in a different way. For example, it is absolutely wrong to pretrain a VL-model with the masked language modelling objective, where the model sees the whole caption (except the randomly chosen masked words), and then later finetune this VL-model on the image captioning task, because the pretraining step has already seen the test captions that the finetuned model is supposed to predict. In summary, when the same dataset is used for pretraining and finetuning, regardless of the task, the finetuning test set should be excluded from the pretraining dataset.

In our case, the pretraining dataset (image captioning) is completely different from the finetuning dataset (Natural Language Explanations); it's just that the images are shared. Whether or not it is fair to use the finetuning test images during pretraining is debatable. But the general principle is that the test set should be something the model has never seen before and has no idea about. Essentially, allowing the model to learn about these finetuning NLE test images through a different route (e.g. image captioning) distills knowledge about these images into the pretrained model. Therefore, pretraining with the NLE test images is wrong, and we avoided it.
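For concreteness, the filtering step could look roughly like the sketch below. This is only an illustrative sketch, not the script we use; the file names and the JSON layout with "image_id"/"caption" keys are assumptions.

```python
# Illustrative sketch: exclude every image that appears in an NLE test split
# from the pretraining captions. File names and JSON layout are assumed.
import json

def load_test_image_ids(nle_test_files):
    """Collect the image ids used by the NLE test splits (e.g. VQA-X, e-SNLI-VE)."""
    ids = set()
    for path in nle_test_files:
        with open(path) as f:
            for sample in json.load(f):
                ids.add(sample["image_id"])
    return ids

def filter_pretraining_annotations(caption_file, excluded_ids, out_file):
    """Keep only captions whose image does not occur in any NLE test split."""
    with open(caption_file) as f:
        annotations = json.load(f)
    kept = [a for a in annotations if a["image_id"] not in excluded_ids]
    with open(out_file, "w") as f:
        json.dump(kept, f)
    print(f"{caption_file}: kept {len(kept)} of {len(annotations)} captions")

# Hypothetical usage with assumed file names
excluded = load_test_image_ids(["vqax_test.json", "esnlive_test.json"])
filter_pretraining_annotations("coco_captions.json", excluded, "filtered_coco_captions.json")
```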

Hope this is clear now. Regards

dschaehi commented 2 years ago

Hi @fawazsammani, thank you again for your answer!

> @dschaehi I will provide the splits and pretrain script tonight.

This is great. Thanks!

> In our case, the pretraining dataset (image captioning) is completely different from the finetuning dataset (Natural Language Explanations); it's just that the images are shared. Whether or not it is fair to use the finetuning test images during pretraining is debatable.

I find this a bit hard to follow. If I understand correctly, only the images from the fine-tuning datasets are shared with the pre-training datasets, and this is OK (though debatable) because the two are for different tasks, i.e., image captioning vs. NLE?

fawazsammani commented 2 years ago

@dschaehi correct, but we avoid this.

Regards

dschaehi commented 2 years ago

Hi @fawazsammani, thanks for the clarification so far. In your first reply to my question you said you would provide the pre-training script that same night. If you haven't uploaded the script yet, could you do so soon? It would be very helpful for reproducing the results and for learning more about the details of the pre-training step. Thanks!

fawazsammani commented 2 years ago

Hi again @dschaehi, I'm really sorry, I forgot to post it last time. I'm currently on vacation, and unfortunately I do not have my office computer with me. I will be back on Friday and post it right away.

However, if you need the pretrained model, it is already available in the Models section. I do not see any need to train it again and waste computational resources since we already did :)

Regards Fawaz

dschaehi commented 2 years ago

Hi @fawazsammani, thanks for getting back to this. Please enjoy your vacation first. I am mainly interested in how such a pre-training step works in general, as I'd like to develop a new model as well. Regardless of this, I think fully reproducible code should contain all the steps: pre-training, hyper-parameter tuning, fine-tuning, random seeds, etc.

fawazsammani commented 2 years ago

Hello @dschaehi, sorry for the delay again. I have now uploaded the pretrain script. The pretrain annotations are also here. As mentioned in the earlier discussion, we use the "filtered" annotations, with the prefix filtered_. The split sizes are also provided and compared in a txt file. I am also uploading the unfiltered annotations in case you need them for a project different from NLE (one which does not share images between pretraining and finetuning). Please also note that for e-SNLI-VE we do not use the pretrained model as initialization for the finetuning, so the complete Flickr30k can be included in the pretraining as well.
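If you want to sanity-check the filtered splits against the unfiltered ones, something like the following would do it. This is a rough sketch; the exact file names are assumptions, only the filtered_ prefix follows the convention above.

```python
# Rough sanity check: compare filtered vs. unfiltered annotation counts.
# File names are assumed; only the filtered_ prefix matches the convention above.
import json

pairs = [
    ("coco_captions.json", "filtered_coco_captions.json"),
    ("flickr30k_captions.json", "filtered_flickr30k_captions.json"),
    ("vg_captions.json", "filtered_vg_captions.json"),
    ("image_paragraph_captions.json", "filtered_image_paragraph_captions.json"),
]

for unfiltered, filtered in pairs:
    with open(unfiltered) as f:
        n_all = len(json.load(f))
    with open(filtered) as f:
        n_kept = len(json.load(f))
    print(f"{unfiltered}: {n_all} annotations -> {n_kept} after removing NLE test images")
```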

Feel free to reopen this issue if you have any other doubts.

Regards Fawaz

dschaehi commented 2 years ago

Great! Thank you very much!