facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

Questions about COCO features in zoo #645

Closed g-luo closed 4 years ago

g-luo commented 4 years ago

❓ Questions and Help

Hello, I was just wondering what the models used to extract the coco_trainval2014 features (both the resnet-152 and detectron versions) were trained on. I.e., were these models ever trained on the COCO dataset?

I seem to be getting data leakage: my validation data (split from the COCO training set) performs much better than test (for a VisualBERT model) when using the detectron features. My hunch is that the feature extraction model may have been trained on COCO, which would result in leakage.

g-luo commented 4 years ago

Resolved. They seem to be trained on Visual Genome, as described in https://arxiv.org/pdf/2004.08744.pdf:

"We extract 2048D region based image features from fc6 layer of a ResNeXT-152 based Faster- RCNN model [42, 54] trained on Visual Genome [22] with the attribute prediction loss following [3]"

apsdehal commented 4 years ago

Yes, they are trained on VisualGenome. VisualGenome does have a small overlap with the COCO 2014 validation set.

g-luo commented 4 years ago

Thanks!

g-luo commented 4 years ago

@apsdehal I had another question -- what is the difference between VisualBERT with a pretraining vs classification head?

Additionally, what does it mean if I were to run training on a VisualBERT model with a classification head, with no resume file / model provided? Does this train every layer of the model from scratch or just the last layer?

For some reason when I run with a classification head, I get oddly high accuracies. On the other hand, with a pretraining head I start out with a total loss = 0 which looks weird.

vedanuj commented 4 years ago

> I had another question -- what is the difference between VisualBERT with a pretraining vs classification head?

The VisualBERT model can be used for pretraining (with a Masked Language Modelling objective, etc.) or for classification tasks like VQA. The difference is which losses are computed and which output from the base model is used in these specific heads.

> Additionally, what does it mean if I were to run training on a VisualBERT model with a classification head, with no resume file / model provided? Does this train every layer of the model from scratch or just the last layer?

In this case it will train every layer of your model on the task you are training on. This model uses a cross entropy loss on the classification labels of your dataset.
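For intuition, the classification setup is roughly the following (an illustrative sketch, not the exact MMF code): the pooled output of the multimodal transformer goes through a linear layer, and cross entropy is computed against your dataset's labels.

```python
# Illustrative sketch of a VisualBERT-style classification head.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, hidden_size=768, num_labels=2):
        super().__init__()
        # Maps the pooled [CLS] output of the multimodal transformer to logits.
        self.classifier = nn.Linear(hidden_size, num_labels)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, pooled_output, labels=None):
        logits = self.classifier(pooled_output)
        loss = self.loss_fn(logits, labels) if labels is not None else None
        return logits, loss

# Example: a batch of 4 pooled outputs with binary labels.
head = ClassificationHead()
pooled = torch.randn(4, 768)
labels = torch.tensor([0, 1, 1, 0])
logits, loss = head(pooled, labels)
```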

> For some reason when I run with a classification head, I get oddly high accuracies. On the other hand, with a pretraining head I start out with a total loss = 0 which looks weird.

This is expected. During classification you use the target labels from your dataset. When you are running with the pretraining head it doesn't use labels; it only does self-supervised training with MLM (Masked Language Modelling).
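To make the MLM part concrete, here is a rough sketch of how a masked-LM loss is typically computed (again illustrative, not MMF's exact implementation). The loss is only scored at masked token positions, so if a dataset supplies no masked LM labels there is nothing to predict and the reported loss can sit at 0.

```python
# Illustrative sketch: masked language modelling loss.
import torch
import torch.nn.functional as F

def mlm_loss(prediction_scores, masked_lm_labels, ignore_index=-1):
    # prediction_scores: (batch, seq_len, vocab_size)
    # masked_lm_labels:  (batch, seq_len), ignore_index at unmasked positions
    active = masked_lm_labels != ignore_index
    if not active.any():
        # Nothing was masked, so there is nothing to train on.
        return torch.tensor(0.0)
    return F.cross_entropy(prediction_scores[active], masked_lm_labels[active])

scores = torch.randn(2, 6, 30522)                  # fake vocab-size logits
labels = torch.full((2, 6), -1, dtype=torch.long)  # no tokens masked at all
print(mlm_loss(scores, labels))                    # tensor(0.)
```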

g-luo commented 4 years ago

@vedanuj Thank you so much for the reply!

The main reason I started this thread is because I'm trying to debug a weird issue related to my validation accuracy. I was wondering if you had any insight into the behavior that is occurring:

Context: I am running VisualBERT on a new dataset (http://foilunitn.github.io), which essentially uses COCO 2014 images and generates its own set of annotations. The training set is from COCO2014 train and the test is from COCO2014 val. For this reason I have been using the COCO features included in MMF.

This is a binary classification task where the dataset is balanced. I generated my own validation set from the training set and ensured that all images and annotations between training / validation are disjoint.
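Roughly, I build the split like this (a sketch; the annotation path and field names are placeholders for my own format):

```python
# Sketch of the train/val split: group by COCO image id so an image never
# appears in both splits. Path and field names are placeholders.
import json
import random
from collections import defaultdict

random.seed(13)

with open("foil_train_annotations.json") as f:
    annotations = json.load(f)   # assumed: list of dicts with an "image_id"

by_image = defaultdict(list)
for ann in annotations:
    by_image[ann["image_id"]].append(ann)

image_ids = list(by_image)
random.shuffle(image_ids)
val_images = set(image_ids[: int(0.1 * len(image_ids))])

train_split = [a for img, anns in by_image.items() if img not in val_images for a in anns]
val_split = [a for img, anns in by_image.items() if img in val_images for a in anns]

# Sanity check: the image sets are disjoint.
assert not ({a["image_id"] for a in train_split} & {a["image_id"] for a in val_split})
```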

Issue: My validation accuracy is consistently high (much higher than my test accuracy). I.e., the validation accuracy will be around 95% by the end of training, but test will yield 74%. I've attached my config and logging file as well.

Archive.zip

I suspect that somehow the validation data is being trained on, but I'm really not sure why I'm getting these weird results.

Thank you so much for taking the time to reply to my questions!

apsdehal commented 4 years ago

This can probably happen, as VisualGenome does have some overlap with val2014. But in general this doesn't affect the test-dev performance on VQA2. We will have a look at your configs, but expect a delay in response as this is a very custom request.
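If you want to quantify the overlap for your own split, something along these lines should work (a sketch; it assumes you have Visual Genome's image_data.json, which records a coco_id for images shared with COCO, and that your validation annotations store COCO image ids):

```python
# Sketch: count how many validation images also appear in Visual Genome.
import json

with open("image_data.json") as f:            # Visual Genome image metadata
    vg_images = json.load(f)
vg_coco_ids = {img["coco_id"] for img in vg_images if img.get("coco_id")}

with open("my_val_annotations.json") as f:    # placeholder path
    val_anns = json.load(f)
val_ids = {ann["image_id"] for ann in val_anns}

overlap = val_ids & vg_coco_ids
print(f"{len(overlap)} of {len(val_ids)} validation images also appear in Visual Genome")
```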

apsdehal commented 4 years ago

Another question: Is the validation set that you sampled from train roughly balanced?

g-luo commented 4 years ago

Correct, I made sure that all sets (training, validation, test) are exactly balanced (50/50 of each class). I also tried both sets of COCO features available in the MMF zoo (default / resnet-152), and both exhibited similar behavior.

Just to confirm, was the resnet-152 feature extractor also trained on Visual Genome? Additionally, do you know of any pretrained models that were not trained on COCO that I can potentially look into?

Thank you so much for looking into this!

g-luo commented 4 years ago

If it's helpful, all my code is in this repository: https://github.com/g-luo/foil_mmf (config: https://github.com/g-luo/foil_mmf/blob/master/configs/foil_zoo.yaml)

g-luo commented 4 years ago

@apsdehal Hello! I just wanted to add a comment that this may be an issue with the dataset itself, which I have been looking into.

I just had one last question about the models within the zoo -- is MMBT the only multimodal model in MMF that can process image features, or are there others? I also noticed that it uses a pretrained resnet152 model (https://pytorch.org/docs/stable/torchvision/models.html), which has only been trained on ImageNet (no VisualGenome), for feature extraction. Is that correct?

I was just wondering if you could confirm this, then I'll just close the issue. Thanks so much for looking into this and for all your help!

apsdehal commented 4 years ago

Hi, by image features, do you mean direct images or features extracted from images? Currently, MMBT and MMF Transformer can take direct images as inputs. They will be converted to features based on the image encoder used (which can be a resnet152 model, as you mentioned).
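For example, the MMBT usage from the MMF docs looks roughly like this (the pretrained key below is the Hateful Memes one, so treat it as a usage sketch rather than a recipe for your FOIL setup):

```python
from mmf.models.mmbt import MMBT

# Load an MMBT model that works on raw images (Hateful Memes checkpoint).
model = MMBT.from_pretrained("mmbt.hateful_memes.images")

# classify() takes a raw image (path or URL) plus text; the image is passed
# through the configured image encoder internally, so no pre-extracted
# features are needed.
output = model.classify("my_image.jpg", "a caption to check")
print(output)  # e.g. {"label": 0, "confidence": 0.98}
```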