facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

Finetune Visual BERT pretrained on COCO on VQA2 #621

Closed g-luo closed 4 years ago

g-luo commented 4 years ago

Hello! I'm trying to fine-tune Visual BERT on VQA2, but I'm getting this error: "Key input_ids not found in the SampleList". [screenshot of the error]

Here are my config files: Archive.zip

I was running the command:

MMF_USER_DIR="." mmf_run config=./configs/other/vqa2.yaml \
    model=visual_bert \
    dataset=vqa2 \
    run_type=train_val \
    checkpoint.resume_pretrained=True \
    checkpoint.resume_zoo=visual_bert.pretrained.coco \
    training.num_workers=0

Thanks!

vedanuj commented 4 years ago

Hi can you paste a detailed error log?

g-luo commented 4 years ago
[screenshot of the detailed error log]
vedanuj commented 4 years ago

You will need to override the text_processor to use type: bert_tokenizer. I can see you are already doing that in your configs. Make sure the indentation in the YAML file is correct and try again.

g-luo commented 4 years ago

Thanks so much! My issue was that I was missing

processors:
  text_processor:
    type: bert_tokenizer
    params:
      tokenizer_config:
        type: bert-base-uncased
        params:
          do_lower_case: true
      mask_probability: 0
      max_seq_length: 128

What is the difference between the processors defined in the dataset config and the ones in the actual train_val config? Is it that fields in the dataset config are used when MMF is building the database, while things in the train_val config are used in the training itself?

g-luo commented 4 years ago

@vedanuj I was also wondering if I could return bounding box info from VisualBERT on VQA2. I was looking through previous issues, and it looks like it will work if I add the following to the config; could you confirm?

return_features_info: true
transformer_bbox_processor:
  type: transformer_bbox
  params:
    bbox_key: bbox
    image_width_key: image_width
    image_height_key: image_height

Thanks!

vedanuj commented 4 years ago

@g-luo Processors defined in the dataset config are the default ones. You can override them in your train/experiment config.

For bounding box yes you can use transformer_bbox_processor. You can check the implementation of the processor to make sure that is the information you need in your model.
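
For example, an override in your own experiment config might look roughly like this (just a sketch assuming the standard layout where VQA2 options live under dataset_config.vqa2; adjust to your setup):

dataset_config:
  vqa2:
    return_features_info: true   # also return feature info such as bounding boxes
    processors:
      text_processor:
        type: bert_tokenizer
        params:
          tokenizer_config:
            type: bert-base-uncased
            params:
              do_lower_case: true
          mask_probability: 0
          max_seq_length: 128
      transformer_bbox_processor:
        type: transformer_bbox
        params:
          bbox_key: bbox
          image_width_key: image_width
          image_height_key: image_height

Anything you put here takes precedence over the same keys in the dataset's default config.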

g-luo commented 4 years ago

@vedanuj I also had a question about the pretraining process: is it possible to reduce the number of files generated in the "save" folder? I noticed that the "models" folder inside takes up a few GB, and it would be nice if MMF could skip generating that folder and just output the best ckpt.

apsdehal commented 4 years ago

@g-luo You can set the checkpoint.max_to_keep option to a lower number like 2 to keep at most 2 checkpoints at any given time. https://github.com/facebookresearch/mmf/blob/master/mmf/configs/defaults.yaml#L273
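
In YAML form that would be something like the following in your experiment config (a minimal sketch; see the checkpoint section of defaults.yaml linked above for the full set of options):

checkpoint:
  # Keep at most 2 intermediate checkpoints in save/models at any time.
  max_to_keep: 2

You can also pass it as a command-line override, e.g. checkpoint.max_to_keep=2, like the other options in your mmf_run command.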

g-luo commented 4 years ago

@apsdehal Thanks! I had another question related to Visual BERT and its embeddings. I'm trying to run the pretrained model from the original GitHub repo for VisualBERT (https://github.com/uclanlp/visualbert), but I'm running into an issue where the embedding shape is incorrect.

For context, visual_embedding_dim is 1024 for their model. I was wondering if it may have something to do with the features I'm using; I just used the standard MMF COCO features that are included in the zoo.

Edit: I believe this issue is because the visual embedding dim the model was trained on is different from the one in the zoo.

g-luo commented 4 years ago

@apsdehal I had one last question about the COCO features in the zoo (e.g. mmf://datasets/coco/defaults/features/trainval2014.tar.gz): I noticed that the bounding boxes have associated object IDs, but I can't figure out what vocabulary they correspond to. How can I get the mapping between object IDs and words?

apsdehal commented 4 years ago

@g-luo Check out the comment at this line: https://github.com/facebookresearch/mmf/blob/master/tools/scripts/features/extract_features_vmb.py#L3

For the other question, you figured it out right. It won't work with their model out of the box; you will have to add a projection layer to project our features to 1024 before passing them to the model.
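
A rough sketch of that projection, assuming the zoo COCO features are 2048-dimensional (check the shape of your extracted features) and the uclanlp checkpoint expects visual_embedding_dim of 1024; the variable names here are just illustrative:

import torch
from torch import nn

# Newly initialized projection from MMF region features to the 1024-d
# visual embeddings the uclanlp VisualBERT checkpoint was trained with.
visual_projection = nn.Linear(2048, 1024)

# features: (batch_size, num_regions, 2048) loaded from the MMF feature files
features = torch.randn(1, 100, 2048)
visual_embeddings = visual_projection(features)  # -> (1, 100, 1024)

Since this layer is randomly initialized, it would need to be trained (or at least fine-tuned) together with the rest of the model.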

g-luo commented 4 years ago

Thanks so much!

The corresponding vocabulary is https://dl.fbaipublicfiles.com/pythia/data/visual_genome_categories.json for anyone looking at this thread.