OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

KeyError: 'ema' during inference on VQA #360

Closed · jun297 closed this 1 year ago

jun297 commented 1 year ago

Hi, I pretrained OFA-tiny on my own TSV file containing only VQA data (or a TSV file containing only captions). For example, a row looks like this:

1 000002b66c9c498e what is the danger for an object in the given image? 1.0|!+Person trips over Table relation qa

After pretraining, I tried to evaluate my model on the provided VQA dataset, and I got the following error with both vqa_val.tsv and vqa_test.tsv:

Traceback (most recent call last):
  File "../../evaluate.py", line 184, in <module>
    cli_main()
  File "../../evaluate.py", line 178, in cli_main
    distributed_utils.call_main(
  File "/workspace/mindmap/fairseq/fairseq/distributed/utils.py", line 376, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/workspace/mindmap/fairseq/fairseq/distributed/utils.py", line 350, in distributed_main
    main(cfg, **kwargs)
  File "../../evaluate.py", line 110, in main
    model.load_state_dict(checkpoint_utils.load_ema_from_checkpoint(ckpt_path)['model'])
  File "/workspace/mindmap/utils/checkpoint_utils.py", line 856, in load_ema_from_checkpoint
    model_params = new_state['extra_state']['ema']
KeyError: 'ema'

I guess I missed some parameters (for EMA) during pretraining; however, I did not change the scripts much.
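Since the pretraining script below does not pass any EMA-related options, the saved checkpoint presumably has no extra_state['ema'] entry, which is exactly what load_ema_from_checkpoint looks for. A minimal sketch to confirm this (the checkpoint path is just a placeholder):

# Sketch: inspect a fairseq/OFA checkpoint and report whether EMA weights were saved.
import torch

ckpt_path = "./checkpoints/checkpoint_last.pt"  # placeholder, use your actual checkpoint

state = torch.load(ckpt_path, map_location="cpu")
extra_state = state.get("extra_state") or {}
print("extra_state keys:", list(extra_state.keys()))
print("has EMA weights:", "ema" in extra_state)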

This is the pretraining script I use:


#!/usr/bin/env bash

# The port for communication. Note that if you want to run multiple tasks on the same machine,
# you need to specify different port numbers.
export MASTER_PORT=1052
export CUDA_VISIBLE_DEVICES=0,1,2
export GPUS_PER_NODE=3

bpe_dir=../../utils/BPE
user_dir=../../ofa_module
num_threads=4
restore_file=../../checkpoints/ofa_base.pt

data_dir=/media/ssd1/users/jhkim/datasets/mindmap/pretrain_data
neg_sample_dir=${data_dir}/negative_sample
data=${data_dir}/train_vision_language_qa.tsv

selected_cols=0,1,2,3,4,5,6,7

task=unify_task
arch=ofa_tiny
criterion=adjust_label_smoothed_cross_entropy
label_smoothing=0.0
lr=1e-4
max_epoch=50
warmup_ratio=0.01
batch_size=8
update_freq=1
resnet_drop_path_rate=0.0
encoder_drop_path_rate=0.1
decoder_drop_path_rate=0.1
dropout=0.1
attention_dropout=0.0
max_src_length=80
max_tgt_length=30
num_bins=1000
patch_image_size=384
sample_patch_num=196
max_image_size=512

save_path=./checkpoints

OMP_NUM_THREADS=${num_threads} python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --master_port=${MASTER_PORT} ../../train.py \
  $data \
  --selected-cols=${selected_cols} \
  --bpe-dir=${bpe_dir} \
  --user-dir=${user_dir} \
  --restore-file=${restore_file} \
  --reset-optimizer --reset-dataloader --reset-meters \
  --save-dir=${save_path} \
  --neg-sample-dir=${neg_sample_dir} \
  --task=${task} \
  --arch=${arch} \
  --criterion=${criterion} \
  --label-smoothing=${label_smoothing} \
  --batch-size=${batch_size} \
  --update-freq=${update_freq} \
  --encoder-normalize-before \
  --decoder-normalize-before \
  --share-decoder-input-output-embed \
  --share-all-embeddings \
  --layernorm-embedding \
  --patch-layernorm-embedding \
  --code-layernorm-embedding \
  --resnet-drop-path-rate=${resnet_drop_path_rate} \
  --encoder-drop-path-rate=${encoder_drop_path_rate} \
  --decoder-drop-path-rate=${decoder_drop_path_rate} \
  --dropout=${dropout} \
  --attention-dropout=${attention_dropout} \
  --weight-decay=0.01 --optimizer=adam --adam-betas="(0.9,0.999)" --adam-eps=1e-08 --clip-norm=5.0 \
  --lr-scheduler=polynomial_decay --lr=${lr} \
  --max-epoch=${max_epoch} --warmup-ratio=${warmup_ratio} \
  --log-format=simple --log-interval=10 \
  --fixed-validation-seed=7 \
  --keep-last-epochs=15 \
  --save-interval=1 \
  --save-interval-updates=50000 \
  --disable-validation \
  --max-src-length=${max_src_length} \
  --max-tgt-length=${max_tgt_length} \
  --add-type-embedding \
  --scale-attn \
  --scale-fc \
  --scale-heads \
  --disable-entangle \
  --num-bins=${num_bins} \
  --patch-image-size=${patch_image_size} \
  --sample-patch-num=${sample_patch_num} \
  --max-image-size=${max_image_size} \
  --fp16 \
  --fp16-scale-window=128 \
  --num-workers=0

And this is the evaluation script I use:


#!/usr/bin/env bash

# The port for communication. Note that if you want to run multiple tasks on the same machine,
# you need to specify different port numbers.
export MASTER_PORT=8082

user_dir=../../ofa_module
bpe_dir=../../utils/BPE

# val or test
split=$1

data=/media/ssd1/users/jhkim/datasets/mindmap/finetuning/dataset/vqa_data/vqa_${split}.tsv
ans2label_file=../../dataset/vqa_data/trainval_ans2label.pkl
path=../../checkpoints/vqa_last.pt
#path=../../checkpoints/vqa_1_50000.pt
result_path=../../results/vqa_${split}_beam
selected_cols=0,5,2,3,4

CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --nproc_per_node=1 --master_port=${MASTER_PORT} ../../evaluate.py \
    ${data} \
    --path=${path} \
    --user-dir=${user_dir} \
    --task=vqa_gen \
    --batch-size=16 \
    --log-format=simple --log-interval=10 \
    --seed=7 \
    --gen-subset=${split} \
    --results-path=${result_path} \
    --fp16 \
    --ema-eval \
    --beam-search-vqa-eval \
    --beam=5 \
    --unnormalized \
    --temperature=1.0 \
    --num-workers=0 \
    --model-overrides="{\"data\":\"${data}\",\"bpe_dir\":\"${bpe_dir}\",\"selected_cols\":\"${selected_cols}\",\"ans2label_file\":\"${ans2label_file}\"}"
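If you do want to run this script on a checkpoint that was saved without EMA tracking, one possible workaround (just a sketch, not something documented in the repo; paths are placeholders) is to either drop --ema-eval, or copy the plain model weights into extra_state['ema'] so that load_ema_from_checkpoint has something to read:

# Hypothetical workaround: duplicate the plain model weights into extra_state['ema'].
# Only sensible if you accept that no real EMA was tracked during pretraining.
import torch

src = "../../checkpoints/vqa_last.pt"           # placeholder: checkpoint without EMA weights
dst = "../../checkpoints/vqa_last_with_ema.pt"  # placeholder: patched copy

state = torch.load(src, map_location="cpu")
state.setdefault("extra_state", {})
if "ema" not in state["extra_state"]:
    state["extra_state"]["ema"] = state["model"]
torch.save(state, dst)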

With the pretrained models from checkpoints.md, I get a different error:

AssertionError: Could not infer task type from {'_name': 'denoising_unify', 'data': '../../dataset/vqa_data/vqa_val.tsv', 'num_bins': 1000, 'max_image_size': 512, 'no_text_data': False, 'no_image_data': False, 'selected_cols': '0,5,2,3,4', 'text_selected_cols': 'uniq_id,text', 'image_selected_cols': 'image_id,image,text,code,dataset_name', 'detection_selected_cols': 'image_id,image,text', 'bpe_dir': '../../utils/BPE', 'neg_sample_dir': None, 'max_source_positions': 1024, 'max_target_positions': 1024, 'max_src_length': 80, 'max_tgt_length': 30, 'code_dict_size': 8192, 'patch_image_size': 384, 'code_image_size': 128, 'patch_size': 16, 'num_mask_pixels': 23000, 'min_num_pixels': 6144, 'max_num_pixels': 16384, 'min_aspect_ratio': 0.3, 'max_aspect_ratio': None, 'eval_bleu': False, 'eval_cider': False, 'eval_args': '{}', 'eval_print_samples': False, 'eval_cider_cached_tokens': None, 'pretrain_seed': 7, 'mask_ratio': 0.3, 'random_ratio': 0.0, 'keep_ratio': 0.0, 'mask_length': 'span-poisson', 'poisson_lambda': 3.0, 'replace_length': 1, 'prev_output_noise_ratio': 0.0, 'remove_pure_text': False, 'remove_pure_image': False, 'remove_detection': False, 'remove_visual_grounding': False, 'remove_grounded_captioning': False}. Available argparse tasks: dict_keys(['hubert_pretraining', 'sentence_prediction', 'translation', 'online_backtranslation', 'multilingual_translation', 'semisupervised_translation', 'speech_to_text', 'simul_speech_to_text', 'simul_text_to_text', 'sentence_ranking', 'multilingual_masked_lm', 'denoising', 'legacy_masked_lm', 'multilingual_denoising', 'language_modeling', 'text_to_speech', 'frm_text_to_speech', 'audio_pretraining', 'cross_lingual_lm', 'translation_multi_simple_epoch', 'translation_lev', 'translation_from_pretrained_bart', 'audio_finetuning', 'translation_from_pretrained_xlm', 'masked_lm', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt', 'ofa', 'image_classify', 'caption', 'image_gen', 'refcoco', 'snli_ve', 'vqa_gen', 'gigaword', 'sst2', 'qqp', 'qnli', 'mrpc', 'rte', 'mnli', 'cola', 'unify_task', 'unify_speech_text_task']). Available hydra tasks: dict_keys(['hubert_pretraining', 'sentence_prediction', 'translation', 'simul_text_to_text', 'language_modeling', 'audio_pretraining', 'translation_lev', 'audio_finetuning', 'translation_from_pretrained_xlm', 'masked_lm', 'dummy_lm', 'dummy_masked_lm', 'ofa', 'image_classify', 'caption', 'image_gen', 'refcoco', 'snli_ve', 'vqa_gen', 'gigaword', 'sst2', 'qqp', 'qnli', 'mrpc', 'rte', 'mnli', 'cola', 'unify_task', 'unify_speech_text_task'])

I would appreciate any help!

jun297 commented 1 year ago

https://github.com/OFA-Sys/OFA/issues/221#issuecomment-1235182426

It turns out the evaluation script I used only works with finetuned checkpoints.

For pretrained checkpoints, the script run_scripts/vqa/evaluate_vqa_zeroshot.sh works fine.