OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Is it possible to run finetuning on CPU? #200

Open daurmur opened 2 years ago

daurmur commented 2 years ago

Is it possible to run train_caption_stage1.sh on CPU?

JustinLin610 commented 2 years ago

try --cpu?

daurmur commented 2 years ago

It seems to work with 1 thread using this script, train_caption_stage1_medium_1.txt, but with this script, train_caption_stage1_medium.txt, I run into the following troubles: 5_0.06_6000.log, train_stage1.txt. Could you please help me find out where I made a mistake?

JustinLin610 commented 2 years ago

Are you using multiple CPU devices?

JustinLin610 commented 2 years ago

I've never done this before, and I am actually not sure about this... But it seems you are using a single machine, and I believe there is no need to go through torch.distributed.
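
For what it's worth, a plain single-process CPU run might look something like this (a minimal sketch with the flags trimmed down, untested on my side):

# Hypothetical single-process CPU run: call train.py directly,
# without going through torch.distributed.launch.
data_dir=../../dataset/caption_data
python3 ../../train.py \
    ${data_dir}/caption_stage1_train.tsv,${data_dir}/caption_val.tsv \
    --selected-cols=0,4,2 \
    --bpe-dir=../../utils/BPE \
    --user-dir=../../ofa_module \
    --restore-file=../../checkpoints/ofa_medium.pt \
    --task=caption \
    --arch=ofa_medium \
    --criterion=adjust_label_smoothed_cross_entropy \
    --batch-size=8 \
    --cpu \
    --num-workers=0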

daurmur commented 2 years ago

I have a server with 8 CPU cores and I want to utilize 4 of them. To run train_caption_stage1.sh on CPU, does one just need to add --cpu and comment out --fp16 and --fp16-scale-window=512? With the following train_caption_stage1.sh:

#!/usr/bin/env

# The port for communication. Note that if you want to run multiple tasks on the same machine,
# you need to specify different port numbers.
export MASTER_PORT=1061

log_dir=./stage1_logs
save_dir=./stage1_checkpoints
mkdir -p $log_dir $save_dir

bpe_dir=../../utils/BPE
user_dir=../../ofa_module

data_dir=../../dataset/caption_data
data=${data_dir}/caption_stage1_train.tsv,${data_dir}/caption_val.tsv
restore_file=../../checkpoints/ofa_medium.pt
selected_cols=0,4,2

task=caption
arch=ofa_medium
criterion=adjust_label_smoothed_cross_entropy
label_smoothing=0.1
lr=1e-5
max_epoch=5
warmup_ratio=0.06
batch_size=8
update_freq=4
resnet_drop_path_rate=0.0
encoder_drop_path_rate=0.1
decoder_drop_path_rate=0.1
dropout=0.1
attention_dropout=0.0
max_src_length=80
max_tgt_length=20
num_bins=1000
patch_image_size=480
eval_cider_cached=${data_dir}/cider_cached_tokens/coco-valid-words.p
drop_worst_ratio=0.2

for max_epoch in {5,}; do
  echo "max_epoch "${max_epoch}
  for warmup_ratio in {0.06,}; do
    echo "warmup_ratio "${warmup_ratio}
    for drop_worst_after in {6000,}; do
      echo "drop_worst_after "${drop_worst_after}

      log_file=${log_dir}/${max_epoch}"_"${warmup_ratio}"_"${drop_worst_after}".log"
      save_path=${save_dir}/${max_epoch}"_"${warmup_ratio}"_"${drop_worst_after}
      mkdir -p $save_path

      #CUDA_VISIBLE_DEVICES=0,1,2,3 
      python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=${MASTER_PORT} ../../train.py \
          $data \
          --selected-cols=${selected_cols} \
          --bpe-dir=${bpe_dir} \
          --user-dir=${user_dir} \
          --restore-file=${restore_file} \
          --reset-optimizer --reset-dataloader --reset-meters \
          --save-dir=${save_path} \
          --task=${task} \
          --arch=${arch} \
          --criterion=${criterion} \
          --label-smoothing=${label_smoothing} \
          --batch-size=${batch_size} \
          --update-freq=${update_freq} \
          --encoder-normalize-before \
          --decoder-normalize-before \
          --share-decoder-input-output-embed \
          --share-all-embeddings \
          --layernorm-embedding \
          --patch-layernorm-embedding \
          --code-layernorm-embedding \
          --resnet-drop-path-rate=${resnet_drop_path_rate} \
          --encoder-drop-path-rate=${encoder_drop_path_rate} \
          --decoder-drop-path-rate=${decoder_drop_path_rate} \
          --dropout=${dropout} \
          --attention-dropout=${attention_dropout} \
          --weight-decay=0.01 --optimizer=adam --adam-betas="(0.9,0.999)" --adam-eps=1e-08 --clip-norm=1.0 \
          --lr-scheduler=polynomial_decay --lr=${lr} \
          --max-epoch=${max_epoch} \
          --warmup-ratio=${warmup_ratio} \
          --log-format=simple --log-interval=10 \
          --fixed-validation-seed=7 \
          --no-epoch-checkpoints --keep-best-checkpoints=1 \
          --save-interval=1 --validate-interval=1 \
          --save-interval-updates=500 --validate-interval-updates=500 \
          --eval-cider \
          --eval-cider-cached-tokens=${eval_cider_cached} \
          --eval-args='{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}' \
          --best-checkpoint-metric=cider --maximize-best-checkpoint-metric \
          --max-src-length=${max_src_length} \
          --max-tgt-length=${max_tgt_length} \
          --find-unused-parameters \
          --freeze-encoder-embedding \
          --freeze-decoder-embedding \
          --add-type-embedding \
          --scale-attn \
          --scale-fc \
          --scale-heads \
          --disable-entangle \
          --num-bins=${num_bins} \
          --patch-image-size=${patch_image_size} \
          --drop-worst-ratio=${drop_worst_ratio} \
          --drop-worst-after=${drop_worst_after} \
#           --fp16 \
#           --fp16-scale-window=512 \
          --cpu \
          --num-workers=0 > ${log_file} 2>&1
    done
  done
done

I get the following logs:

/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
2022-08-15 11:20:00 - utils.py[line:258] - INFO: distributed init (rank 2): env://
2022-08-15 11:20:00 - utils.py[line:261] - INFO: Start init
Retry: 1, with value error <class 'RuntimeError'>
2022-08-15 11:20:00 - utils.py[line:258] - INFO: distributed init (rank 1): env://
2022-08-15 11:20:00 - utils.py[line:261] - INFO: Start init
Retry: 1, with value error <class 'RuntimeError'>
2022-08-15 11:20:00 - utils.py[line:258] - INFO: distributed init (rank 3): env://
2022-08-15 11:20:00 - utils.py[line:261] - INFO: Start init
Retry: 1, with value error <class 'RuntimeError'>
2022-08-15 11:20:00 - utils.py[line:258] - INFO: distributed init (rank 0): env://
2022-08-15 11:20:00 - utils.py[line:261] - INFO: Start init
Retry: 1, with value error <class 'RuntimeError'>
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 255) local_rank: 0 (pid: 11485) of binary: /opt/conda/envs/vector/bin/python3
Traceback (most recent call last):
  File "/opt/conda/envs/vector/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/vector/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../../train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-08-15_11:20:03
  host      : 6e15b68b8333
  rank      : 1 (local_rank: 1)
  exitcode  : 255 (pid: 11486)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2022-08-15_11:20:03
  host      : 6e15b68b8333
  rank      : 2 (local_rank: 2)
  exitcode  : 255 (pid: 11487)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2022-08-15_11:20:03
  host      : 6e15b68b8333
  rank      : 3 (local_rank: 3)
  exitcode  : 255 (pid: 11488)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-15_11:20:03
  host      : 6e15b68b8333
  rank      : 0 (local_rank: 0)
  exitcode  : 255 (pid: 11485)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
JustinLin610 commented 2 years ago

See if it helps: https://discuss.pytorch.org/t/how-to-use-multi-cpu-or-muti-cpu-core-to-train/147124/5
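
If I read that thread correctly, the two key changes for CPU-only distributed training are hiding the GPUs and switching the distributed backend from nccl to gloo, roughly like this inside your script (a sketch, untested on my side):

# Hypothetical change: CPU-only distributed training over the gloo backend.
export CUDA_VISIBLE_DEVICES=""
python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=${MASTER_PORT} \
    ../../train.py ${data} \
    --distributed-backend=gloo \
    --cpu

Note that the launcher pins OMP_NUM_THREADS to 1 per process by default (see the warning in your log), so with 4 processes on 8 cores you may also want to export OMP_NUM_THREADS=2.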

daurmur commented 2 years ago

Thanks, that gave me some progress. I changed train_caption_stage1.sh to:

#!/usr/bin/env

# The port for communication. Note that if you want to run multiple tasks on the same machine,
# you need to specify different port numbers.
export MASTER_PORT=1061

log_dir=./stage1_logs
save_dir=./stage1_checkpoints
mkdir -p $log_dir $save_dir

bpe_dir=../../utils/BPE
user_dir=../../ofa_module

data_dir=../../dataset/caption_data
data=${data_dir}/caption_stage1_train.tsv,${data_dir}/caption_val.tsv
restore_file=../../checkpoints/ofa_medium.pt
selected_cols=0,4,2

task=caption
arch=ofa_medium
criterion=adjust_label_smoothed_cross_entropy
label_smoothing=0.1
lr=1e-5
max_epoch=5
warmup_ratio=0.06
batch_size=8
update_freq=4
resnet_drop_path_rate=0.0
encoder_drop_path_rate=0.1
decoder_drop_path_rate=0.1
dropout=0.1
attention_dropout=0.0
max_src_length=80
max_tgt_length=20
num_bins=1000
patch_image_size=480
eval_cider_cached=${data_dir}/cider_cached_tokens/coco-valid-words.p
drop_worst_ratio=0.2

for max_epoch in {5,}; do
  echo "max_epoch "${max_epoch}
  for warmup_ratio in {0.06,}; do
    echo "warmup_ratio "${warmup_ratio}
    for drop_worst_after in {6000,}; do
      echo "drop_worst_after "${drop_worst_after}

      log_file=${log_dir}/${max_epoch}"_"${warmup_ratio}"_"${drop_worst_after}".log"
      save_path=${save_dir}/${max_epoch}"_"${warmup_ratio}"_"${drop_worst_after}
      mkdir -p $save_path

      CUDA_VISIBLE_DEVICES="" python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=${MASTER_PORT} ../../train.py \
          $data \
          --distributed-backend=gloo \
          --selected-cols=${selected_cols} \
          --bpe-dir=${bpe_dir} \
          --user-dir=${user_dir} \
          --restore-file=${restore_file} \
          --reset-optimizer --reset-dataloader --reset-meters \
          --save-dir=${save_path} \
          --task=${task} \
          --arch=${arch} \
          --criterion=${criterion} \
          --label-smoothing=${label_smoothing} \
          --batch-size=${batch_size} \
          --update-freq=${update_freq} \
          --encoder-normalize-before \
          --decoder-normalize-before \
          --share-decoder-input-output-embed \
          --share-all-embeddings \
          --layernorm-embedding \
          --patch-layernorm-embedding \
          --code-layernorm-embedding \
          --resnet-drop-path-rate=${resnet_drop_path_rate} \
          --encoder-drop-path-rate=${encoder_drop_path_rate} \
          --decoder-drop-path-rate=${decoder_drop_path_rate} \
          --dropout=${dropout} \
          --attention-dropout=${attention_dropout} \
          --weight-decay=0.01 --optimizer=adam --adam-betas="(0.9,0.999)" --adam-eps=1e-08 --clip-norm=1.0 \
          --lr-scheduler=polynomial_decay --lr=${lr} \
          --max-epoch=${max_epoch} \
          --warmup-ratio=${warmup_ratio} \
          --log-format=simple --log-interval=10 \
          --fixed-validation-seed=7 \
          --no-epoch-checkpoints --keep-best-checkpoints=1 \
          --save-interval=1 --validate-interval=1 \
          --save-interval-updates=500 --validate-interval-updates=500 \
          --eval-cider \
          --eval-cider-cached-tokens=${eval_cider_cached} \
          --eval-args='{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}' \
          --best-checkpoint-metric=cider --maximize-best-checkpoint-metric \
          --max-src-length=${max_src_length} \
          --max-tgt-length=${max_tgt_length} \
          --find-unused-parameters \
          --freeze-encoder-embedding \
          --freeze-decoder-embedding \
          --add-type-embedding \
          --scale-attn \
          --scale-fc \
          --scale-heads \
          --disable-entangle \
          --num-bins=${num_bins} \
          --patch-image-size=${patch_image_size} \
          --drop-worst-ratio=${drop_worst_ratio} \
          --drop-worst-after=${drop_worst_after} \
#           --fp16 \
#           --fp16-scale-window=512 \
          --cpu \
          --num-workers=0 > ${log_file} 2>&1
    done
  done
done

As you can see, I added --cpu and --distributed-backend=gloo, changed CUDA_VISIBLE_DEVICES, and commented out --fp16 and --fp16-scale-window=512. Now I get different errors :D, so that's a step forward. The 5_0.06_6000.log looks like this:

train_caption_stage1_medium.sh: line 108: --cpu: command not found
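
If I had to guess, this is a shell problem rather than an OFA problem: bash does not allow comment lines inside a backslash-continued command. The # swallows the rest of the line, including the trailing backslash, so the python3 command ends at --drop-worst-after and everything from --cpu onward runs as a separate command. A minimal illustration of the same effect:

# The '#' line below silently terminates the continued command:
echo hello \
# --fp16 \
    --cpu
# prints "hello", then fails with "--cpu: command not found"

That would also explain the split between the two files: the > ${log_file} 2>&1 redirection belongs to the stray --cpu command, so the "command not found" message lands in 5_0.06_6000.log, while the actual training output goes to train_stage1.out. And since --cpu never reaches train.py, the config dump below shows 'cpu': False. Deleting the two commented-out --fp16 lines from inside the command should fix it.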

and train_stage1.out looks like this:

max_epoch 5
warmup_ratio 0.06
drop_worst_after 6000
/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
2022-08-16 07:20:20 - utils.py[line:258] - INFO: distributed init (rank 3): env://
2022-08-16 07:20:20 - utils.py[line:261] - INFO: Start init
2022-08-16 07:20:20 - utils.py[line:258] - INFO: distributed init (rank 1): env://
2022-08-16 07:20:20 - utils.py[line:261] - INFO: Start init
2022-08-16 07:20:20 - utils.py[line:258] - INFO: distributed init (rank 2): env://
2022-08-16 07:20:20 - utils.py[line:261] - INFO: Start init
2022-08-16 07:20:20 - utils.py[line:258] - INFO: distributed init (rank 0): env://
2022-08-16 07:20:20 - utils.py[line:261] - INFO: Start init
2022-08-16 07:20:20 - distributed_c10d.py[line:228] - INFO: Added key: store_based_barrier_key:1 to store for rank: 1
2022-08-16 07:20:20 - distributed_c10d.py[line:228] - INFO: Added key: store_based_barrier_key:1 to store for rank: 2
2022-08-16 07:20:20 - distributed_c10d.py[line:228] - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
2022-08-16 07:20:20 - distributed_c10d.py[line:263] - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2022-08-16 07:20:20 - distributed_c10d.py[line:228] - INFO: Added key: store_based_barrier_key:1 to store for rank: 3
2022-08-16 07:20:20 - utils.py[line:274] - INFO: initialized host 6e15b68b8333 as rank 0
single-machine distributed training is initialized.
2022-08-16 07:20:20 - distributed_c10d.py[line:263] - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2022-08-16 07:20:20 - utils.py[line:274] - INFO: initialized host 6e15b68b8333 as rank 3
single-machine distributed training is initialized.
2022-08-16 07:20:20 - train.py[line:77] - INFO: {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 10, 'log_format': 'simple', 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': '../../ofa_module', 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 4, 'distributed_num_procs': 1, 'distributed_rank': 0, 'distributed_backend': 'gloo', 'distributed_init_method': 'env://', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'pytorch_ddp', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': True, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 1, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': False, 'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': 8, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 500, 'validate_after_updates': 0, 'fixed_validation_seed': 7, 'disable_validation': False, 'max_tokens_valid': None, 'batch_size_valid': 8, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 5, 'max_update': 0, 'stop_time_hours': 0.0, 'clip_norm': 1.0, 'sentence_avg': False, 'update_freq': [4], 'lr': [1e-05], 'stop_min_lr': -1.0, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': './stage1_checkpoints/5_0.06_6000', 'restore_file': '../../checkpoints/ofa_medium.pt', 'finetune_from_model': None, 'reset_dataloader': True, 'reset_lr_scheduler': False, 'reset_meters': True, 'reset_optimizer': True, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 500, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': 1, 'no_save': False, 'no_epoch_checkpoints': True, 'no_last_checkpoints': 
False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'cider', 'maximize_best_checkpoint_metric': True, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 'model_parallel_size': 1, 'use_ema_weights_to_init_param': False, 'use_latest_weights_to_init_ema': False}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(_name='ofa_medium', activation_fn='gelu', adam_betas='(0.9,0.999)', adam_eps=1e-08, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, add_type_embedding=True, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, arch='ofa_medium', attention_dropout=0.0, attn_scale_factor=2, azureml_logging=False, batch_size=8, batch_size_valid=8, best_checkpoint_metric='cider', bf16=False, bpe=None, bpe_dir='../../utils/BPE', broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, code_dict_size=8192, code_image_size=128, code_layernorm_embedding=True, combine_valid_subsets=None, constraint_range=None, cpu=False, cpu_offload=False, criterion='adjust_label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='../../dataset/caption_data/caption_stage1_train.tsv,../../dataset/caption_data/caption_val.tsv', data_buffer_size=10, dataset_impl=None, ddp_backend='pytorch_ddp', ddp_comm_hook='none', decoder_attention_heads=8, decoder_drop_path_rate=0.1, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=4, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_normalize_before=True, decoder_output_dim=512, device_id=0, disable_entangle=True, disable_validation=False, distributed_backend='gloo', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1, drop_worst_after=6000, drop_worst_ratio=0.2, dropout=0.1, ema_decay=0.9999, ema_fp32=False, ema_seed_model=None, ema_start_update=0, ema_update_freq=1, empty_cache_freq=0, encoder_attention_heads=8, encoder_drop_path_rate=0.1, encoder_embed_dim=512, encoder_embed_path=None, 
encoder_ffn_embed_dim=2048, encoder_layerdrop=0, encoder_layers=4, encoder_layers_to_keep=None, encoder_learned_pos=True, encoder_normalize_before=True, end_learning_rate=0.0, entangle_position_embedding=False, eos=2, eval_args='{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}', eval_bleu=False, eval_cider=True, eval_cider_cached_tokens='../../dataset/caption_data/cider_cached_tokens/coco-valid-words.p', eval_print_samples=False, fast_stat_sync=False, find_unused_parameters=True, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=7, force_anneal=None, fp16=False, fp16_adam_stats=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, freeze_decoder_embedding=True, freeze_encoder_embedding=True, gen_subset='test', gradient_as_bucket_view=False, heartbeat_timeout=-1, ignore_eos=False, ignore_prefix_size=0, ignore_unused_valid_subsets=False, image_bucket_size=42, imagenet_default_mean_and_std=False, keep_best_checkpoints=1, keep_interval_updates=-1, keep_interval_updates_pattern=-1, keep_last_epochs=-1, label_smoothing=0.1, layernorm_embedding=True, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_file=None, log_format='simple', log_interval=10, lr=[1e-05], lr_scheduler='polynomial_decay', max_epoch=5, max_source_positions=1024, max_src_length=80, max_target_positions=1024, max_tgt_length=20, max_tokens=None, max_tokens_valid=None, max_update=0, max_valid_steps=None, maximize_best_checkpoint_metric=True, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_params_to_wrap=100000000, model_parallel_size=1, no_cross_attention=False, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_reshard_after_forward=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=True, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=1, num_bins=1000, num_shards=1, num_workers=1, on_cpu_convert_precision=False, optimizer='adam', optimizer_overrides='{}', orig_patch_image_size=256, pad=1, patch_image_size=480, patch_layernorm_embedding=True, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plasma_path='/tmp/plasma', pooler_activation_fn='tanh', pooler_classifier='mlp', pooler_dropout=0.0, power=1.0, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, reg_alpha=1.0, relu_dropout=0.0, report_accuracy=False, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=True, reset_logging=False, reset_lr_scheduler=False, reset_meters=True, reset_optimizer=True, resnet_drop_path_rate=0.0, resnet_type='resnet101', restore_file='../../checkpoints/ofa_medium.pt', sample_patch_num=196, save_dir='./stage1_checkpoints/5_0.06_6000', save_interval=1, save_interval_updates=500, scale_attn=True, scale_fc=True, scale_heads=True, scale_resids=False, scoring='bleu', scst=False, scst_args='{}', seed=1, selected_cols='0,4,2', sentence_avg=False, shard_id=0, share_all_embeddings=True, share_decoder_input_output_embed=True, simul_type=None, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, stop_min_lr=-1.0, stop_time_hours=0, store_ema=False, suppress_crashes=False, sync_bn=False, 
task='caption', tensorboard_logdir=None, threshold_loss_scale=None, token_bucket_size=256, tokenizer=None, total_num_update=1000000, tpu=False, train_subset='train', unk=3, update_freq=[4], use_bmuf=False, use_ema_weights_to_init_param=False, use_latest_weights_to_init_ema=False, use_old_adam=False, use_plasma_view=False, use_rdrop=False, use_sharded_state=False, user_dir='../../ofa_module', valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=500, wandb_project=None, warmup_ratio=0.06, warmup_updates=0, weight_decay=0.01, write_checkpoints_asynchronously=False, zero_sharding='none'), 'task': {'_name': 'caption', 'data': '../../dataset/caption_data/caption_stage1_train.tsv,../../dataset/caption_data/caption_val.tsv', 'selected_cols': '0,4,2', 'bpe': None, 'bpe_dir': '../../utils/BPE', 'max_source_positions': 1024, 'max_target_positions': 1024, 'max_src_length': 80, 'max_tgt_length': 20, 'code_dict_size': 8192, 'patch_image_size': 480, 'orig_patch_image_size': 256, 'num_bins': 1000, 'imagenet_default_mean_and_std': False, 'constraint_range': None, 'eval_bleu': False, 'eval_cider': True, 'eval_args': '{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}', 'eval_print_samples': False, 'eval_cider_cached_tokens': '../../dataset/caption_data/cider_cached_tokens/coco-valid-words.p', 'scst': False, 'scst_args': '{}'}, 'criterion': {'_name': 'adjust_label_smoothed_cross_entropy', 'label_smoothing': 0.1, 'report_accuracy': False, 'ignore_prefix_size': 0, 'ignore_eos': False, 'sentence_avg': False, 'drop_worst_ratio': 0.2, 'drop_worst_after': 6000, 'use_rdrop': False, 'reg_alpha': 1.0, 'sample_patch_num': 196, 'constraint_range': None}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9,0.999)', 'adam_eps': 1e-08, 'weight_decay': 0.01, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [1e-05]}, 'lr_scheduler': {'_name': 'polynomial_decay', 'warmup_updates': 0, 'warmup_ratio': 0.06, 'force_anneal': None, 'end_learning_rate': 0.0, 'power': 1.0, 'total_num_update': 1000000.0, 'lr': [1e-05]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 'ema_fp32': False}, 'simul_type': None}
2022-08-16 07:20:20 - distributed_c10d.py[line:263] - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2022-08-16 07:20:20 - distributed_c10d.py[line:263] - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2022-08-16 07:20:20 - utils.py[line:274] - INFO: initialized host 6e15b68b8333 as rank 1
single-machine distributed training is initialized.
2022-08-16 07:20:20 - utils.py[line:274] - INFO: initialized host 6e15b68b8333 as rank 2
single-machine distributed training is initialized.
2022-08-16 07:20:20 - ofa_task.py[line:109] - INFO: source dictionary: 59457 types
2022-08-16 07:20:20 - ofa_task.py[line:110] - INFO: target dictionary: 59457 types
/opt/conda/envs/vector/lib/python3.7/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/opt/conda/envs/vector/lib/python3.7/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/opt/conda/envs/vector/lib/python3.7/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/opt/conda/envs/vector/lib/python3.7/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 2 begin to initialize row_count and line_idx-to-offset mapping
2022-08-16 07:20:24 - train.py[line:101] - INFO: OFAModel(
  (encoder): TransformerEncoder(
    (dropout_module): FairseqDropout()
    (embed_tokens): Embedding(59457, 512, padding_idx=1)
    (layernorm_embedding): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (type_embedding): Embedding(2, 512)
    (embed_images): ResNet(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
          (downsample): Sequential(
            (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
          (drop_path): Identity()
        )
        (1): Bottleneck(
          (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
          (drop_path): Identity()
        )
        (2): Bottleneck(
          (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
          (drop_path): Identity()
        )
      )
      (layer2): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
          (downsample): Sequential(
            (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
            (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
          (drop_path): Identity()
        )
        (1): Bottleneck(
          (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
          (drop_path): Identity()
        )
        [... Bottleneck blocks (2) and (3), identical to block (1) above, elided ...]
      )
      (layer3): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
          (downsample): Sequential(
            (0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
            (1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
          (drop_path): Identity()
        )
        (1): Bottleneck(
          (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
          (drop_path): Identity()
        )
        [... Bottleneck blocks (2) through (22), identical to block (1) above, elided ...]
      )
    )
    (image_proj): Linear(in_features=1024, out_features=512, bias=True)
    (patch_layernorm_embedding): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (embed_positions): Embedding(1026, 512)
    (embed_image_positions): Embedding(1765, 512)
    (pos_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (image_pos_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (pos_q_linear): Linear(in_features=512, out_features=512, bias=True)
    (pos_k_linear): Linear(in_features=512, out_features=512, bias=True)
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
        (activation_dropout_module): FairseqDropout()
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (drop_path): Identity()
      )
      (1): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
        (activation_dropout_module): FairseqDropout()
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (drop_path): DropPath(p=0.03333333507180214)
      )
      (2): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
        (activation_dropout_module): FairseqDropout()
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (drop_path): DropPath(p=0.06666666269302368)
      )
      (3): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
        (activation_dropout_module): FairseqDropout()
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (drop_path): DropPath(p=0.10000000149011612)
      )
    )
    (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (token_rel_pos_table_list): ModuleList(
      (0): Embedding(511, 8)
      (1): Embedding(511, 8)
      (2): Embedding(511, 8)
      (3): Embedding(511, 8)
    )
    (image_rel_pos_table_list): ModuleList(
      (0): Embedding(6892, 8)
      (1): Embedding(6892, 8)
      (2): Embedding(6892, 8)
      (3): Embedding(6892, 8)
    )
  )
  (decoder): TransformerDecoder(
    (dropout_module): FairseqDropout()
    (embed_tokens): Embedding(59457, 512, padding_idx=1)
    (layernorm_embedding): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (embed_positions): Embedding(1026, 512)
    (embed_image_positions): Embedding(1765, 512)
    (pos_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (image_pos_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (self_pos_q_linear): Linear(in_features=512, out_features=512, bias=True)
    (self_pos_k_linear): Linear(in_features=512, out_features=512, bias=True)
    (cross_pos_q_linear): Linear(in_features=512, out_features=512, bias=True)
    (cross_pos_k_linear): Linear(in_features=512, out_features=512, bias=True)
    (code_layernorm_embedding): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (layers): ModuleList(
      (0): TransformerDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (self_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (cross_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (drop_path): Identity()
      )
      (1): TransformerDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (self_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (cross_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (drop_path): DropPath(p=0.03333333507180214)
      )
      (2): TransformerDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (self_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (cross_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (drop_path): DropPath(p=0.06666666269302368)
      )
      (3): TransformerDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (self_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (cross_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=512, out_features=512, bias=True)
          (v_proj): Linear(in_features=512, out_features=512, bias=True)
          (q_proj): Linear(in_features=512, out_features=512, bias=True)
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (drop_path): DropPath(p=0.10000000149011612)
      )
    )
    (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (output_projection): Linear(in_features=512, out_features=59457, bias=False)
    (token_rel_pos_table_list): ModuleList(
      (0): Embedding(511, 8)
      (1): Embedding(511, 8)
      (2): Embedding(511, 8)
      (3): Embedding(511, 8)
    )
    (image_rel_pos_table_list): ModuleList(
      (0): Embedding(6892, 8)
      (1): Embedding(6892, 8)
      (2): Embedding(6892, 8)
      (3): Embedding(6892, 8)
    )
  )
  (classification_heads): ModuleDict()
)
2022-08-16 07:20:24 - train.py[line:102] - INFO: task: CaptionTask
2022-08-16 07:20:24 - train.py[line:103] - INFO: model: OFAModel
2022-08-16 07:20:24 - train.py[line:104] - INFO: criterion: AdjustLabelSmoothedCrossEntropyCriterion
2022-08-16 07:20:24 - train.py[line:108] - INFO: num. shared model params: 92,892,000 (num. trained: 62,450,016)
2022-08-16 07:20:24 - train.py[line:115] - INFO: num. expert model params: 0 (num. trained: 0)
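(For reference, the two counts in the `num. shared model params` line can be reproduced by summing parameter sizes; the sketch below only approximates what fairseq logs here and is not its exact code:)

```python
# Rough sketch (an approximation of the logged counts, not fairseq's code).
# nn.Module.parameters() yields each shared tensor once, so tied embeddings
# are not double-counted in "num. shared model params".
def count_params(model):
    total = sum(p.numel() for p in model.parameters())
    trained = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trained  # expected to match the 92,892,000 / 62,450,016 line above
```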
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 0 begin to initialize row_count and line_idx-to-offset mapping
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 3 begin to initialize row_count and line_idx-to-offset mapping
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 1 begin to initialize row_count and line_idx-to-offset mapping
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 2 finished initializing row_count and line_idx-to-offset mapping
file ../../dataset/caption_data/caption_val.tsv slice_id 2 row count 1250 total row count 5000
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 0 finished initializing row_count and line_idx-to-offset mapping
file ../../dataset/caption_data/caption_val.tsv slice_id 0 row count 1250 total row count 5000
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 3 finished initializing row_count and line_idx-to-offset mapping
file ../../dataset/caption_data/caption_val.tsv slice_id 3 row count 1250 total row count 5000
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 1 finished initializing row_count and line_idx-to-offset mapping
file ../../dataset/caption_data/caption_val.tsv slice_id 1 row count 1250 total row count 5000
/opt/conda/envs/vector/lib/python3.7/site-packages/torchvision/transforms/transforms.py:333: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
  "Argument interpolation should be of type InterpolationMode instead of int. "
(the same UserWarning is emitted once by each of the four workers)
2022-08-16 07:20:24 - distributed_c10d.py[line:228] - INFO: Added key: store_based_barrier_key:2 to store for rank: 0
2022-08-16 07:20:24 - distributed_c10d.py[line:263] - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer1.0.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer1.0.conv2.bias
    ... (~90 similar lines omitted: every conv1/conv2/conv3 bias and every downsample.0 bias in encoder.embed_images.layer1 through encoder.embed_images.layer3 is reported as shared with encoder.embed_images.conv1.bias) ...
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- decoder.output_projection.bias
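(For context: these trainer.py lines report pairs of parameter names that resolve to the same underlying tensor, e.g. tied input/output embeddings. A minimal sketch of how such aliases can be found, assuming nothing about fairseq's actual implementation:)

```python
# Hypothetical alias detector (illustrative only, not fairseq's code):
# two names are "shared" when they point at the same Parameter object.
import torch

def detect_shared_parameters(model: torch.nn.Module):
    first_owner = {}  # id(param) -> first name seen for that tensor
    for mod_name, module in model.named_modules():
        for p_name, param in module.named_parameters(recurse=False):
            full_name = f"{mod_name}.{p_name}" if mod_name else p_name
            if id(param) in first_owner:
                yield first_owner[id(param)], full_name  # (canonical, alias)
            else:
                first_owner[id(param)] = full_name
```

Run over the model above, this would yield pairs like `('encoder.embed_tokens.weight', 'decoder.embed_tokens.weight')`, matching the log.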
2022-08-16 07:20:24 - train.py[line:145] - INFO: training on 4 devices (GPUs/TPUs)
2022-08-16 07:20:24 - train.py[line:151] - INFO: max tokens per device = None and max sentences per device = 8
2022-08-16 07:20:24 - trainer.py[line:458] - INFO: Preparing to load checkpoint ../../checkpoints/ofa_medium.pt
Traceback (most recent call last):
  File "/home/da-vector/da-vector-kaspi-marketplace/text-generation/OFA/trainer.py", line 511, in load_checkpoint
    self.model.load_state_dict(
  File "/home/da-vector/da-vector-kaspi-marketplace/text-generation/OFA/trainer.py", line 257, in model
    device=self.device,
  File "/home/da-vector/da-vector-kaspi-marketplace/text-generation/OFA/fairseq/fairseq/models/distributed_fairseq_model.py", line 66, in DistributedFairseqModel
    gradient_as_bucket_view=args.gradient_as_bucket_view,
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
    {p.device for p in module.parameters()},
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 674, in _log_and_throw
    raise err_type(err_msg)
ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cpu')}.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../../train.py", line 528, in <module>
    cli_main()
  File "../../train.py", line 521, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/da-vector/da-vector-kaspi-marketplace/text-generation/OFA/fairseq/fairseq/distributed/utils.py", line 374, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/home/da-vector/da-vector-kaspi-marketplace/text-generation/OFA/fairseq/fairseq/distributed/utils.py", line 348, in distributed_main
    main(cfg, **kwargs)
  File "../../train.py", line 161, in main
    disable_iterator_cache=True,
  File "/home/da-vector/da-vector-kaspi-marketplace/text-generation/OFA/utils/checkpoint_utils.py", line 254, in load_checkpoint
    reset_meters=reset_meters,
  File "/home/da-vector/da-vector-kaspi-marketplace/text-generation/OFA/trainer.py", line 526, in load_checkpoint
    "please ensure that the architectures match.".format(filename)
Exception: Cannot load model parameters from checkpoint ../../checkpoints/ofa_medium.pt; please ensure that the architectures match.

(The output of the four ranks was interleaved above; each rank raises the same ValueError/Exception pair, differing only in device_ids [0], [1], [2], [3] and the matching output_device.)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 15221) of binary: /opt/conda/envs/vector/bin/python3
Traceback (most recent call last):
  File "/opt/conda/envs/vector/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/vector/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../../train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-08-16_07:20:33
  host      : 6e15b68b8333
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 15222)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2022-08-16_07:20:33
  host      : 6e15b68b8333
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 15223)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2022-08-16_07:20:33
  host      : 6e15b68b8333
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 15224)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-16_07:20:33
  host      : 6e15b68b8333
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 15221)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
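Reading the tracebacks, the "Cannot load model parameters ... architectures match" message looks secondary: it is only the generic wrapper that trainer.py raises when checkpoint loading fails for any reason. The primary failure is the ValueError raised while wrapping the model, because every launched rank passes device_ids/output_device to DistributedDataParallel while the model's parameters are still on the CPU. The snippet below is a minimal standalone reproduction of that constraint, not OFA code; the gloo backend and the single-process group are assumptions made only for illustration:

```python
# Minimal sketch (not OFA code): why DDP rejects device_ids for a CPU model,
# and what a CPU-side wrap looks like. Assumes a single machine and the
# CPU-capable "gloo" backend; address/port are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 4)  # parameters stay on the CPU

try:
    # Mirrors what happens to every rank in the log: device_ids/output_device
    # are set, but the module's parameters are CPU tensors.
    DDP(model, device_ids=[0], output_device=0)
except ValueError as err:
    print("fails exactly as in the log:", err)

# For a CPU module, DDP must be constructed without device_ids/output_device:
wrapped = DDP(model)
print("CPU DDP wrap succeeded:", type(wrapped).__name__)

dist.destroy_process_group()
```

So the checkpoint itself is most likely fine; it is the construction of the DistributedFairseqModel wrapper that breaks under a multi-process CPU launch, since fairseq fills in device_ids/output_device for each rank.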