daurmur opened this issue 2 years ago
Try --cpu?
It seems to work for 1 thread with this script, train_caption_stage1_medium_1.txt, but for this script, train_caption_stage1_medium.txt, I have the following troubles: 5_0.06_6000.log, train_stage1.txt. Could you please help me find out where I made a mistake?
Are you using multiple CPU devices?
I've never done this before, and I am actually not sure about this... But it seems you are using one device, and I believe there is no need to use torch.distributed.
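If a single process is enough, a minimal launcher-free sketch might look like the following (all values are taken from the script further below; --restore-file and the remaining tuning flags can be appended the same way):

# Sketch: single-process CPU training, skipping torch.distributed.launch entirely.
python3 ../../train.py ${data} \
    --task=caption \
    --arch=ofa_medium \
    --bpe-dir=../../utils/BPE \
    --user-dir=../../ofa_module \
    --selected-cols=0,4,2 \
    --criterion=adjust_label_smoothed_cross_entropy \
    --optimizer=adam --lr=1e-5 \
    --batch-size=8 \
    --cpu \
    --num-workers=0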
I have a server with 8 CPU cores and I want to utilize 4 of them. In order to run train_caption_stage1.sh on CPU, does one just need to add --cpu and comment out --fp16 and --fp16-scale-window=512?
With the following train_caption_stage1.sh:
#!/usr/bin/env bash
# The port for communication. Note that if you want to run multiple tasks on the same machine,
# you need to specify different port numbers.
export MASTER_PORT=1061
log_dir=./stage1_logs
save_dir=./stage1_checkpoints
mkdir -p $log_dir $save_dir
bpe_dir=../../utils/BPE
user_dir=../../ofa_module
data_dir=../../dataset/caption_data
data=${data_dir}/caption_stage1_train.tsv,${data_dir}/caption_val.tsv
restore_file=../../checkpoints/ofa_medium.pt
selected_cols=0,4,2
task=caption
arch=ofa_medium
criterion=adjust_label_smoothed_cross_entropy
label_smoothing=0.1
lr=1e-5
max_epoch=5
warmup_ratio=0.06
batch_size=8
update_freq=4
resnet_drop_path_rate=0.0
encoder_drop_path_rate=0.1
decoder_drop_path_rate=0.1
dropout=0.1
attention_dropout=0.0
max_src_length=80
max_tgt_length=20
num_bins=1000
patch_image_size=480
eval_cider_cached=${data_dir}/cider_cached_tokens/coco-valid-words.p
drop_worst_ratio=0.2
for max_epoch in {5,}; do
echo "max_epoch "${max_epoch}
for warmup_ratio in {0.06,}; do
echo "warmup_ratio "${warmup_ratio}
for drop_worst_after in {6000,}; do
echo "drop_worst_after "${drop_worst_after}
log_file=${log_dir}/${max_epoch}"_"${warmup_ratio}"_"${drop_worst_after}".log"
save_path=${save_dir}/${max_epoch}"_"${warmup_ratio}"_"${drop_worst_after}
mkdir -p $save_path
#CUDA_VISIBLE_DEVICES=0,1,2,3
python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=${MASTER_PORT} ../../train.py \
$data \
--selected-cols=${selected_cols} \
--bpe-dir=${bpe_dir} \
--user-dir=${user_dir} \
--restore-file=${restore_file} \
--reset-optimizer --reset-dataloader --reset-meters \
--save-dir=${save_path} \
--task=${task} \
--arch=${arch} \
--criterion=${criterion} \
--label-smoothing=${label_smoothing} \
--batch-size=${batch_size} \
--update-freq=${update_freq} \
--encoder-normalize-before \
--decoder-normalize-before \
--share-decoder-input-output-embed \
--share-all-embeddings \
--layernorm-embedding \
--patch-layernorm-embedding \
--code-layernorm-embedding \
--resnet-drop-path-rate=${resnet_drop_path_rate} \
--encoder-drop-path-rate=${encoder_drop_path_rate} \
--decoder-drop-path-rate=${decoder_drop_path_rate} \
--dropout=${dropout} \
--attention-dropout=${attention_dropout} \
--weight-decay=0.01 --optimizer=adam --adam-betas="(0.9,0.999)" --adam-eps=1e-08 --clip-norm=1.0 \
--lr-scheduler=polynomial_decay --lr=${lr} \
--max-epoch=${max_epoch} \
--warmup-ratio=${warmup_ratio} \
--log-format=simple --log-interval=10 \
--fixed-validation-seed=7 \
--no-epoch-checkpoints --keep-best-checkpoints=1 \
--save-interval=1 --validate-interval=1 \
--save-interval-updates=500 --validate-interval-updates=500 \
--eval-cider \
--eval-cider-cached-tokens=${eval_cider_cached} \
--eval-args='{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}' \
--best-checkpoint-metric=cider --maximize-best-checkpoint-metric \
--max-src-length=${max_src_length} \
--max-tgt-length=${max_tgt_length} \
--find-unused-parameters \
--freeze-encoder-embedding \
--freeze-decoder-embedding \
--add-type-embedding \
--scale-attn \
--scale-fc \
--scale-heads \
--disable-entangle \
--num-bins=${num_bins} \
--patch-image-size=${patch_image_size} \
--drop-worst-ratio=${drop_worst_ratio} \
--drop-worst-after=${drop_worst_after} \
# --fp16 \
# --fp16-scale-window=512 \
--cpu \
--num-workers=0 > ${log_file} 2>&1
done
done
done
I get the following logs:
/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
2022-08-15 11:20:00 - utils.py[line:258] - INFO: distributed init (rank 2): env://
2022-08-15 11:20:00 - utils.py[line:261] - INFO: Start init
Retry: 1, with value error <class 'RuntimeError'>
2022-08-15 11:20:00 - utils.py[line:258] - INFO: distributed init (rank 1): env://
2022-08-15 11:20:00 - utils.py[line:261] - INFO: Start init
Retry: 1, with value error <class 'RuntimeError'>
2022-08-15 11:20:00 - utils.py[line:258] - INFO: distributed init (rank 3): env://
2022-08-15 11:20:00 - utils.py[line:261] - INFO: Start init
Retry: 1, with value error <class 'RuntimeError'>
2022-08-15 11:20:00 - utils.py[line:258] - INFO: distributed init (rank 0): env://
2022-08-15 11:20:00 - utils.py[line:261] - INFO: Start init
Retry: 1, with value error <class 'RuntimeError'>
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 255) local_rank: 0 (pid: 11485) of binary: /opt/conda/envs/vector/bin/python3
Traceback (most recent call last):
File "/opt/conda/envs/vector/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/envs/vector/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
)(*cmd_args)
File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
../../train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2022-08-15_11:20:03
host : 6e15b68b8333
rank : 1 (local_rank: 1)
exitcode : 255 (pid: 11486)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2022-08-15_11:20:03
host : 6e15b68b8333
rank : 2 (local_rank: 2)
exitcode : 255 (pid: 11487)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2022-08-15_11:20:03
host : 6e15b68b8333
rank : 3 (local_rank: 3)
exitcode : 255 (pid: 11488)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-08-15_11:20:03
host : 6e15b68b8333
rank : 0 (local_rank: 0)
exitcode : 255 (pid: 11485)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
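These retries are consistent with the launcher bringing up four processes that try to initialize the default distributed backend (nccl) on a machine with no visible GPU; for CPU-only training, gloo is the backend that can initialize. A quick way to check that a four-process gloo group comes up at all, independent of the training script (check_gloo.py is a hypothetical helper, not part of the repo):

cat > check_gloo.py <<'EOF'
import torch.distributed as dist

# torch.distributed.launch exports MASTER_ADDR, MASTER_PORT, RANK and
# WORLD_SIZE, so env:// initialization needs no extra arguments.
dist.init_process_group(backend="gloo", init_method="env://")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")
dist.destroy_process_group()
EOF
CUDA_VISIBLE_DEVICES="" python3 -m torch.distributed.launch --nproc_per_node=4 \
    --master_port=1061 check_gloo.py

Note also the OMP_NUM_THREADS warning above: the launcher pins each of the 4 processes to a single OpenMP thread by default, so only 4 of the 8 cores would do compute; exporting OMP_NUM_THREADS=2 before launching should let the 4 processes use all 8 cores.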
Thanks, it has given me some progress. I changed train_caption_stage1.sh to:
#!/usr/bin/env bash
# The port for communication. Note that if you want to run multiple tasks on the same machine,
# you need to specify different port numbers.
export MASTER_PORT=1061
log_dir=./stage1_logs
save_dir=./stage1_checkpoints
mkdir -p $log_dir $save_dir
bpe_dir=../../utils/BPE
user_dir=../../ofa_module
data_dir=../../dataset/caption_data
data=${data_dir}/caption_stage1_train.tsv,${data_dir}/caption_val.tsv
restore_file=../../checkpoints/ofa_medium.pt
selected_cols=0,4,2
task=caption
arch=ofa_medium
criterion=adjust_label_smoothed_cross_entropy
label_smoothing=0.1
lr=1e-5
max_epoch=5
warmup_ratio=0.06
batch_size=8
update_freq=4
resnet_drop_path_rate=0.0
encoder_drop_path_rate=0.1
decoder_drop_path_rate=0.1
dropout=0.1
attention_dropout=0.0
max_src_length=80
max_tgt_length=20
num_bins=1000
patch_image_size=480
eval_cider_cached=${data_dir}/cider_cached_tokens/coco-valid-words.p
drop_worst_ratio=0.2
for max_epoch in {5,}; do
echo "max_epoch "${max_epoch}
for warmup_ratio in {0.06,}; do
echo "warmup_ratio "${warmup_ratio}
for drop_worst_after in {6000,}; do
echo "drop_worst_after "${drop_worst_after}
log_file=${log_dir}/${max_epoch}"_"${warmup_ratio}"_"${drop_worst_after}".log"
save_path=${save_dir}/${max_epoch}"_"${warmup_ratio}"_"${drop_worst_after}
mkdir -p $save_path
CUDA_VISIBLE_DEVICES="" python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=${MASTER_PORT} ../../train.py \
$data \
--distributed-backend=gloo \
--selected-cols=${selected_cols} \
--bpe-dir=${bpe_dir} \
--user-dir=${user_dir} \
--restore-file=${restore_file} \
--reset-optimizer --reset-dataloader --reset-meters \
--save-dir=${save_path} \
--task=${task} \
--arch=${arch} \
--criterion=${criterion} \
--label-smoothing=${label_smoothing} \
--batch-size=${batch_size} \
--update-freq=${update_freq} \
--encoder-normalize-before \
--decoder-normalize-before \
--share-decoder-input-output-embed \
--share-all-embeddings \
--layernorm-embedding \
--patch-layernorm-embedding \
--code-layernorm-embedding \
--resnet-drop-path-rate=${resnet_drop_path_rate} \
--encoder-drop-path-rate=${encoder_drop_path_rate} \
--decoder-drop-path-rate=${decoder_drop_path_rate} \
--dropout=${dropout} \
--attention-dropout=${attention_dropout} \
--weight-decay=0.01 --optimizer=adam --adam-betas="(0.9,0.999)" --adam-eps=1e-08 --clip-norm=1.0 \
--lr-scheduler=polynomial_decay --lr=${lr} \
--max-epoch=${max_epoch} \
--warmup-ratio=${warmup_ratio} \
--log-format=simple --log-interval=10 \
--fixed-validation-seed=7 \
--no-epoch-checkpoints --keep-best-checkpoints=1 \
--save-interval=1 --validate-interval=1 \
--save-interval-updates=500 --validate-interval-updates=500 \
--eval-cider \
--eval-cider-cached-tokens=${eval_cider_cached} \
--eval-args='{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}' \
--best-checkpoint-metric=cider --maximize-best-checkpoint-metric \
--max-src-length=${max_src_length} \
--max-tgt-length=${max_tgt_length} \
--find-unused-parameters \
--freeze-encoder-embedding \
--freeze-decoder-embedding \
--add-type-embedding \
--scale-attn \
--scale-fc \
--scale-heads \
--disable-entangle \
--num-bins=${num_bins} \
--patch-image-size=${patch_image_size} \
--drop-worst-ratio=${drop_worst_ratio} \
--drop-worst-after=${drop_worst_after} \
# --fp16 \
# --fp16-scale-window=512 \
--cpu \
--num-workers=0 > ${log_file} 2>&1
done
done
done
As you can see, I added --cpu and --distributed-backend=gloo, changed CUDA_VISIBLE_DEVICES, and commented out --fp16 and --fp16-scale-window=512. Now I get different errors :D, so that's a step forward.
The 5_0.06_6000.log looks like this:
train_caption_stage1_medium.sh: line 108: --cpu: command not found
and train_stage1.out looks like this:
max_epoch 5
warmup_ratio 0.06
drop_worst_after 6000
/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
2022-08-16 07:20:20 - utils.py[line:258] - INFO: distributed init (rank 3): env://
2022-08-16 07:20:20 - utils.py[line:261] - INFO: Start init
2022-08-16 07:20:20 - utils.py[line:258] - INFO: distributed init (rank 1): env://
2022-08-16 07:20:20 - utils.py[line:261] - INFO: Start init
2022-08-16 07:20:20 - utils.py[line:258] - INFO: distributed init (rank 2): env://
2022-08-16 07:20:20 - utils.py[line:261] - INFO: Start init
2022-08-16 07:20:20 - utils.py[line:258] - INFO: distributed init (rank 0): env://
2022-08-16 07:20:20 - utils.py[line:261] - INFO: Start init
2022-08-16 07:20:20 - distributed_c10d.py[line:228] - INFO: Added key: store_based_barrier_key:1 to store for rank: 1
2022-08-16 07:20:20 - distributed_c10d.py[line:228] - INFO: Added key: store_based_barrier_key:1 to store for rank: 2
2022-08-16 07:20:20 - distributed_c10d.py[line:228] - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
2022-08-16 07:20:20 - distributed_c10d.py[line:263] - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2022-08-16 07:20:20 - distributed_c10d.py[line:228] - INFO: Added key: store_based_barrier_key:1 to store for rank: 3
2022-08-16 07:20:20 - utils.py[line:274] - INFO: initialized host 6e15b68b8333 as rank 0
single-machine distributed training is initialized.
2022-08-16 07:20:20 - distributed_c10d.py[line:263] - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2022-08-16 07:20:20 - utils.py[line:274] - INFO: initialized host 6e15b68b8333 as rank 3
single-machine distributed training is initialized.
2022-08-16 07:20:20 - train.py[line:77] - INFO: {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 10, 'log_format': 'simple', 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': '../../ofa_module', 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 4, 'distributed_num_procs': 1, 'distributed_rank': 0, 'distributed_backend': 'gloo', 'distributed_init_method': 'env://', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'pytorch_ddp', 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': True, 'gradient_as_bucket_view': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 1, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': False, 'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': 8, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 500, 'validate_after_updates': 0, 'fixed_validation_seed': 7, 'disable_validation': False, 'max_tokens_valid': None, 'batch_size_valid': 8, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 5, 'max_update': 0, 'stop_time_hours': 0.0, 'clip_norm': 1.0, 'sentence_avg': False, 'update_freq': [4], 'lr': [1e-05], 'stop_min_lr': -1.0, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': './stage1_checkpoints/5_0.06_6000', 'restore_file': '../../checkpoints/ofa_medium.pt', 'finetune_from_model': None, 'reset_dataloader': True, 'reset_lr_scheduler': False, 'reset_meters': True, 'reset_optimizer': True, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 500, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': 1, 'no_save': False, 'no_epoch_checkpoints': True, 'no_last_checkpoints': 
False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'cider', 'maximize_best_checkpoint_metric': True, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 'model_parallel_size': 1, 'use_ema_weights_to_init_param': False, 'use_latest_weights_to_init_ema': False}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(_name='ofa_medium', activation_fn='gelu', adam_betas='(0.9,0.999)', adam_eps=1e-08, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, add_type_embedding=True, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, arch='ofa_medium', attention_dropout=0.0, attn_scale_factor=2, azureml_logging=False, batch_size=8, batch_size_valid=8, best_checkpoint_metric='cider', bf16=False, bpe=None, bpe_dir='../../utils/BPE', broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, code_dict_size=8192, code_image_size=128, code_layernorm_embedding=True, combine_valid_subsets=None, constraint_range=None, cpu=False, cpu_offload=False, criterion='adjust_label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='../../dataset/caption_data/caption_stage1_train.tsv,../../dataset/caption_data/caption_val.tsv', data_buffer_size=10, dataset_impl=None, ddp_backend='pytorch_ddp', ddp_comm_hook='none', decoder_attention_heads=8, decoder_drop_path_rate=0.1, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=4, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_normalize_before=True, decoder_output_dim=512, device_id=0, disable_entangle=True, disable_validation=False, distributed_backend='gloo', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1, drop_worst_after=6000, drop_worst_ratio=0.2, dropout=0.1, ema_decay=0.9999, ema_fp32=False, ema_seed_model=None, ema_start_update=0, ema_update_freq=1, empty_cache_freq=0, encoder_attention_heads=8, encoder_drop_path_rate=0.1, encoder_embed_dim=512, encoder_embed_path=None, 
encoder_ffn_embed_dim=2048, encoder_layerdrop=0, encoder_layers=4, encoder_layers_to_keep=None, encoder_learned_pos=True, encoder_normalize_before=True, end_learning_rate=0.0, entangle_position_embedding=False, eos=2, eval_args='{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}', eval_bleu=False, eval_cider=True, eval_cider_cached_tokens='../../dataset/caption_data/cider_cached_tokens/coco-valid-words.p', eval_print_samples=False, fast_stat_sync=False, find_unused_parameters=True, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=7, force_anneal=None, fp16=False, fp16_adam_stats=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, freeze_decoder_embedding=True, freeze_encoder_embedding=True, gen_subset='test', gradient_as_bucket_view=False, heartbeat_timeout=-1, ignore_eos=False, ignore_prefix_size=0, ignore_unused_valid_subsets=False, image_bucket_size=42, imagenet_default_mean_and_std=False, keep_best_checkpoints=1, keep_interval_updates=-1, keep_interval_updates_pattern=-1, keep_last_epochs=-1, label_smoothing=0.1, layernorm_embedding=True, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_file=None, log_format='simple', log_interval=10, lr=[1e-05], lr_scheduler='polynomial_decay', max_epoch=5, max_source_positions=1024, max_src_length=80, max_target_positions=1024, max_tgt_length=20, max_tokens=None, max_tokens_valid=None, max_update=0, max_valid_steps=None, maximize_best_checkpoint_metric=True, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_params_to_wrap=100000000, model_parallel_size=1, no_cross_attention=False, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_reshard_after_forward=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=True, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=1, num_bins=1000, num_shards=1, num_workers=1, on_cpu_convert_precision=False, optimizer='adam', optimizer_overrides='{}', orig_patch_image_size=256, pad=1, patch_image_size=480, patch_layernorm_embedding=True, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plasma_path='/tmp/plasma', pooler_activation_fn='tanh', pooler_classifier='mlp', pooler_dropout=0.0, power=1.0, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, reg_alpha=1.0, relu_dropout=0.0, report_accuracy=False, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=True, reset_logging=False, reset_lr_scheduler=False, reset_meters=True, reset_optimizer=True, resnet_drop_path_rate=0.0, resnet_type='resnet101', restore_file='../../checkpoints/ofa_medium.pt', sample_patch_num=196, save_dir='./stage1_checkpoints/5_0.06_6000', save_interval=1, save_interval_updates=500, scale_attn=True, scale_fc=True, scale_heads=True, scale_resids=False, scoring='bleu', scst=False, scst_args='{}', seed=1, selected_cols='0,4,2', sentence_avg=False, shard_id=0, share_all_embeddings=True, share_decoder_input_output_embed=True, simul_type=None, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, stop_min_lr=-1.0, stop_time_hours=0, store_ema=False, suppress_crashes=False, sync_bn=False, 
task='caption', tensorboard_logdir=None, threshold_loss_scale=None, token_bucket_size=256, tokenizer=None, total_num_update=1000000, tpu=False, train_subset='train', unk=3, update_freq=[4], use_bmuf=False, use_ema_weights_to_init_param=False, use_latest_weights_to_init_ema=False, use_old_adam=False, use_plasma_view=False, use_rdrop=False, use_sharded_state=False, user_dir='../../ofa_module', valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=500, wandb_project=None, warmup_ratio=0.06, warmup_updates=0, weight_decay=0.01, write_checkpoints_asynchronously=False, zero_sharding='none'), 'task': {'_name': 'caption', 'data': '../../dataset/caption_data/caption_stage1_train.tsv,../../dataset/caption_data/caption_val.tsv', 'selected_cols': '0,4,2', 'bpe': None, 'bpe_dir': '../../utils/BPE', 'max_source_positions': 1024, 'max_target_positions': 1024, 'max_src_length': 80, 'max_tgt_length': 20, 'code_dict_size': 8192, 'patch_image_size': 480, 'orig_patch_image_size': 256, 'num_bins': 1000, 'imagenet_default_mean_and_std': False, 'constraint_range': None, 'eval_bleu': False, 'eval_cider': True, 'eval_args': '{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}', 'eval_print_samples': False, 'eval_cider_cached_tokens': '../../dataset/caption_data/cider_cached_tokens/coco-valid-words.p', 'scst': False, 'scst_args': '{}'}, 'criterion': {'_name': 'adjust_label_smoothed_cross_entropy', 'label_smoothing': 0.1, 'report_accuracy': False, 'ignore_prefix_size': 0, 'ignore_eos': False, 'sentence_avg': False, 'drop_worst_ratio': 0.2, 'drop_worst_after': 6000, 'use_rdrop': False, 'reg_alpha': 1.0, 'sample_patch_num': 196, 'constraint_range': None}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9,0.999)', 'adam_eps': 1e-08, 'weight_decay': 0.01, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [1e-05]}, 'lr_scheduler': {'_name': 'polynomial_decay', 'warmup_updates': 0, 'warmup_ratio': 0.06, 'force_anneal': None, 'end_learning_rate': 0.0, 'power': 1.0, 'total_num_update': 1000000.0, 'lr': [1e-05]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None, 'ema': {'_name': None, 'store_ema': False, 'ema_decay': 0.9999, 'ema_start_update': 0, 'ema_seed_model': None, 'ema_update_freq': 1, 'ema_fp32': False}, 'simul_type': None}
2022-08-16 07:20:20 - distributed_c10d.py[line:263] - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2022-08-16 07:20:20 - distributed_c10d.py[line:263] - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2022-08-16 07:20:20 - utils.py[line:274] - INFO: initialized host 6e15b68b8333 as rank 1
single-machine distributed training is initialized.
2022-08-16 07:20:20 - utils.py[line:274] - INFO: initialized host 6e15b68b8333 as rank 2
single-machine distributed training is initialized.
2022-08-16 07:20:20 - ofa_task.py[line:109] - INFO: source dictionary: 59457 types
2022-08-16 07:20:20 - ofa_task.py[line:110] - INFO: target dictionary: 59457 types
/opt/conda/envs/vector/lib/python3.7/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2228.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/opt/conda/envs/vector/lib/python3.7/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2228.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/opt/conda/envs/vector/lib/python3.7/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2228.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/opt/conda/envs/vector/lib/python3.7/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2228.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 2 begin to initialize row_count and line_idx-to-offset mapping
2022-08-16 07:20:24 - train.py[line:101] - INFO: OFAModel(
(encoder): TransformerEncoder(
(dropout_module): FairseqDropout()
(embed_tokens): Embedding(59457, 512, padding_idx=1)
(layernorm_embedding): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(type_embedding): Embedding(2, 512)
(embed_images): ResNet(
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): Bottleneck(
(conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(drop_path): Identity()
)
(1): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(2): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
)
(layer2): Sequential(
(0): Bottleneck(
(conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(drop_path): Identity()
)
(1): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(2): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(3): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
)
(layer3): Sequential(
(0): Bottleneck(
(conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(drop_path): Identity()
)
(1): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(2): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(3): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(4): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(5): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(6): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(7): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(8): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(9): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(10): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(11): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(12): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(13): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(14): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(15): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(16): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(17): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(18): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(19): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(20): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(21): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
(22): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(drop_path): Identity()
)
)
)
(image_proj): Linear(in_features=1024, out_features=512, bias=True)
(patch_layernorm_embedding): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(embed_positions): Embedding(1026, 512)
(embed_image_positions): Embedding(1765, 512)
(pos_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(image_pos_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(pos_q_linear): Linear(in_features=512, out_features=512, bias=True)
(pos_k_linear): Linear(in_features=512, out_features=512, bias=True)
(layers): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(drop_path): Identity()
)
(1): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(drop_path): DropPath(p=0.03333333507180214)
)
(2): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(drop_path): DropPath(p=0.06666666269302368)
)
(3): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(drop_path): DropPath(p=0.10000000149011612)
)
)
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(token_rel_pos_table_list): ModuleList(
(0): Embedding(511, 8)
(1): Embedding(511, 8)
(2): Embedding(511, 8)
(3): Embedding(511, 8)
)
(image_rel_pos_table_list): ModuleList(
(0): Embedding(6892, 8)
(1): Embedding(6892, 8)
(2): Embedding(6892, 8)
(3): Embedding(6892, 8)
)
)
(decoder): TransformerDecoder(
(dropout_module): FairseqDropout()
(embed_tokens): Embedding(59457, 512, padding_idx=1)
(layernorm_embedding): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(embed_positions): Embedding(1026, 512)
(embed_image_positions): Embedding(1765, 512)
(pos_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(image_pos_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(self_pos_q_linear): Linear(in_features=512, out_features=512, bias=True)
(self_pos_k_linear): Linear(in_features=512, out_features=512, bias=True)
(cross_pos_q_linear): Linear(in_features=512, out_features=512, bias=True)
(cross_pos_k_linear): Linear(in_features=512, out_features=512, bias=True)
(code_layernorm_embedding): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(layers): ModuleList(
(0): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(cross_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(drop_path): Identity()
)
(1): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(cross_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(drop_path): DropPath(p=0.03333333507180214)
)
(2): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(cross_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(drop_path): DropPath(p=0.06666666269302368)
)
(3): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(cross_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(ffn_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(drop_path): DropPath(p=0.10000000149011612)
)
)
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(output_projection): Linear(in_features=512, out_features=59457, bias=False)
(token_rel_pos_table_list): ModuleList(
(0): Embedding(511, 8)
(1): Embedding(511, 8)
(2): Embedding(511, 8)
(3): Embedding(511, 8)
)
(image_rel_pos_table_list): ModuleList(
(0): Embedding(6892, 8)
(1): Embedding(6892, 8)
(2): Embedding(6892, 8)
(3): Embedding(6892, 8)
)
)
(classification_heads): ModuleDict()
)
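A side note on the model printout above: the DropPath probabilities rise linearly over the four decoder layers, from Identity() at layer 0 up to p=0.1 at layer 3, which matches decoder_drop_path_rate=0.1 in the script. A small reconstruction of that schedule, inferred from the printed values rather than taken from OFA's source:

# Reconstructed drop-path schedule (assumption based on the printed p values):
rate = 0.1                  # --decoder-drop-path-rate from the script
num_layers = 4              # ofa_medium decoder depth
dpr = [rate * i / (num_layers - 1) for i in range(num_layers)]
print(dpr)                  # [0.0, 0.0333..., 0.0666..., 0.1]
                            # p=0.0 shows up as Identity() in the printout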
2022-08-16 07:20:24 - train.py[line:102] - INFO: task: CaptionTask
2022-08-16 07:20:24 - train.py[line:103] - INFO: model: OFAModel
2022-08-16 07:20:24 - train.py[line:104] - INFO: criterion: AdjustLabelSmoothedCrossEntropyCriterion
2022-08-16 07:20:24 - train.py[line:108] - INFO: num. shared model params: 92,892,000 (num. trained: 62,450,016)
2022-08-16 07:20:24 - train.py[line:115] - INFO: num. expert model params: 0 (num. trained: 0)
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 0 begin to initialize row_count and line_idx-to-offset mapping
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 3 begin to initialize row_count and line_idx-to-offset mapping
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 1 begin to initialize row_count and line_idx-to-offset mapping
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 2 finished initializing row_count and line_idx-to-offset mapping
file ../../dataset/caption_data/caption_val.tsv slice_id 2 row count 1250 total row count 5000
/opt/conda/envs/vector/lib/python3.7/site-packages/torchvision/transforms/transforms.py:333: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
"Argument interpolation should be of type InterpolationMode instead of int. "
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 0 finished initializing row_count and line_idx-to-offset mapping
file ../../dataset/caption_data/caption_val.tsv slice_id 0 row count 1250 total row count 5000
/opt/conda/envs/vector/lib/python3.7/site-packages/torchvision/transforms/transforms.py:333: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
"Argument interpolation should be of type InterpolationMode instead of int. "
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 3 finished initializing row_count and line_idx-to-offset mapping
file ../../dataset/caption_data/caption_val.tsv slice_id 3 row count 1250 total row count 5000
/opt/conda/envs/vector/lib/python3.7/site-packages/torchvision/transforms/transforms.py:333: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
"Argument interpolation should be of type InterpolationMode instead of int. "
local datafile ../../dataset/caption_data/caption_val.tsv slice_id 1 finished initializing row_count and line_idx-to-offset mapping
file ../../dataset/caption_data/caption_val.tsv slice_id 1 row count 1250 total row count 5000
/opt/conda/envs/vector/lib/python3.7/site-packages/torchvision/transforms/transforms.py:333: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
"Argument interpolation should be of type InterpolationMode instead of int. "
2022-08-16 07:20:24 - distributed_c10d.py[line:228] - INFO: Added key: store_based_barrier_key:2 to store for rank: 0
2022-08-16 07:20:24 - distributed_c10d.py[line:263] - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer1.0.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer1.0.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer1.0.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer1.0.downsample.0.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer1.1.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer1.1.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer1.1.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer1.2.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer1.2.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer1.2.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer2.0.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer2.0.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer2.0.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer2.0.downsample.0.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer2.1.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer2.1.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer2.1.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer2.2.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer2.2.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer2.2.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer2.3.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer2.3.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer2.3.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.0.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.0.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.0.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.0.downsample.0.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.1.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.1.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.1.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.2.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.2.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.2.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.3.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.3.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.3.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.4.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.4.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.4.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.5.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.5.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.5.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.6.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.6.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.6.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.7.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.7.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.7.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.8.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.8.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.8.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.9.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.9.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.9.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.10.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.10.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.10.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.11.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.11.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.11.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.12.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.12.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.12.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.13.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.13.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.13.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.14.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.14.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.14.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.15.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.15.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.15.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.16.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.16.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.16.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.17.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.17.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.17.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.18.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.18.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.18.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.19.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.19.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.19.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.20.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.20.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.20.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.21.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.21.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.21.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.22.conv1.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.22.conv2.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- encoder.embed_images.layer3.22.conv3.bias
2022-08-16 07:20:24 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- decoder.output_projection.bias
2022-08-16 07:20:24 - train.py[line:145] - INFO: training on 4 devices (GPUs/TPUs)
2022-08-16 07:20:24 - train.py[line:151] - INFO: max tokens per device = None and max sentences per device = 8
2022-08-16 07:20:24 - trainer.py[line:458] - INFO: Preparing to load checkpoint ../../checkpoints/ofa_medium.pt
Traceback (most recent call last):
  File "/home/da-vector/da-vector-kaspi-marketplace/text-generation/OFA/trainer.py", line 511, in load_checkpoint
    self.model.load_state_dict(
  File "/home/da-vector/da-vector-kaspi-marketplace/text-generation/OFA/trainer.py", line 257, in model
    device=self.device,
  File "/home/da-vector/da-vector-kaspi-marketplace/text-generation/OFA/fairseq/fairseq/models/distributed_fairseq_model.py", line 66, in DistributedFairseqModel
    gradient_as_bucket_view=args.gradient_as_bucket_view,
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
    {p.device for p in module.parameters()},
  File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 674, in _log_and_throw
    raise err_type(err_msg)
ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cpu')}.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../../train.py", line 528, in <module>
    cli_main()
  File "../../train.py", line 521, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/da-vector/da-vector-kaspi-marketplace/text-generation/OFA/fairseq/fairseq/distributed/utils.py", line 374, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/home/da-vector/da-vector-kaspi-marketplace/text-generation/OFA/fairseq/fairseq/distributed/utils.py", line 348, in distributed_main
    main(cfg, **kwargs)
  File "../../train.py", line 161, in main
    disable_iterator_cache=True,
  File "/home/da-vector/da-vector-kaspi-marketplace/text-generation/OFA/utils/checkpoint_utils.py", line 254, in load_checkpoint
    reset_meters=reset_meters,
  File "/home/da-vector/da-vector-kaspi-marketplace/text-generation/OFA/trainer.py", line 526, in load_checkpoint
    "please ensure that the architectures match.".format(filename)
Exception: Cannot load model parameters from checkpoint ../../checkpoints/ofa_medium.pt; please ensure that the architectures match.

(All four worker processes raise this same pair of exceptions; only device_ids/output_device differ, taking the values [0]/0, [1]/1, [2]/2, and [3]/3 for ranks 0 through 3.)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 15221) of binary: /opt/conda/envs/vector/bin/python3
Traceback (most recent call last):
File "/opt/conda/envs/vector/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/envs/vector/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
)(*cmd_args)
File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/vector/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
../../train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2022-08-16_07:20:33
host : 6e15b68b8333
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 15222)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2022-08-16_07:20:33
host : 6e15b68b8333
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 15223)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2022-08-16_07:20:33
host : 6e15b68b8333
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 15224)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-08-16_07:20:33
host : 6e15b68b8333
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 15221)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
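De-interleaved, the root cause is the ValueError rather than the final "architectures match" exception: with --cpu the model parameters stay on device(type='cpu'), but launching through torch.distributed.launch makes DistributedFairseqModel hand device_ids/output_device (the local rank) to DistributedDataParallel, and DDP rejects those arguments for a CPU-resident module; the checkpoint-loading exception is just the generic wrapper around whatever fails while constructing trainer.model. A minimal standalone repro of that constraint, independent of OFA (the module and port here are purely illustrative):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process gloo group; gloo is the process-group backend that works on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

module = torch.nn.Linear(4, 4)  # parameters live on device(type='cpu')

try:
    # Mirrors the failing call in the traceback: device_ids set for a CPU module.
    DDP(module, device_ids=[0], output_device=0)
except ValueError as err:
    print(err)  # same "device_ids and output_device arguments only work with ..." message

ddp = DDP(module)  # omitting device_ids/output_device is the valid form for CPU modules
dist.destroy_process_group()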
Is it possible to run train_caption_stage1.sh on CPU?
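If so, one workaround sketch (untested, an assumption rather than a verified fix): skip torch.distributed.launch entirely and start a single CPU process, so fairseq should never wrap the model in DistributedDataParallel, and rely on intra-op threading to occupy the 4 cores. ${OTHER_FLAGS} below is a placeholder for the rest of the original command line:

# Sketch only: in train_caption_stage1.sh, replace the
# "python3 -m torch.distributed.launch --nproc_per_node=4 ..." invocation with:
export OMP_NUM_THREADS=4   # let the single process use 4 of the 8 CPU cores
# ${OTHER_FLAGS}: every remaining flag from the script, with --fp16 and
# --fp16-scale-window=512 removed as discussed above
python3 ../../train.py $data --cpu ${OTHER_FLAGS}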