OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Apache License 2.0
2.42k stars, 248 forks

unrecognized arguments: --warmup-ratio #88

Closed AI-EnabledSoftwareEngineering-AISE closed 2 years ago

AI-EnabledSoftwareEngineering-AISE commented 2 years ago

Hi, I am trying to train your model for the caption task. I cloned the latest version of your repository and followed your instructions. First, I hit a max_epoch error, which turned out to be caused by the shell version. After getting past that, training fails with unrecognized arguments: --warmup-ratio=0.06. I looked through your training code and could not find a warmup-ratio variable. Did you remove it? How should I solve this issue?
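For anyone hitting the same message, here is a minimal sketch of how argparse ends up rejecting the flag, plus one possible workaround. It is not the repository's code. The toy parser registers only --warmup-updates and --total-num-update, the two scheduler options that do show up in the usage output pasted in this report. The helper names (build_parser, split_unknown, ratio_to_updates) are made up for illustration, and the idea that a ratio should map to round(ratio * total updates) is an assumption about the intended meaning of --warmup-ratio, not something confirmed here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Toy stand-in for the training parser (hypothetical, not OFA's code).

    It registers only the two scheduler flags that appear in the usage
    message of this report, so --warmup-ratio is unknown to it.
    """
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--warmup-updates", type=int, default=0)
    parser.add_argument("--total-num-update", type=int, default=0)
    return parser


def split_unknown(argv):
    """Return (namespace, leftovers) instead of exiting on unknown flags.

    parse_args() would stop with "unrecognized arguments: ..." for the
    same leftovers; parse_known_args() hands them back for inspection.
    """
    return build_parser().parse_known_args(argv)


def ratio_to_updates(warmup_ratio: float, total_num_update: int) -> int:
    """Assumed conversion: warm up for a fixed fraction of all updates."""
    return round(warmup_ratio * total_num_update)


if __name__ == "__main__":
    argv = ["--total-num-update=1000", "--warmup-ratio=0.06"]
    args, unknown = split_unknown(argv)
    print("not registered:", unknown)
    # Absolute count that could be passed as --warmup-updates instead.
    print("warmup updates:", ratio_to_updates(0.06, args.total_num_update))
```

If the checkout in use really has no --warmup-ratio option, passing the computed count through --warmup-updates is one way to get a comparable schedule. Whether it matches the behaviour the run script expects depends on the scheduler code in that checkout.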

/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
usage: train.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                [--log-format {json,none,simple,tqdm}] [--log-file LOG_FILE]
                [--aim-repo AIM_REPO] [--aim-run-hash AIM_RUN_HASH]
                [--tensorboard-logdir TENSORBOARD_LOGDIR]
                [--wandb-project WANDB_PROJECT] [--azureml-logging]
                [--seed SEED] [--cpu] [--tpu] [--bf16]
                [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16]
                [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
                [--fp16-scale-window FP16_SCALE_WINDOW]
                [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                [--on-cpu-convert-precision] [--min-loss-scale MIN_LOSS_SCALE]
                [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                [--amp-batch-retries AMP_BATCH_RETRIES]
                [--amp-init-scale AMP_INIT_SCALE]
                [--amp-scale-window AMP_SCALE_WINDOW] [--user-dir USER_DIR]
                [--empty-cache-freq EMPTY_CACHE_FREQ]
                [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                [--model-parallel-size MODEL_PARALLEL_SIZE]
                [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                [--profile] [--reset-logging] [--suppress-crashes]
                [--use-plasma-view] [--plasma-path PLASMA_PATH]
                [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy,scst_reward_criterion,adjust_label_smoothed_cross_entropy,clip_scst_reward_criterion,adjust_label_smoothed_encouraging_loss}]
                [--tokenizer {moses,nltk,space}]
                [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                [--task TASK] [--num-workers NUM_WORKERS]
                [--skip-invalid-size-inputs-valid-test]
                [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                [--data-buffer-size DATA_BUFFER_SIZE]
                [--train-subset TRAIN_SUBSET] [--valid-subset VALID_SUBSET]
                [--combine-valid-subsets] [--ignore-unused-valid-subsets]
                [--validate-interval VALIDATE_INTERVAL]
                [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                [--validate-after-updates VALIDATE_AFTER_UPDATES]
                [--fixed-validation-seed FIXED_VALIDATION_SEED]
                [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID]
                [--batch-size-valid BATCH_SIZE_VALID]
                [--max-valid-steps MAX_VALID_STEPS] [--curriculum CURRICULUM]
                [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS]
                [--shard-id SHARD_ID] [--grouped-shuffling]
                [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                [--update-ordered-indices-seed]
                [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                [--distributed-rank DISTRIBUTED_RANK]
                [--distributed-backend DISTRIBUTED_BACKEND]
                [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID]
                [--distributed-no-spawn]
                [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                [--ddp-comm-hook {none,fp16}] [--bucket-cap-mb BUCKET_CAP_MB]
                [--fix-batches-to-gpus] [--find-unused-parameters]
                [--gradient-as-bucket-view] [--fast-stat-sync]
                [--heartbeat-timeout HEARTBEAT_TIMEOUT] [--broadcast-buffers]
                [--slowmo-momentum SLOWMO_MOMENTUM]
                [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                [--localsgd-frequency LOCALSGD_FREQUENCY]
                [--nprocs-per-node NPROCS_PER_NODE]
                [--pipeline-model-parallel]
                [--pipeline-balance PIPELINE_BALANCE]
                [--pipeline-devices PIPELINE_DEVICES]
                [--pipeline-chunks PIPELINE_CHUNKS]
                [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                [--pipeline-checkpoint {always,never,except_last}]
                [--zero-sharding {none,os}] [--no-reshard-after-forward]
                [--fp32-reduce-scatter] [--cpu-offload] [--use-sharded-state]
                [--not-fsdp-flatten-parameters] [--arch ARCH]
                [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE]
                [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM]
                [--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR]
                [--stop-min-lr STOP_MIN_LR] [--use-bmuf]
                [--skip-remainder-batch] [--save-dir SAVE_DIR]
                [--restore-file RESTORE_FILE] [--continue-once CONTINUE_ONCE]
                [--finetune-from-model FINETUNE_FROM_MODEL]
                [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters]
                [--reset-optimizer]
                [--optimizer-overrides OPTIMIZER_OVERRIDES]
                [--save-interval SAVE_INTERVAL]
                [--save-interval-updates SAVE_INTERVAL_UPDATES]
                [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                [--keep-last-epochs KEEP_LAST_EPOCHS]
                [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save]
                [--no-epoch-checkpoints] [--no-last-checkpoints]
                [--no-save-optimizer-state]
                [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                [--maximize-best-checkpoint-metric] [--patience PATIENCE]
                [--checkpoint-suffix CHECKPOINT_SUFFIX]
                [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                [--load-checkpoint-on-all-dp-ranks]
                [--write-checkpoints-asynchronously] [--store-ema]
                [--ema-decay EMA_DECAY] [--ema-start-update EMA_START_UPDATE]
                [--ema-seed-model EMA_SEED_MODEL]
                [--ema-update-freq EMA_UPDATE_FREQ] [--ema-fp32]
                [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--dropout D] [--attention-dropout D] [--activation-dropout D]
                [--encoder-embed-path STR] [--encoder-embed-dim N]
                [--encoder-ffn-embed-dim N] [--encoder-layers N]
                [--encoder-attention-heads N] [--encoder-normalize-before]
                [--encoder-learned-pos] [--decoder-embed-path STR]
                [--decoder-embed-dim N] [--decoder-ffn-embed-dim N]
                [--decoder-layers N] [--decoder-attention-heads N]
                [--decoder-learned-pos] [--decoder-normalize-before]
                [--decoder-output-dim N] [--share-decoder-input-output-embed]
                [--share-all-embeddings] [--no-token-positional-embeddings]
                [--adaptive-softmax-cutoff EXPR]
                [--adaptive-softmax-dropout D] [--layernorm-embedding]
                [--no-scale-embedding] [--checkpoint-activations]
                [--offload-activations] [--no-cross-attention]
                [--cross-self-attention] [--encoder-layerdrop D]
                [--decoder-layerdrop D]
                [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP]
                [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP]
                [--quant-noise-pq D] [--quant-noise-pq-block-size D]
                [--quant-noise-scalar D] [--min-params-to-wrap D]
                [--resnet-drop-path-rate RESNET_DROP_PATH_RATE]
                [--encoder-drop-path-rate ENCODER_DROP_PATH_RATE]
                [--decoder-drop-path-rate DECODER_DROP_PATH_RATE]
                [--token-bucket-size TOKEN_BUCKET_SIZE]
                [--image-bucket-size IMAGE_BUCKET_SIZE]
                [--attn-scale-factor ATTN_SCALE_FACTOR] [--freeze-resnet]
                [--freeze-encoder-embedding] [--freeze-decoder-embedding]
                [--add-type-embedding]
                [--resnet-type {resnet50,resnet101,resnet152}]
                [--resnet-model-path STR] [--code-image-size CODE_IMAGE_SIZE]
                [--patch-layernorm-embedding] [--code-layernorm-embedding]
                [--entangle-position-embedding] [--disable-entangle]
                [--sync-bn] [--scale-attn] [--scale-fc] [--scale-heads]
                [--scale-resids] [--pooler-dropout D]
                [--pooler-classifier {mlp,linear}]
                [--pooler-activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--spectral-norm-classification-head]
                [--selected-cols SELECTED_COLS] [--bpe-dir BPE_DIR]
                [--max-source-positions MAX_SOURCE_POSITIONS]
                [--max-target-positions MAX_TARGET_POSITIONS]
                [--max-src-length MAX_SRC_LENGTH]
                [--max-tgt-length MAX_TGT_LENGTH]
                [--code-dict-size CODE_DICT_SIZE]
                [--patch-image-size PATCH_IMAGE_SIZE] [--num-bins NUM_BINS]
                [--imagenet-default-mean-and-std]
                [--constraint-range CONSTRAINT_RANGE] [--eval-bleu]
                [--eval-cider] [--eval-args EVAL_ARGS] [--eval-print-samples]
                [--eval-cider-cached-tokens EVAL_CIDER_CACHED_TOKENS] [--scst]
                [--scst-args SCST_ARGS] [--label-smoothing LABEL_SMOOTHING]
                [--report-accuracy] [--ignore-prefix-size IGNORE_PREFIX_SIZE]
                [--ignore-eos] [--drop-worst-ratio DROP_WORST_RATIO]
                [--drop-worst-after DROP_WORST_AFTER] [--use-rdrop]
                [--reg-alpha REG_ALPHA] [--sample-patch-num SAMPLE_PATCH_NUM]
                [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS]
                [--weight-decay WEIGHT_DECAY] [--use-old-adam]
                [--fp16-adam-stats] [--warmup-updates WARMUP_UPDATES]
                [--force-anneal FORCE_ANNEAL]
                [--end-learning-rate END_LEARNING_RATE] [--power POWER]
                [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS]
                [--unk UNK]
                data
train.py: error: unrecognized arguments: --warmup-ratio=0.06
                [--spectral-norm-classification-head]
                [--selected-cols SELECTED_COLS] [--bpe-dir BPE_DIR]
                [--max-source-positions MAX_SOURCE_POSITIONS]
                [--max-target-positions MAX_TARGET_POSITIONS]
                [--max-src-length MAX_SRC_LENGTH]
                [--max-tgt-length MAX_TGT_LENGTH]
                [--code-dict-size CODE_DICT_SIZE]
                [--patch-image-size PATCH_IMAGE_SIZE] [--num-bins NUM_BINS]
                [--imagenet-default-mean-and-std]
                [--constraint-range CONSTRAINT_RANGE] [--eval-bleu]
                [--eval-cider] [--eval-args EVAL_ARGS] [--eval-print-samples]
                [--eval-cider-cached-tokens EVAL_CIDER_CACHED_TOKENS] [--scst]
                [--scst-args SCST_ARGS] [--label-smoothing LABEL_SMOOTHING]
                [--report-accuracy] [--ignore-prefix-size IGNORE_PREFIX_SIZE]
                [--ignore-eos] [--drop-worst-ratio DROP_WORST_RATIO]
                [--drop-worst-after DROP_WORST_AFTER] [--use-rdrop]
                [--reg-alpha REG_ALPHA] [--sample-patch-num SAMPLE_PATCH_NUM]
                [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS]
                [--weight-decay WEIGHT_DECAY] [--use-old-adam]
                [--fp16-adam-stats] [--warmup-updates WARMUP_UPDATES]
                [--force-anneal FORCE_ANNEAL]
                [--end-learning-rate END_LEARNING_RATE] [--power POWER]
                [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS]
                [--unk UNK]
                data
train.py: error: unrecognized arguments: --warmup-ratio=0.06
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 31799) of binary: /home/XXXX/.conda/envs/ofa/bin/python
Traceback (most recent call last):
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../../train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-04-30_22:19:49
  host      : hartley
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 31800)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2022-04-30_22:19:49
  host      : hartley
  rank      : 2 (local_rank: 2)
  exitcode  : 2 (pid: 31801)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2022-04-30_22:19:49
  host      : hartley
  rank      : 3 (local_rank: 3)
  exitcode  : 2 (pid: 31802)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-04-30_22:19:49
  host      : hartley
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 31799)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
AI-EnabledSoftwareEngineering-AISE commented 2 years ago

Also, this is the setup I used just to test your training script:

export MASTER_PORT=1051
log_dir=./stage1_logs
save_dir=./stage1_checkpoints
mkdir -p $log_dir $save_dir

bpe_dir=../../utils/BPE
user_dir=../../ofa_module

# data_dir=../../dataset/caption_data
data_dir=/raid/AISSEL/YYYY/datasets/caption_data
data=${data_dir}/caption_stage1_train.tsv,${data_dir}/caption_val.tsv
restore_file=../../checkpoints/ofa_large.pt
selected_cols=0,4,2

task=caption
arch=ofa_large
criterion=adjust_label_smoothed_cross_entropy
label_smoothing=0.1
lr=1e-5
max_epoch=2
warmup_ratio=0.06
batch_size=8
update_freq=4
resnet_drop_path_rate=0.0
encoder_drop_path_rate=0.1
decoder_drop_path_rate=0.1
dropout=0.1
attention_dropout=0.0
max_src_length=80
max_tgt_length=20
num_bins=1000
drop_worst_after=2500
patch_image_size=480
eval_cider_cached=${data_dir}/cider_cached_tokens/coco-valid-words.p
drop_worst_ratio=0.2

CUDA_VISIBLE_DEVICES=0,1 ~/.conda/envs/ofa/bin/python -m torch.distributed.launch --nproc_per_node=4 --master_port=${MASTER_PORT} ../../train.py \
          $data \
          --selected-cols=${selected_cols} \
          --bpe-dir=${bpe_dir} \
          --user-dir=${user_dir} \
          --restore-file=${restore_file} \
          --reset-optimizer --reset-dataloader --reset-meters \
          --save-dir=${save_dir} \
          --task=${task} \
          --arch=${arch} \
          --criterion=${criterion} \
          --label-smoothing=${label_smoothing} \
          --batch-size=${batch_size} \
          --update-freq=${update_freq} \
          --encoder-normalize-before \
          --decoder-normalize-before \
          --share-decoder-input-output-embed \
          --share-all-embeddings \
          --layernorm-embedding \
          --patch-layernorm-embedding \
          --code-layernorm-embedding \
          --resnet-drop-path-rate=${resnet_drop_path_rate} \
          --encoder-drop-path-rate=${encoder_drop_path_rate} \
          --decoder-drop-path-rate=${decoder_drop_path_rate} \
          --dropout=${dropout} \
          --attention-dropout=${attention_dropout} \
          --weight-decay=0.01 --optimizer=adam --adam-betas="(0.9,0.999)" --adam-eps=1e-08 --clip-norm=1.0 \
          --lr-scheduler=polynomial_decay --lr=${lr} \
          --max-epoch=${max_epoch} --warmup-ratio=${warmup_ratio} \
          --log-format=simple --log-interval=10 \
          --fixed-validation-seed=7 \
          --no-epoch-checkpoints --keep-best-checkpoints=1 \
          --save-interval=1 --validate-interval=1 \
          --save-interval-updates=500 --validate-interval-updates=500 \
          --eval-cider \
          --eval-cider-cached-tokens=${eval_cider_cached} \
          --eval-args='{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}' \
          --best-checkpoint-metric=cider --maximize-best-checkpoint-metric \
          --max-src-length=${max_src_length} \
          --max-tgt-length=${max_tgt_length} \
          --find-unused-parameters \
          --freeze-encoder-embedding \
          --freeze-decoder-embedding \
          --add-type-embedding \
          --scale-attn \
          --scale-fc \
          --scale-heads \
          --disable-entangle \
          --num-bins=${num_bins} \
          --patch-image-size=${patch_image_size} \
          --drop-worst-ratio=${drop_worst_ratio} \
          --drop-worst-after=${drop_worst_after} \
          --fp16 \
          --fp16-scale-window=512 \
          --num-workers=0 #> ${log_file} 2>&1
logicwong commented 2 years ago

@AI-EnabledSoftwareEngineering-AISE Hi, did you use the fairseq library bundled with our repo? The --warmup-ratio argument is defined in OFA/fairseq/fairseq/optim/lr_scheduler/polynomial_decay_schedule.py.
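
One way to check which fairseq copy an environment actually picks up is sketched below. It is a minimal sketch rather than anything from the OFA codebase: the helper names `find_package_dir` and `scheduler_defines_warmup_ratio` are hypothetical, and the check is only a plain text search of the scheduler file named above, on the assumption that a copy supporting the option mentions `warmup_ratio` or `warmup-ratio` somewhere in that file.

```python
import importlib.util
from pathlib import Path
from typing import Optional

# Path of the scheduler file relative to the fairseq package directory,
# taken from the reply above.
SCHEDULER_REL_PATH = Path("optim") / "lr_scheduler" / "polynomial_decay_schedule.py"


def find_package_dir(name: str = "fairseq") -> Optional[Path]:
    """Return the directory of the importable package `name`, or None if it is absent."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


def scheduler_defines_warmup_ratio(package_dir: Path) -> Optional[bool]:
    """Text-search the scheduler file for the warmup-ratio option.

    Returns True or False depending on whether the option name appears,
    or None when the scheduler file does not exist under `package_dir`.
    """
    scheduler = package_dir / SCHEDULER_REL_PATH
    if not scheduler.is_file():
        return None
    text = scheduler.read_text(encoding="utf-8")
    return "warmup_ratio" in text or "warmup-ratio" in text


if __name__ == "__main__":
    pkg = find_package_dir()
    if pkg is None:
        print("fairseq is not importable in this environment")
    else:
        print("fairseq is loaded from:", pkg)
        print("scheduler mentions warmup ratio:", scheduler_defines_warmup_ratio(pkg))
```

If the printed location is not inside the cloned OFA directory, or the second line prints False, the environment is probably using a fairseq installed from somewhere else.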

AI-EnabledSoftwareEngineering-AISE commented 2 years ago

Thank you, that was the cause: I had installed fairseq from another source.

zh-zhang1984 commented 3 months ago

Hi, can anyone suggest how to solve this issue? How can I switch to the fairseq library from the current repo?