OFA-Sys / Chinese-CLIP

Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
MIT License

main.py: error: unrecognized arguments: --accum_freq=1 #286

Open iWangTing opened 3 months ago

iWangTing commented 3 months ago
The error report is below. What is causing this, and how can I fix it?
usage: main.py [-h] --train-data TRAIN_DATA [--val-data VAL_DATA] [--num-workers NUM_WORKERS] [--logs LOGS] [--name NAME] [--log-interval LOG_INTERVAL] [--report-training-batch-acc] [--batch-size BATCH_SIZE]
               [--valid-batch-size VALID_BATCH_SIZE] [--max-steps MAX_STEPS] [--max-epochs MAX_EPOCHS] [--valid-step-interval VALID_STEP_INTERVAL] [--valid-epoch-interval VALID_EPOCH_INTERVAL] [--context-length CONTEXT_LENGTH]
               [--lr LR] [--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--wd WD] [--warmup WARMUP] [--use-bn-sync] [--use-augment] [--skip-scheduler] [--save-epoch-frequency SAVE_EPOCH_FREQUENCY]
               [--save-step-frequency SAVE_STEP_FREQUENCY] [--resume RESUME] [--reset-optimizer] [--reset-data-offset] [--precision {amp,fp16,fp32}] [--vision-model {ViT-B-32,ViT-B-16,ViT-L-14,ViT-L-14-336,ViT-H-14,RN50}]
               [--mask-ratio MASK_RATIO] [--clip-weight-path CLIP_WEIGHT_PATH] [--freeze-vision] [--text-model {RoBERTa-wwm-ext-base-chinese,RoBERTa-wwm-ext-large-chinese,RBT3-chinese}] [--bert-weight-path BERT_WEIGHT_PATH]
               [--grad-checkpointing] [--local_rank LOCAL_RANK] [--skip-aggregate] [--debug] [--seed SEED]
main.py: error: unrecognized arguments: --accum_freq=1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 3485082) of binary: /home/amax/.conda/envs/lxl/bin/python
Traceback (most recent call last):
  File "/home/amax/.conda/envs/lxl/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-08_21:50:39
  host      : amax
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 3485082)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
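
The usage listing above is what argparse prints when it rejects an unknown option: none of the recognized arguments is --accum-freq (or --accum_freq), so the copy of the training code being launched apparently predates gradient-accumulation support, main.py exits with code 2, and torchrun reports the child failure shown in the traceback. A quick way to confirm which accumulation-related flags the local checkout actually defines is to grep the argument parser; a minimal sketch, assuming the usual repo layout with the parser in cn_clip/training/params.py:

```bash
# Minimal check (path taken from the FAILED line above): list any
# accumulation-related arguments defined by the local argument parser.
# If this prints nothing, the flag simply does not exist in this checkout.
grep -n "accum" sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/params.py
```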
iWangTing commented 3 months ago

The script is as follows:

#!/usr/bin/env bash

# Guide:
# This script supports distributed training on multi-gpu workers (as well as single-worker training). 
# Please set the options below according to the comments. 
# For multi-gpu workers training, these options should be manually set for each worker. 
# After setting the options, please run the script on each worker.
# Command: bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}

GPUS_PER_NODE=1
WORKER_CNT=1
export MASTER_ADDR="localhost"
export MASTER_PORT=8514
export RANK=0
export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip/

DATAPATH="/home/amax/sdb1/lxl2/B-data"

# Paths of the LMDB-format training and validation sets (holding the images and image-text pairs in LMDB format)
train_data=${DATAPATH}/datasets/Bdata/lmdb/train
val_data=${DATAPATH}/datasets/Bdata/lmdb/valid # if val_data is not specified, validation will be automatically disabled

# restore options
resume=${DATAPATH}/pretrained_weights/clip_cn_vit-b-16.pt # or specify your own ckpt path to resume from
reset_data_offset="--reset-data-offset"
reset_optimizer="--reset-optimizer"
# reset_optimizer=""

# Output-related options
output_base_dir=${DATAPATH}/experiments/
name="B_finetune_vit-b-16_roberta-base" # finetune hyperparameters, logs, and ckpts will be saved under ${output_base_dir}/${name}/
save_step_frequency=999999 # disable it
save_epoch_frequency=1 # save a finetune ckpt every epoch
log_interval=1 # print a log line every this many steps
report_training_batch_acc="--report-training-batch-acc" # report in-batch accuracy on training batches during training

# Training hyperparameters
context_length=52 # sequence length; set to 52, the Chinese-CLIP default
warmup=100  # number of warmup steps
batch_size=32 # per-GPU training batch size
valid_batch_size=32 # per-GPU validation batch size
lr=5e-5  # learning rate; since the contrastive-learning batch size used here is small, the learning rate is lowered accordingly
accum_freq=1
wd=0.001
max_epochs=100
valid_step_interval=999999
valid_epoch_interval=1
vision_model="ViT-B-16"
text_model="RoBERTa-wwm-ext-base-chinese"
use_augment="--use-augment"

torchrun   --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT}   sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py \
          --train-data=${train_data} \
          --val-data=${val_data} \
          --resume=${resume} \
          ${reset_data_offset} \
          ${reset_optimizer} \
          --logs=${output_base_dir} \
          --name=${name} \
          --save-step-frequency=${save_step_frequency} \
          --save-epoch-frequency=${save_epoch_frequency} \
          --log-interval=${log_interval} \
          ${report_training_batch_acc} \
          --context-length=${context_length} \
          --warmup=${warmup} \
          --batch-size=${batch_size} \
          --valid-batch-size=${valid_batch_size} \
          --valid-step-interval=${valid_step_interval} \
          --valid-epoch-interval=${valid_epoch_interval} \
          --lr=${lr} \
          --accum_freq=${accum_freq} \
          --wd=${wd} \
          --max-epochs=${max_epochs} \
          --vision-model=${vision_model} \
          ${use_augment} \
          --text-model=${text_model} \
          --grad-checkpointing
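
Since --accum-freq is absent from the usage listing, passing the flag in any spelling will fail until the training code itself is updated. A hedged sketch of the two usual ways forward, assuming that newer Chinese-CLIP releases name the option --accum-freq with a hyphen, like every other flag in this script: either update the checkout to a version whose params.py defines the flag, or drop it from the launch command (with accum_freq=1 there is no gradient accumulation anyway, so removing it does not change training behaviour).

```bash
# Sketch of a corrected launch with the unrecognized flag removed; only the
# required and a few representative options are shown, the rest of the script
# above stays as it is. Variables are the ones defined earlier in the script.
torchrun --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} \
          sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py \
          --train-data=${train_data} \
          --val-data=${val_data} \
          --resume=${resume} \
          --lr=${lr} \
          --wd=${wd} \
          --batch-size=${batch_size} \
          --max-epochs=${max_epochs} \
          --vision-model=${vision_model} \
          --text-model=${text_model}

# If the checkout is later updated to a version whose params.py defines the
# flag, gradient accumulation can be enabled with the hyphenated form:
#           --accum-freq=${accum_freq} \
```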