OFA-Sys / Chinese-CLIP

Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
MIT License
4k stars, 418 forks

This problem is driving me crazy and I can't find a solution. Could an expert take a look? #291

Open iWangTing opened 3 months ago

iWangTing commented 3 months ago

Running the sh script always fails with an unrecognized-argument error, main.py: error: unrecognized arguments: --accum-freq=1, even though my script is identical to the example.

`usage: main.py [-h] --train-data TRAIN_DATA [--val-data VAL_DATA] [--num-workers NUM_WORKERS] [--logs LOGS] [--name NAME] [--log-interval LOG_INTERVAL] [--report-training-batch-acc] [--batch-size BATCH_SIZE]
               [--valid-batch-size VALID_BATCH_SIZE] [--max-steps MAX_STEPS] [--max-epochs MAX_EPOCHS] [--valid-step-interval VALID_STEP_INTERVAL] [--valid-epoch-interval VALID_EPOCH_INTERVAL] [--context-length CONTEXT_LENGTH]
               [--lr LR] [--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--wd WD] [--warmup WARMUP] [--use-bn-sync] [--use-augment] [--skip-scheduler] [--save-epoch-frequency SAVE_EPOCH_FREQUENCY]
               [--save-step-frequency SAVE_STEP_FREQUENCY] [--resume RESUME] [--reset-optimizer] [--reset-data-offset] [--precision {amp,fp16,fp32}] [--vision-model {ViT-B-32,ViT-B-16,ViT-L-14,ViT-L-14-336,ViT-H-14,RN50}]
               [--mask-ratio MASK_RATIO] [--clip-weight-path CLIP_WEIGHT_PATH] [--freeze-vision] [--text-model {RoBERTa-wwm-ext-base-chinese,RoBERTa-wwm-ext-large-chinese,RBT3-chinese}] [--bert-weight-path BERT_WEIGHT_PATH]
               [--grad-checkpointing] [--local_rank LOCAL_RANK] [--skip-aggregate] [--debug] [--seed SEED]
main.py: error: unrecognized arguments: --accum-freq=1
[2024-04-11 23:52:11,183] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 5808) of binary: /home/amax/.conda/envs/lxl/bin/python3
Traceback (most recent call last):
  File "/home/amax/.conda/envs/lxl/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/amax/.conda/envs/lxl/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/run.py", line 816, in <module>
    main()
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-11_23:52:11
  host      : amax
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 5808)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
`
ChesonHuang commented 3 months ago

Could you try the command below?

First, cd sdb1/lxl2/Chinese-CLIP-master/

python cn_clip/training/main.py \
    --train-data=${train_data} \
    --val-data=${val_data} \
    --resume=${resume} \
    ${reset_data_offset} \
    ${reset_optimizer} \
    --logs=${output_base_dir} \
    --name=${name} \
    --save-step-frequency=${save_step_frequency} \
    --save-epoch-frequency=${save_epoch_frequency} \
    --log-interval=${log_interval} \
    ${report_training_batch_acc} \
    --context-length=${context_length} \
    --warmup=${warmup} \
    --batch-size=${batch_size} \
    --valid-batch-size=${valid_batch_size} \
    --valid-step-interval=${valid_step_interval} \
    --valid-epoch-interval=${valid_epoch_interval} \
    --lr=${lr} \
    --accum_freq=${accum_freq} \
    --wd=${wd} \
    --max-epochs=${max_epochs} \
    --vision-model=${vision_model} \
    ${use_augment} \
    --text-model=${text_model} \
    --grad-checkpointing

Also take a look at cn_clip/training/params.py and search for accum-freq to check whether that argument is defined.

If you want to run distributed training, you can also check for running processes with ps -ef | grep main.

iWangTing commented 3 months ago

(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ python cn_clip/training/main.py
usage: main.py [-h] --train-data TRAIN_DATA [--val-data VAL_DATA] [--num-workers NUM_WORKERS] [--logs LOGS] [--name NAME] [--log-interval LOG_INTERVAL] [--report-training-batch-acc] [--batch-size BATCH_SIZE]
               [--valid-batch-size VALID_BATCH_SIZE] [--max-steps MAX_STEPS] [--max-epochs MAX_EPOCHS] [--valid-step-interval VALID_STEP_INTERVAL] [--valid-epoch-interval VALID_EPOCH_INTERVAL] [--context-length CONTEXT_LENGTH]
               [--lr LR] [--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--wd WD] [--warmup WARMUP] [--use-bn-sync] [--use-augment] [--skip-scheduler] [--save-epoch-frequency SAVE_EPOCH_FREQUENCY]
               [--save-step-frequency SAVE_STEP_FREQUENCY] [--resume RESUME] [--reset-optimizer] [--reset-data-offset] [--precision {amp,fp16,fp32}] [--vision-model {ViT-B-32,ViT-B-16,ViT-L-14,ViT-L-14-336,ViT-H-14,RN50}]
               [--mask-ratio MASK_RATIO] [--clip-weight-path CLIP_WEIGHT_PATH] [--freeze-vision] [--text-model {RoBERTa-wwm-ext-base-chinese,RoBERTa-wwm-ext-large-chinese,RBT3-chinese}] [--bert-weight-path BERT_WEIGHT_PATH]
               [--grad-checkpointing] [--local_rank LOCAL_RANK] [--skip-aggregate] [--debug] [--seed SEED]
main.py: error: the following arguments are required: --train-data
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --train-data=${train_data}
--train-data=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --val-data=${val_data}
--val-data=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --resume=${resume}
--resume=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${reset_data_offset}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${reset_optimizer}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --logs=${output_base_dir}
--logs=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --name=${name}
--name=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --save-step-frequency=${save_step_frequency}
--save-step-frequency=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --save-epoch-frequency=${save_epoch_frequency}
--save-epoch-frequency=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --log-interval=${log_interval}
--log-interval=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${report_training_batch_acc}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --context-length=${context_length}
--context-length=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --warmup=${warmup}
--warmup=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --batch-size=${batch_size}
--batch-size=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --valid-batch-size=${valid_batch_size}
--valid-batch-size=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --valid-step-interval=${valid_step_interval}
--valid-step-interval=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --valid-epoch-interval=${valid_epoch_interval}
--valid-epoch-interval=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --lr=${lr}
--lr=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --accum_freq=${accum_freq}
--accum_freq=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --wd=${wd}
--wd=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --max-epochs=${max_epochs}
--max-epochs=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --vision-model=${vision_model}
--vision-model=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${use_augment}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --text-model=${text_model}
--text-model=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --grad-checkpointing
--grad-checkpointing: command not found

Hello, the output is shown above. Also, params.py does contain the accum-freq argument.

(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ps -ef | grep main
amax 7490 4067 0 12:41 pts/0 00:00:00 grep --color=auto main

ChesonHuang commented 3 months ago

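Use this command to replace the original torchrun command inside your sh script, then run the script. Don't type it directly into the terminal like that. For example, remove the part highlighted in green in the script below. [image]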

iWangTing commented 3 months ago

(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ bash /home/amax/sdb1/lxl2/Chinese-CLIP-master/run_scripts/B_finetune_vit-b-16_rbt-base.sh
Traceback (most recent call last):
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 16, in <module>
    from cn_clip.clip import load
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/clip/__init__.py", line 4, in <module>
    from .model import convert_state_dict
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/clip/model.py", line 16, in <module>
    FlashMHA = importlib.import_module('flash_attn.flash_attention').FlashMHA
  File "/home/amax/.conda/envs/lxl/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/flash_attn/flash_attention.py", line 7, in <module>
    from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/flash_attn/flash_attn_interface.py", line 5, in <module>
    import flash_attn_cuda
ImportError: /home/amax/.conda/envs/lxl/lib/python3.9/site-packages/flash_attn_cuda.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE

This is the result after I ran cd as you said and then replaced the command line in the script.

ChesonHuang commented 3 months ago

/flash_attn_cuda.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalI

Your linux-gnu.so has a dependency problem. Please see https://github.com/open-mmlab/mmdetection3d/issues/1152 for a similar fix.

iWangTing commented 3 months ago

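I tried the fix from issue 1152, but it still doesn't work. That issue seems to be about mmcv, whereas my error comes from flash-attn. I also looked through flash-attn-related issues for a fix, with no luck. It seems flash-attn supports torch 1.12 and above, but mine is 1.10. Besides, I don't intend to use flash-attn at all. How can I disable or bypass the flash-attn-related code?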

ChesonHuang commented 3 months ago

pip uninstall flash_attn [image]
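For anyone who would rather keep the package installed, an optional dependency can also be guarded at import time. The following is only a minimal sketch and not the repository's actual code: `optional_attr` is a hypothetical helper, and it relies on the fact that a compiled extension failing to load (as in the "undefined symbol" error above) surfaces as an `ImportError`.

```python
import importlib
from typing import Any, Optional


def optional_attr(module_name: str, attr: str) -> Optional[Any]:
    """Return module_name.attr, or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers both "package not installed" and a compiled extension that
        # fails to load, since both are raised as ImportError.
        return None
    return getattr(module, attr, None)


# Hypothetical usage mirroring the failing line from the traceback.
FlashMHA = optional_attr("flash_attn.flash_attention", "FlashMHA")
print("flash attention available:", FlashMHA is not None)
```

Code that uses `FlashMHA` would then need to check for `None` before relying on it.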

iWangTing commented 3 months ago

(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ bash /home/amax/sdb1/lxl2/Chinese-CLIP-master/run_scripts/B_finetune_vit-b-16_rbt-base.sh
usage: main.py [-h] --train-data TRAIN_DATA [--val-data VAL_DATA] [--num-workers NUM_WORKERS] [--valid-num-workers VALID_NUM_WORKERS] [--logs LOGS] [--name NAME] [--log-interval LOG_INTERVAL] [--report-training-batch-acc]
               [--batch-size BATCH_SIZE] [--valid-batch-size VALID_BATCH_SIZE] [--max-steps MAX_STEPS] [--max-epochs MAX_EPOCHS] [--valid-step-interval VALID_STEP_INTERVAL] [--valid-epoch-interval VALID_EPOCH_INTERVAL]
               [--context-length CONTEXT_LENGTH] [--lr LR] [--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--wd WD] [--warmup WARMUP] [--use-bn-sync] [--use-augment] [--skip-scheduler] [--save-epoch-frequency SAVE_EPOCH_FREQUENCY]
               [--save-step-frequency SAVE_STEP_FREQUENCY] [--resume RESUME] [--reset-optimizer] [--reset-data-offset] [--precision {amp,fp16,fp32}] [--vision-model {ViT-B-32,ViT-B-16,ViT-L-14,ViT-L-14-336,ViT-H-14,RN50}]
               [--mask-ratio MASK_RATIO] [--clip-weight-path CLIP_WEIGHT_PATH] [--freeze-vision] [--text-model {RoBERTa-wwm-ext-base-chinese,RoBERTa-wwm-ext-large-chinese,RBT3-chinese}] [--bert-weight-path BERT_WEIGHT_PATH]
               [--grad-checkpointing] [--use-flash-attention] [--gather-with-grad] [--skip-aggregate] [--debug] [--seed SEED] [--distllation] [--teacher-model-name TEACHER_MODEL_NAME] [--kd_loss_weight KD_LOSS_WEIGHT]
               [--accum-freq ACCUM_FREQ]
main.py: error: unrecognized arguments: --accum_freq=1

Well, after doing what you said ("first cd sdb1/lxl2/Chinese-CLIP-master/ ..."), I got the error above. I'm honestly back where I started.

ChesonHuang commented 3 months ago

accum_freq

In the shell script, change --accum_freq=xxx to --accum-freq=xxx. [image]
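The reason the spelling matters can be reproduced with a small argparse sketch. The parser below is a hypothetical stand-in that registers only the one flag shown in the usage message above; it is not the contents of params.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the real parser: only the flag discussed
    # in this thread is registered.
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--accum-freq", type=int, default=1)
    return parser


parser = build_parser()

# The hyphenated spelling is recognized; argparse stores it as `accum_freq`.
args, unknown = parser.parse_known_args(["--accum-freq=2"])
print(args.accum_freq, unknown)

# The underscore spelling was never registered, so it is left over as unknown.
# parse_args() would exit with "unrecognized arguments" instead.
args, unknown = parser.parse_known_args(["--accum_freq=2"])
print(args.accum_freq, unknown)
```

argparse converts hyphens to underscores only in the attribute name (`args.accum_freq`), not in the command-line spelling, which is why the two are easy to confuse.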

iWangTing commented 3 months ago

Traceback (most recent call last):
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 346, in <module>
    main()
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 51, in main
    args.local_device_rank = int(os.environ['LOCAL_RANK'])
  File "/home/amax/.conda/envs/lxl/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'LOCAL_RANK'

A new argument problem has come up. Could you take another look?

ChesonHuang commented 3 months ago

Fix 1: in the shell script, add: [image]

Fix 2: in main.py: [image]
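The screenshots for both fixes are not preserved, so the following is only a guess at the general shape of a main.py-side change, not the commenter's actual edit. It reads `LOCAL_RANK` with a fallback rather than indexing `os.environ` directly; `get_local_rank` is a hypothetical helper name.

```python
import os
from typing import Mapping


def get_local_rank(env: Mapping[str, str] = os.environ) -> int:
    # Distributed launchers export LOCAL_RANK for each worker; a plain
    # `python main.py` does not, so fall back to rank 0 in that case.
    return int(env.get("LOCAL_RANK", "0"))


print(get_local_rank({}))
print(get_local_rank({"LOCAL_RANK": "3"}))
```

The shell-side alternative would be to export the variable before launching the script, which keeps main.py unchanged.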

iWangTing commented 3 months ago

Traceback (most recent call last):
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 346, in <module>
    main()
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 55, in main
    dist.init_process_group(backend="nccl")
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 224, in _env_rendezvous_handler
    world_size = int(_get_env_or_raise("WORLD_SIZE"))
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 203, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set

The problems just keep coming...

ChesonHuang commented 3 months ago

The error says the environment variable is not set. You can configure it like this: adding export WORLD_SIZE=xx is enough. [image]
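The env:// rendezvous used by `init_process_group` reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and `RANK` from the environment, so `WORLD_SIZE` may not be the only variable missing when no launcher is used. Below is a minimal sketch of filling in single-process defaults; the helper name, the default values, and the port number are illustrative assumptions rather than anything taken from the repository.

```python
from typing import MutableMapping

# Assumed defaults for one process on one machine; the port is arbitrary.
SINGLE_PROCESS_DEFAULTS = {
    "WORLD_SIZE": "1",
    "RANK": "0",
    "LOCAL_RANK": "0",
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
}


def fill_single_process_env(env: MutableMapping[str, str]) -> None:
    """Set each variable only if the launcher has not already provided it."""
    for key, value in SINGLE_PROCESS_DEFAULTS.items():
        env.setdefault(key, value)


# Demonstrated on a plain dict; in a training script one would pass
# os.environ before calling dist.init_process_group.
env = {"WORLD_SIZE": "4"}
fill_single_process_env(env)
print(env["WORLD_SIZE"], env["RANK"], env["MASTER_ADDR"])
```

Using `setdefault` means values exported by a real launcher are never overwritten.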

iWangTing commented 3 months ago

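The main problems are basically solved now and I can start training. Thank you for your patient guidance over the past several days; I'm more grateful than words can express. [salute]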