Open iWangTing opened 3 months ago
Could you try the command below and see what happens?
First, cd sdb1/lxl2/Chinese-CLIP-master/
python cn_clip/training/main.py \
    --train-data=${train_data} \
    --val-data=${val_data} \
    --resume=${resume} \
    ${reset_data_offset} \
    ${reset_optimizer} \
    --logs=${output_base_dir} \
    --name=${name} \
    --save-step-frequency=${save_step_frequency} \
    --save-epoch-frequency=${save_epoch_frequency} \
    --log-interval=${log_interval} \
    ${report_training_batch_acc} \
    --context-length=${context_length} \
    --warmup=${warmup} \
    --batch-size=${batch_size} \
    --valid-batch-size=${valid_batch_size} \
    --valid-step-interval=${valid_step_interval} \
    --valid-epoch-interval=${valid_epoch_interval} \
    --lr=${lr} \
    --accum_freq=${accum_freq} \
    --wd=${wd} \
    --max-epochs=${max_epochs} \
    --vision-model=${vision_model} \
    ${use_augment} \
    --text-model=${text_model} \
    --grad-checkpointing
You can open cn_clip/training/params.py and search for accum-freq to check whether that parameter exists.
If you want to run distributed training, you can also check the processes with ps -ef | grep main.
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ python cn_clip/training/main.py
usage: main.py [-h] --train-data TRAIN_DATA [--val-data VAL_DATA] [--num-workers NUM_WORKERS] [--logs LOGS] [--name NAME] [--log-interval LOG_INTERVAL] [--report-training-batch-acc]
[--batch-size BATCH_SIZE] [--valid-batch-size VALID_BATCH_SIZE] [--max-steps MAX_STEPS] [--max-epochs MAX_EPOCHS] [--valid-step-interval VALID_STEP_INTERVAL] [--valid-epoch-interval VALID_EPOCH_INTERVAL]
[--context-length CONTEXT_LENGTH] [--lr LR] [--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--wd WD] [--warmup WARMUP] [--use-bn-sync] [--use-augment] [--skip-scheduler] [--save-epoch-frequency SAVE_EPOCH_FREQUENCY]
[--save-step-frequency SAVE_STEP_FREQUENCY] [--resume RESUME] [--reset-optimizer] [--reset-data-offset] [--precision {amp,fp16,fp32}] [--vision-model {ViT-B-32,ViT-B-16,ViT-L-14,ViT-L-14-336,ViT-H-14,RN50}]
[--mask-ratio MASK_RATIO] [--clip-weight-path CLIP_WEIGHT_PATH] [--freeze-vision] [--text-model {RoBERTa-wwm-ext-base-chinese,RoBERTa-wwm-ext-large-chinese,RBT3-chinese}] [--bert-weight-path BERT_WEIGHT_PATH]
[--grad-checkpointing] [--local_rank LOCAL_RANK] [--skip-aggregate] [--debug] [--seed SEED]
main.py: error: the following arguments are required: --train-data
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --train-data=${train_data}
--train-data=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --val-data=${val_data}
--val-data=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --resume=${resume}
--resume=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${reset_data_offset}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${reset_optimizer}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --logs=${output_base_dir}
--logs=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --name=${name}
--name=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --save-step-frequency=${save_step_frequency}
--save-step-frequency=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --save-epoch-frequency=${save_epoch_frequency}
--save-epoch-frequency=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --log-interval=${log_interval}
--log-interval=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${report_training_batch_acc}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --context-length=${context_length}
--context-length=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --warmup=${warmup}
--warmup=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --batch-size=${batch_size}
--batch-size=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --valid-batch-size=${valid_batch_size}
--valid-batch-size=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --valid-step-interval=${valid_step_interval}
--valid-step-interval=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --valid-epoch-interval=${valid_epoch_interval}
--valid-epoch-interval=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --lr=${lr}
--lr=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --accum_freq=${accum_freq}
--accum_freq=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --wd=${wd}
--wd=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --max-epochs=${max_epochs}
--max-epochs=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --vision-model=${vision_model}
--vision-model=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ${use_augment}
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --text-model=${text_model}
--text-model=: command not found
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ --grad-checkpointing
--grad-checkpointing: command not found
Hello, the output is shown above. Also, params.py does contain the accum-freq parameter.
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ps -ef | grep main
amax 7490 4067 0 12:41 pts/0 00:00:00 grep --color=auto main
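Use this command to replace the original torchrun command inside your sh script and then run the script; it is not meant to be typed directly into the terminal like this. For example, remove the part of the script highlighted in green [screenshot].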
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ bash /home/amax/sdb1/lxl2/Chinese-CLIP-master/run_scripts/B_finetune_vit-b-16_rbt-base.sh
Traceback (most recent call last):
File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 16, in
/flash_attn_cuda.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalI
There is a problem with the dependencies of your linux-gnu.so file. Please see the similar fix described at https://github.com/open-mmlab/mmdetection3d/issues/1152
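I tried the fix from issue 1152, but it still does not work. That issue appears to be about mmcv, whereas my error comes from flash-attn. I also tried fixes from flash-attn's own issues, with no luck. It seems flash-attn requires torch 1.12 or later, and mine is 1.10. I do not intend to use flash-attn at all, so how can I disable or skip the flash-attn related parts in the code?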
pip uninstall flash_attn
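If uninstalling is not an option, another possibility is to make the import optional in your own copy of the code. This is only a rough sketch: `optional_import` and `USE_FLASH_ATTENTION` are hypothetical names, not part of Chinese-CLIP, and it assumes the broken extension fails with an ImportError at load time (which is how Python normally reports an undefined symbol in a .so file).

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is missing or fails to load."""
    try:
        return importlib.import_module(name)
    except (ImportError, OSError):
        # ImportError covers both "not installed" and a compiled extension
        # that cannot be loaded (e.g. an undefined symbol).
        return None


# Hypothetical flag: only take the flash-attention code path when the import worked.
flash_attn = optional_import("flash_attn")
USE_FLASH_ATTENTION = flash_attn is not None
```

Code that needs flash attention would then check the flag first and fall back to the regular attention path otherwise.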
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ bash /home/amax/sdb1/lxl2/Chinese-CLIP-master/run_scripts/B_finetune_vit-b-16_rbt-base.sh
usage: main.py [-h] --train-data TRAIN_DATA [--val-data VAL_DATA] [--num-workers NUM_WORKERS] [--valid-num-workers VALID_NUM_WORKERS] [--logs LOGS] [--name NAME] [--log-interval LOG_INTERVAL] [--report-training-batch-acc]
[--batch-size BATCH_SIZE] [--valid-batch-size VALID_BATCH_SIZE] [--max-steps MAX_STEPS] [--max-epochs MAX_EPOCHS] [--valid-step-interval VALID_STEP_INTERVAL] [--valid-epoch-interval VALID_EPOCH_INTERVAL]
[--context-length CONTEXT_LENGTH] [--lr LR] [--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--wd WD] [--warmup WARMUP] [--use-bn-sync] [--use-augment] [--skip-scheduler] [--save-epoch-frequency SAVE_EPOCH_FREQUENCY]
[--save-step-frequency SAVE_STEP_FREQUENCY] [--resume RESUME] [--reset-optimizer] [--reset-data-offset] [--precision {amp,fp16,fp32}] [--vision-model {ViT-B-32,ViT-B-16,ViT-L-14,ViT-L-14-336,ViT-H-14,RN50}]
[--mask-ratio MASK_RATIO] [--clip-weight-path CLIP_WEIGHT_PATH] [--freeze-vision] [--text-model {RoBERTa-wwm-ext-base-chinese,RoBERTa-wwm-ext-large-chinese,RBT3-chinese}] [--bert-weight-path BERT_WEIGHT_PATH]
[--grad-checkpointing] [--use-flash-attention] [--gather-with-grad] [--skip-aggregate] [--debug] [--seed SEED] [--distllation] [--teacher-model-name TEACHER_MODEL_NAME] [--kd_loss_weight KD_LOSS_WEIGHT]
[--accum-freq ACCUM_FREQ]
main.py: error: unrecognized arguments: --accum_freq=1
Well, after doing what you described ("First, cd sdb1/lxl2/Chinese-CLIP-master/..."), I got the error above. I am truly back where I started.
In the shell script, change --accum_freq=xxx to --accum-freq=xxx
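For context, argparse only accepts the exact option string that was registered, so an underscore in place of the hyphen is rejected even though the parsed attribute is later called `accum_freq`. A minimal sketch, assuming the option is registered as `--accum-freq` with an integer value, as the usage output suggests (the parser below is a stand-in, not the real params.py):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser with a single option, mirroring "[--accum-freq ACCUM_FREQ]".
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--accum-freq", type=int, default=1)
    return parser


parser = build_parser()

# The underscore spelling does not match any registered option,
# so it ends up in the list of unrecognized arguments.
args, unknown = parser.parse_known_args(["--accum_freq=1"])
print(unknown)  # ['--accum_freq=1']

# The hyphen spelling is accepted; argparse exposes it as args.accum_freq.
args_ok = parser.parse_args(["--accum-freq=4"])
print(args_ok.accum_freq)  # 4
```

With `parse_args` instead of `parse_known_args`, the first call would exit with "unrecognized arguments: --accum_freq=1", which matches the error in the log above.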
Traceback (most recent call last):
File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 346, in <module>
main()
File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 51, in main
args.local_device_rank = int(os.environ['LOCAL_RANK'])
File "/home/amax/.conda/envs/lxl/lib/python3.9/os.py", line 679, in __getitem__
raise KeyError(key) from None
KeyError: 'LOCAL_RANK'
A new parameter problem has come up. Could you please take another look?
Fix 1: in the shell script, add:
Fix 2: in main.py:
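The concrete changes for the two fixes are not preserved in this thread. As a rough illustration of the main.py route only, the sketch below reads LOCAL_RANK with a fallback instead of indexing os.environ directly. `get_local_rank` and the default of 0 are assumptions for a single-process run, not the actual change that was suggested.

```python
import os
from typing import Mapping


def get_local_rank(env: Mapping[str, str] = os.environ, default: int = 0) -> int:
    """Read LOCAL_RANK, falling back to `default` when no launcher has set it."""
    # os.environ['LOCAL_RANK'] raises KeyError when the script is started with
    # plain `python` instead of torchrun; .get() avoids that.
    return int(env.get("LOCAL_RANK", default))


local_device_rank = get_local_rank()
```

When the script is launched through torchrun, LOCAL_RANK is already set and the fallback is never used.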
Traceback (most recent call last):
File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 346, in <module>
main()
File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 55, in main
dist.init_process_group(backend="nccl")
File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 224, in _env_rendezvous_handler
world_size = int(_get_env_or_raise("WORLD_SIZE"))
File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 203, in _get_env_or_raise
raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set
The problems keep coming, one after another...
The error says an environment variable is not set. You can configure environment variables like this: adding export WORLD_SIZE=xx is enough.
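The env:// rendezvous used by torch.distributed reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment, so setting WORLD_SIZE alone may just surface the next missing variable. A minimal sketch of filling them in from Python before init_process_group is called, assuming a single-process run on one machine (the helper name, address and port below are illustrative defaults, not values from Chinese-CLIP):

```python
import os
from typing import Dict, MutableMapping

# Assumed defaults for a single-process run on one machine.
SINGLE_PROCESS_DEFAULTS: Dict[str, str] = {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
    "RANK": "0",
    "LOCAL_RANK": "0",
    "WORLD_SIZE": "1",
}


def ensure_distributed_env(env: MutableMapping[str, str] = os.environ) -> Dict[str, str]:
    """Set missing variables only; values already set by a launcher are kept."""
    applied: Dict[str, str] = {}
    for key, value in SINGLE_PROCESS_DEFAULTS.items():
        if key not in env:
            env[key] = value
            applied[key] = value
    return applied


# Demonstration on a plain dict so the real environment is left untouched.
demo_env = {"WORLD_SIZE": "4"}
print(ensure_distributed_env(demo_env))
```

The same effect can be had with export lines in the shell script; the point is only that every variable the rendezvous expects has some value.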
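The main problem is basically solved and I can start training now. Thank you for your patient guidance over the past several days; words cannot express my gratitude~ [fist-palm salute]
Running the sh script always fails with an unrecognized argument, main.py: error: unrecognized arguments: --accum-freq=1, even though the script is identical to the example.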