The Python version is 3.8.13.

```
>>> import torch
>>> print(torch.version.cuda)
11.7
>>> print(torch.__version__)
1.13.0a0+340c412
```
Here is the parameter configuration of the corresponding shell script:

```bash
GPUS_PER_NODE=1
WORKER_CNT=1
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=8514
export RANK=0
export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip/

DATAPATH=${1}

train_data=${DATAPATH}/datasets/MUGE/lmdb/train
val_data=${DATAPATH}/datasets/MUGE/lmdb/valid # if val_data is not specified, the validation will be automatically disabled

resume=${DATAPATH}/pretrained_weights/clip_cn_vit-b-16.pt # or specify your customed ckpt path to resume
reset_data_offset="--reset-data-offset"
reset_optimizer="--reset-optimizer"

output_base_dir=${DATAPATH}/experiments/
name=muge_finetune_vit-b-16_roberta-base_bs128_8gpu
save_step_frequency=999999 # disable it
save_epoch_frequency=1
log_interval=1
report_training_batch_acc="--report-training-batch-acc"

context_length=52
warmup=100
batch_size=128
valid_batch_size=128
accum_freq=1
lr=5e-5
wd=0.001
max_epochs=3
valid_step_interval=150
valid_epoch_interval=1
vision_model=ViT-B-16
text_model=RoBERTa-wwm-ext-base-chinese
use_augment="--use-augment"

python -m torch.distributed.launch --use_env --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
    --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} cn_clip/training/main.py \
    --train-data=${train_data} \
    --val-data=${val_data} \
    --resume=${resume} \
    ${reset_data_offset} \
    ${reset_optimizer} \
    --logs=${output_base_dir} \
    --name=${name} \
    --save-step-frequency=${save_step_frequency} \
    --save-epoch-frequency=${save_epoch_frequency} \
    --log-interval=${log_interval} \
    ${report_training_batch_acc} \
    --context-length=${context_length} \
    --warmup=${warmup} \
    --batch-size=${batch_size} \
    --valid-batch-size=${valid_batch_size} \
    --valid-step-interval=${valid_step_interval} \
    --valid-epoch-interval=${valid_epoch_interval} \
    --accum-freq=${accum_freq} \
    --lr=${lr} \
    --wd=${wd} \
    --max-epochs=${max_epochs} \
    --vision-model=${vision_model} \
    ${use_augment} \
    --text-model=${text_model}
```
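As an aside, because the command above launches with `--use_env`, the worker does not receive a `--local_rank` argument; it is expected to read its rank from environment variables that the launcher exports. A minimal sketch of that pattern (illustrative only, assuming the standard `env://` rendezvous, not the repository's exact code):

```python
import os
import torch
import torch.distributed as dist

# With --use_env (and with torchrun), the launcher exports LOCAL_RANK,
# RANK, and WORLD_SIZE instead of passing --local_rank on the command line.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# init_process_group with the default env:// rendezvous reads MASTER_ADDR,
# MASTER_PORT, RANK, and WORLD_SIZE from the environment set by the script.
dist.init_process_group(backend="nccl")
```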
Hi, the cause appears to be that your `torch.__version__` is `1.13.0a0+340c412`: when `v1 = [int(entry) for entry in version1.split("+")[0].split(".")]` runs, the last component produced by `split(".")` is `0a0`, which cannot be converted to an integer, hence the error.

https://github.com/OFA-Sys/Chinese-CLIP/blob/2c38d03557e50eadc72972b272cebf840dbc34ea/cn_clip/training/main.py#L134

Changing this line directly to `find_unused_parameters = False` should fix it.
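To illustrate the failure, here is a minimal reproduction, together with a more tolerant version parser (the regex-based helper below is an illustrative sketch, not code from the repository):

```python
import re

# Reproduce the crash: a PyTorch dev build reports a version such as
# "1.13.0a0+340c412", whose last dotted component is the non-numeric "0a0".
version = "1.13.0a0+340c412"
print(version.split("+")[0].split("."))  # ['1', '13', '0a0']
# int("0a0") -> ValueError: invalid literal for int() with base 10: '0a0'

def parse_version(v):
    """Hypothetical tolerant parser: keep only the leading digits of each
    dotted component, so '1.13.0a0+340c412' becomes [1, 13, 0]."""
    nums = []
    for entry in v.split("+")[0].split("."):
        m = re.match(r"\d+", entry)
        nums.append(int(m.group()) if m else 0)
    return nums

# Lexicographic list comparison reproduces the <= check from main.py:
print(parse_version("1.13.0a0+340c412") <= parse_version("1.8.0"))  # False
```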
Thank you 🙇, the issue is resolved.
Hi, I placed the datasets in the corresponding folders as instructed, configured the parameters, and ran the shell script, but it failed. Could you please take a look at what the cause is?

```
bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh datadata
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
Loading vision model config from cn_clip/clip/model_configs/ViT-B-16.json
Loading text model config from cn_clip/clip/model_configs/RoBERTa-wwm-ext-base-chinese.json
Traceback (most recent call last):
  File "cn_clip/training/main.py", line 301, in <module>
    main()
  File "cn_clip/training/main.py", line 134, in main
    find_unused_parameters = torch_version_str_compare_lessequal(torch.__version__, "1.8.0")
  File "cn_clip/training/main.py", line 40, in torch_version_str_compare_lessequal
    v1 = [int(entry) for entry in version1.split("+")[0].split(".")]
  File "cn_clip/training/main.py", line 40, in <listcomp>
    v1 = [int(entry) for entry in version1.split("+")[0].split(".")]
ValueError: invalid literal for int() with base 10: '0a0'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 29322) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
cn_clip/training/main.py FAILED
Failures:
```