OFA-Sys / Chinese-CLIP

Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
MIT License

Script execution error #124

Closed huhuhuqia closed 1 year ago

huhuhuqia commented 1 year ago

Hello, I placed the dataset in the corresponding folders as instructed and ran the shell script after configuring the parameters, but it failed. Could you please take a look at what the cause might be?

```
bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh datadata
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
Loading vision model config from cn_clip/clip/model_configs/ViT-B-16.json
Loading text model config from cn_clip/clip/model_configs/RoBERTa-wwm-ext-base-chinese.json
Traceback (most recent call last):
  File "cn_clip/training/main.py", line 301, in <module>
    main()
  File "cn_clip/training/main.py", line 134, in main
    find_unused_parameters = torch_version_str_compare_lessequal(torch.__version__, "1.8.0")
  File "cn_clip/training/main.py", line 40, in torch_version_str_compare_lessequal
    v1 = [int(entry) for entry in version1.split("+")[0].split(".")]
  File "cn_clip/training/main.py", line 40, in <listcomp>
    v1 = [int(entry) for entry in version1.split("+")[0].split(".")]
ValueError: invalid literal for int() with base 10: '0a0'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 29322) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
cn_clip/training/main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-29_00:05:28
  host      : task-20230528140505-13208
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 29322)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
huhuhuqia commented 1 year ago

The Python version is 3.8.13.

```
>>> import torch
>>> print(torch.version.cuda)
11.7
>>> print(torch.__version__)
1.13.0a0+340c412
```
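
To see why this particular version string breaks the check, here is a minimal standalone reproduction (my own illustration, reusing the same split logic as the helper shown in the traceback; not code from the repository):

```python
# Minimal reproduction of the crash: the helper assumes every dot-separated
# component of torch.__version__ is an integer, which fails for "0a0".
version = "1.13.0a0+340c412"  # the torch.__version__ reported above

parts = version.split("+")[0].split(".")
print(parts)  # ['1', '13', '0a0']

try:
    v1 = [int(entry) for entry in parts]
except ValueError as exc:
    print(exc)  # invalid literal for int() with base 10: '0a0'
```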

huhuhuqia commented 1 year ago

Here is the parameter configuration of the corresponding shell file:

```bash
GPUS_PER_NODE=1
WORKER_CNT=1
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=8514
export RANK=0
export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip/

DATAPATH=${1}

train_data=${DATAPATH}/datasets/MUGE/lmdb/train
val_data=${DATAPATH}/datasets/MUGE/lmdb/valid # if val_data is not specified, the validation will be automatically disabled

resume=${DATAPATH}/pretrained_weights/clip_cn_vit-b-16.pt # or specify your custom ckpt path to resume
reset_data_offset="--reset-data-offset"
reset_optimizer="--reset-optimizer"

output_base_dir=${DATAPATH}/experiments/
name=muge_finetune_vit-b-16_roberta-base_bs128_8gpu
save_step_frequency=999999 # disable it
save_epoch_frequency=1
log_interval=1
report_training_batch_acc="--report-training-batch-acc"

context_length=52
warmup=100
batch_size=128
valid_batch_size=128
accum_freq=1
lr=5e-5
wd=0.001
max_epochs=3
valid_step_interval=150
valid_epoch_interval=1
vision_model=ViT-B-16
text_model=RoBERTa-wwm-ext-base-chinese
use_augment="--use-augment"

python -m torch.distributed.launch --use_env --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
       --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} cn_clip/training/main.py \
       --train-data=${train_data} \
       --val-data=${val_data} \
       --resume=${resume} \
       ${reset_data_offset} \
       ${reset_optimizer} \
       --logs=${output_base_dir} \
       --name=${name} \
       --save-step-frequency=${save_step_frequency} \
       --save-epoch-frequency=${save_epoch_frequency} \
       --log-interval=${log_interval} \
       ${report_training_batch_acc} \
       --context-length=${context_length} \
       --warmup=${warmup} \
       --batch-size=${batch_size} \
       --valid-batch-size=${valid_batch_size} \
       --valid-step-interval=${valid_step_interval} \
       --valid-epoch-interval=${valid_epoch_interval} \
       --accum-freq=${accum_freq} \
       --lr=${lr} \
       --wd=${wd} \
       --max-epochs=${max_epochs} \
       --vision-model=${vision_model} \
       ${use_augment} \
       --text-model=${text_model}
```
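
As an aside, the FutureWarning in the log above suggests migrating from `torch.distributed.launch` to `torchrun`. A possible equivalent launch line, sketched under the assumption that the rest of the script stays unchanged (the script already passes `--use_env`, whose behavior `torchrun` applies by default):

```bash
# Hypothetical torchrun replacement for the torch.distributed.launch call above.
# torchrun exports LOCAL_RANK/RANK itself, matching the --use_env behavior.
torchrun --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
         --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} \
         cn_clip/training/main.py \
         --train-data=${train_data} \
         --val-data=${val_data} \
         --resume=${resume}
# ...followed by the same remaining training flags as in the original command.
```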

DtYXs commented 1 year ago

Hello, this looks like your `torch.__version__` is `1.13.0a0`: when `v1 = [int(entry) for entry in version1.split("+")[0].split(".")]` is executed, `split(".")` leaves a final component `0a0` that cannot be converted to an integer, which causes the error. https://github.com/OFA-Sys/Chinese-CLIP/blob/2c38d03557e50eadc72972b272cebf840dbc34ea/cn_clip/training/main.py#L134 Simply changing that line to `find_unused_parameters = False` should fix it.
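
As a more general alternative to hard-coding `False`, the helper could tolerate pre-release and build suffixes. A minimal sketch of such a rewrite (my own suggestion, not code from the repository; only the function name matches the helper in the traceback):

```python
import re

def torch_version_str_compare_lessequal(version1: str, version2: str) -> bool:
    """Return True if version1 <= version2, tolerating strings like '1.13.0a0+340c412'."""
    def to_tuple(version: str) -> tuple:
        numbers = []
        # Drop local build metadata ("+340c412"), then keep only the leading
        # digits of each component so "0a0" is read as 0.
        for part in version.split("+")[0].split("."):
            match = re.match(r"\d+", part)
            if match:
                numbers.append(int(match.group()))
        return tuple(numbers)
    return to_tuple(version1) <= to_tuple(version2)

print(torch_version_str_compare_lessequal("1.13.0a0+340c412", "1.8.0"))  # False
print(torch_version_str_compare_lessequal("1.7.1", "1.8.0"))             # True
```

If an extra dependency is acceptable, `packaging.version.parse` also handles such version strings out of the box.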

huhuhuqia commented 1 year ago

Thanks 🙇, this has been resolved.