OFA-Sys / Chinese-CLIP

Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
MIT License

Training script fails with main.py: error: unrecognized arguments: --local-rank=1 #94

Open 6Roy opened 1 year ago

6Roy commented 1 year ago

Hello, I am trying to train on a single machine with 6 GPUs, but it keeps failing and I cannot tell what is going wrong.

[0] NVIDIA GeForce RTX 3090 | 64°C, 100 % | 9619 / 24268 MB | root:bzminer/89389(7677M) yuxiang:python/22067(329M) yuxiang:python/22068(321M) yuxiang:python/22069(329M) yuxiang:python/22066(327M) yuxiang:python/22071(319M) yuxiang:python/22064(327M)

All 6 processes end up running on a single GPU, and I do not know why.

if true; then
nohup python -u -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} \
    --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} /home/yuxiang/Chinese-CLIP/cn_clip/training/main.py \
    --train-data=${train_data} \
    --val-data=${val_data} \
    --resume=${resume} \
    ${reset_data_offset} \
    ${reset_optimizer} \
    --logs=${output_base_dir} \
    --name=${name} \
    --save-step-frequency=${save_step_frequency} \
    --save-epoch-frequency=${save_epoch_frequency} \
    --log-interval=${log_interval} \
    ${report_training_batch_acc} \
    --context-length=${context_length} \
    --warmup=${warmup} \
    --batch-size=${batch_size} \
    --valid-batch-size=${valid_batch_size} \
    --valid-step-interval=${valid_step_interval} \
    --valid-epoch-interval=${valid_epoch_interval} \
    --accum-freq=${accum_freq} \
    --lr=${lr} \
    --wd=${wd} \
    --max-epochs=${max_epochs} \
    --vision-model=${vision_model} \

DtYXs commented 1 year ago

Hello, could you tell us which PyTorch version you are using? You could also try the answer in this issue and see whether it resolves the problem: https://github.com/OFA-Sys/Chinese-CLIP/issues/74#issuecomment-1490167365

6Roy commented 1 year ago

My PyTorch version is 2.0.0.

DtYXs commented 1 year ago

Hello, you could try PyTorch 1.10, which should run normally. If you want to run on 2.0.0, make the following code changes and then try again:

1. In the .sh script, change python3 -m torch.distributed.launch to torchrun.
   https://github.com/OFA-Sys/Chinese-CLIP/blob/0925e08e53b239559da6477b8fbbde62130ea15c/run_scripts/muge_finetune_vit-b-16_rbt-base.sh#L60
2. In cn_clip/training/main.py, line 51, change args.local_device_rank = max(args.local_rank, 0) to args.local_device_rank = int(os.environ['LOCAL_RANK']).
   https://github.com/OFA-Sys/Chinese-CLIP/blob/0925e08e53b239559da6477b8fbbde62130ea15c/cn_clip/training/main.py#L51
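
For illustration, change 2 can be sketched as a small helper. torchrun exports the LOCAL_RANK environment variable to each worker process, so the helper prefers that value and only falls back to the command-line value when the variable is absent. The function name resolve_local_rank and its parameters are hypothetical and not part of Chinese-CLIP; this is a sketch of the idea, not the repository's code.

```python
import os


def resolve_local_rank(cli_local_rank=-1, environ=None):
    """Pick the per-process GPU index (hypothetical helper, not from the repo).

    Prefers the LOCAL_RANK environment variable that torchrun sets for each
    worker; otherwise falls back to the value parsed from the command line,
    clamped to 0 as in the original max(args.local_rank, 0).
    """
    env = os.environ if environ is None else environ
    if "LOCAL_RANK" in env:
        return int(env["LOCAL_RANK"])
    return max(cli_local_rank, 0)
```

If every worker resolved the same local rank (for example, because the flag was never parsed and a shared default was used), all of them would select the same device, which would be consistent with the six processes on GPU 0 reported above.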

kkiskkk commented 1 year ago

In /cn_clip/training/params.py, add: parser.add_argument("--local_rank", type=int, default=1)
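
Note that the error message in the title shows the launcher passing the hyphenated spelling --local-rank, while the line above registers the underscore spelling. A possible variant, shown here as a standalone sketch rather than the project's actual params.py, registers both spellings on one argparse option. The build_parser function and the default of -1 are assumptions made for this example (the default is chosen so that max(args.local_rank, 0) still yields 0 when the flag is absent).

```python
import argparse


def build_parser():
    """Build a parser that accepts either spelling of the local-rank flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--local_rank",
        "--local-rank",  # spelling seen in the error message in this issue
        dest="local_rank",
        type=int,
        default=-1,  # assumed default; -1 means "not provided by a launcher"
    )
    return parser


# Both spellings populate the same attribute.
hyphen_args = build_parser().parse_args(["--local-rank=1"])
underscore_args = build_parser().parse_args(["--local_rank=2"])
no_flag_args = build_parser().parse_args([])
```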