OFA-Sys / Chinese-CLIP

Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.

muge + finetune + distllation fails to run, seeking help #225

Open done520 opened 10 months ago

done520 commented 10 months ago

The run fails; seeking help.

Error message:

Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
python3: can't open file '…': [Errno 2] No such file or directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 36860) of binary: /home/user/anaconda3/envs/jina3/bin/python3
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/jina3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user/anaconda3/envs/jina3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/anaconda3/envs/jina3/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/user/anaconda3/envs/jina3/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/user/anaconda3/envs/jina3/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/user/anaconda3/envs/jina3/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/home/user/anaconda3/envs/jina3/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/anaconda3/envs/jina3/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
 FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-01_07:59:28
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 36860)
  error_file: <N/A>
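
The root cause in this log is exit code 2 from python3: it could not open the training entry file at all, which means the relative path cn_clip/training/main.py does not resolve from the directory the script was launched in. Below is a minimal sanity-check sketch, assuming the launch happens from the root of a Chinese-CLIP checkout; the checkout path is hypothetical.

```bash
# Hedged sanity check (paths are illustrative): the script path handed to the launcher
# must exist relative to the current working directory, otherwise python3 exits with code 2.
cd /path/to/Chinese-CLIP            # hypothetical checkout location
ls cn_clip/training/main.py         # should list the file; "No such file or directory" reproduces the failure
echo ${PYTHONPATH}                  # should end with .../Chinese-CLIP/cn_clip after the export in the run script
```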

Parameter configuration:

#!/usr/bin/env
GPUS_PER_NODE=1
# Number of GPU workers, for single-worker training, please set to 1
WORKER_CNT=1
# The ip address of the rank-0 worker, for single-worker training, please set to localhost
export MASTER_ADDR=localhost
# The port for communication
export MASTER_PORT=8514
# The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0
export RANK=0 

export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip/

DATAPATH=${1}

# data options
train_data=${DATAPATH}/datasets/MUGE/lmdb/train
val_data=${DATAPATH}/datasets/MUGE/lmdb/valid # if val_data is not specified, the validation will be automatically disabled

# restore options
resume=${DATAPATH}/pretrained_weights/clip_cn_vit-b-16.pt # or specify your customed ckpt path to resume
reset_data_offset="--reset-data-offset"
reset_optimizer="--reset-optimizer"
# reset_optimizer=""

# output options
output_base_dir=${DATAPATH}/experiments/
name=muge_finetune_vit-b-16_roberta-base_bs128_8gpu
save_step_frequency=999999 # disable it
save_epoch_frequency=1
log_interval=1
report_training_batch_acc="--report-training-batch-acc"
# report_training_batch_acc=""

# training hyper-params
context_length=52
warmup=100
batch_size=128
valid_batch_size=128
accum_freq=1
lr=5e-5
wd=0.001
max_epochs=3 # or you can alternatively specify --max-steps
valid_step_interval=150
valid_epoch_interval=1
vision_model=ViT-B-16
text_model=RoBERTa-wwm-ext-base-chinese
use_augment="--use-augment"
distllation="--distllation"
teacher_model_name="damo/multi-modal_team-vit-large-patch14_multi-modal-similarity"
# use_augment=""

python3 -m torch.distributed.launch --use_env --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
          --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} cn_clip/training/main.py \
          --train-data=${train_data} \
          --val-data=${val_data} \
          --resume=${resume} \
          ${reset_data_offset} \
          ${reset_optimizer} \
          --logs=${output_base_dir} \
          --name=${name} \
          --save-step-frequency=${save_step_frequency} \
          --save-epoch-frequency=${save_epoch_frequency} \
          --log-interval=${log_interval} \
          ${report_training_batch_acc} \
          --context-length=${context_length} \
          --warmup=${warmup} \
          --batch-size=${batch_size} \
          --valid-batch-size=${valid_batch_size} \
          --valid-step-interval=${valid_step_interval} \
          --valid-epoch-interval=${valid_epoch_interval} \
          --accum-freq=${accum_freq} \
          --lr=${lr} \
          --wd=${wd} \
          --max-epochs=${max_epochs} \
          --vision-model=${vision_model} \
          ${use_augment} \
          --text-model=${text_model} \
          ${distllation} \
          --teacher-model-name=${teacher_model_name}
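
For reference, a minimal usage sketch for a script like the one above, assuming it is saved under run_scripts/ in a Chinese-CLIP checkout (the script filename and paths below are illustrative); DATAPATH is consumed as the first positional argument via DATAPATH=${1}.

```bash
# Hedged usage sketch: start from the repo root so that cn_clip/training/main.py resolves,
# and pass the data root as the first positional argument.
cd /path/to/Chinese-CLIP                                            # hypothetical checkout location
bash run_scripts/muge_finetune_vit-b-16_rbt-base_distillation.sh /path/to/datapath
```

As a side note on the deprecation warning in the log: with torch 1.13, torchrun can replace `python3 -m torch.distributed.launch --use_env`, since torchrun sets LOCAL_RANK in the environment by default; this is unrelated to the exit-code-2 failure itself.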

Environment information:

torch 1.13.1, torchvision 0.14.1
Linux dev 5.4.0-152-generic #169~18.04.1-Ubuntu SMP Wed Jun 7 22:22:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Single GPU: NVIDIA GeForce RTX 2080 Ti, Driver Version: 530.30.02, CUDA Version: 12.1

Jaskr16 commented 10 months ago

Hi, could you share the exact command you ran? Also, does the basic fine-tuning script muge_finetune_vit-b-16_rbt-base.sh run normally?

erlan-11 commented 5 months ago

My problem is that image_b64 is empty.