THUDM / SwissArmyTransformer

SwissArmyTransformer is a flexible and powerful library to develop your own Transformer variants.
https://THUDM.github.io/SwissArmyTransformer
Apache License 2.0

deepspeed distributed training raises sat ValueError "inconsistent" #149

Open elesun2018 opened 11 months ago

elesun2018 commented 11 months ago

When running multi-node, multi-GPU distributed training with a deepspeed hostfile, the following error occurs:

Traceback (most recent call last):
worker0:   File "finetune_XrayGLM.py", line 173, in <module>
worker0:     args = get_args(args_list)
worker0:   File "/home/sfz/soft/miniconda3/envs/test/lib/python3.8/site-packages/sat/arguments.py", line 360, in get_args
worker0:     raise ValueError(
worker0: ValueError: LOCAL_RANK (default 0) and args.device inconsistent. This can only happens in inference mode. Please use CUDA_VISIBLE_DEVICES=x for single-GPU training.
worker0: [2023-12-14 14:49:37,662] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 9305
worker0: [2023-12-14 14:49:37,663] [ERROR] [launch.py:321:sigkill_handler] ['/home/sfz/soft/miniconda3/envs/test/bin/python', '-u', 'finetune_XrayGLM.py', '--local_rank=0', '--experiment-name', 'finetune-CityGLM', '--model-parallel-size', '2', '--mode', 'finetune', '--train-iters', '10000', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--pre_seq_len', '4', '--train-data', './data/changjing9/data.json', '--valid-data', './data/changjing9/data.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '2000', '--eval-interval', '2000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '4', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '6', '--skip-init', '--fp16', '--use_lora'] exits with return code = 1
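[Editor's note] For context on the error itself, below is a minimal sketch of the kind of consistency check the traceback points to in sat/arguments.py. This is an illustration reconstructed from the error message, not sat's actual source; the device variable here is a hypothetical stand-in for whatever args.device resolves to.

import os

# Minimal sketch, not sat's actual code: the deepspeed launcher sets
# LOCAL_RANK (and passes --local_rank) per process; if the device index
# the parsed arguments resolve to disagrees with it, get_args raises.
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # the "default 0" in the error
device = 0  # hypothetical stand-in for args.device

if local_rank != device:
    raise ValueError(
        "LOCAL_RANK (default 0) and args.device inconsistent. "
        "This can only happens in inference mode. "
        "Please use CUDA_VISIBLE_DEVICES=x for single-GPU training."
    )

As the error message itself suggests, a single-GPU run can avoid the mismatch by pinning the visible device, e.g. CUDA_VISIBLE_DEVICES=0 python finetune_XrayGLM.py ... (the 0 is an example value for the x placeholder).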

1049451037 commented 11 months ago

XrayGLM-related issues need to be raised in the XrayGLM repository, since we are not really familiar with how their code is written...