THUDM / ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model
Apache License 2.0

[BUG/Help] Offline full-parameter fine-tuning fails with NCCL error: Broken pipe #1115

Open Kouuh opened 1 year ago

Kouuh commented 1 year ago

Is there an existing issue for this?

Current Behavior

Error message: [4] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe

Expected Behavior

No response

Steps To Reproduce

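I am running full-parameter fine-tuning inside a container on a single machine with multiple GPUs. The only change I made was to `--num_gpus` in `ds_train_finetune.sh`; all other code is unchanged. With `num_gpus` set to 4, on four A100 (40 GB) cards, the job runs normally, but GPU memory is insufficient. After changing `num_gpus` to 6, it fails with: RuntimeError: [4] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe. Has anyone run into this error, or does anyone know how to fix it?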
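One thing that may be worth ruling out when moving from 4 to 6 processes is whether the container actually exposes six GPUs. This is only an assumption about a possible cause, not a confirmed diagnosis of the Broken pipe error. The sketch below compares a requested GPU count with the count PyTorch reports; the helper names `parse_visible_devices` and `check_gpu_request` are hypothetical, and the value 6 is taken from the reproduction steps above.

```python
import os

try:
    import torch  # optional here: only needed for the live check below
except ImportError:
    torch = None


def parse_visible_devices(value):
    """Return the device IDs listed in a CUDA_VISIBLE_DEVICES string, or None if unset."""
    if value is None:
        return None
    return [d.strip() for d in value.split(",") if d.strip()]


def check_gpu_request(num_gpus, visible_count):
    """Compare the requested process count with the number of visible GPUs."""
    if num_gpus > visible_count:
        return f"requested {num_gpus} GPUs but only {visible_count} visible"
    return "ok"


if __name__ == "__main__":
    requested = 6  # the --num_gpus value that triggers the error in this report
    print("CUDA_VISIBLE_DEVICES:",
          parse_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES")))
    if torch is None:
        print("PyTorch is not installed; cannot query the GPU count")
    else:
        print(check_gpu_request(requested, torch.cuda.device_count()))
```

If the counts match, setting the environment variable `NCCL_DEBUG=INFO` before launching makes NCCL print more detail about communicator setup, which may help locate the rank that fails first.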

Environment

- OS: Ubuntu 18.04
- Python: 3.10
- Transformers: 4.28.0
- PyTorch: 1.13
- deepspeed: 0.9.0
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

yanyanyufei1 commented 3 months ago

I ran into the same problem. How did you end up solving it?