Current Behavior
Error message: [4] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe
Expected Behavior
No response
Steps To Reproduce
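I am running full-parameter fine-tuning in a container on a single machine with multiple GPUs. The only change I made was to --num_gpus in ds_train_fintune.sh; all other code is unchanged. With num_gpus=4, training runs normally on four A100 (40G) GPUs, but GPU memory is now insufficient. After changing num_gpus to 6, I get RuntimeError: [4] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe. Has anyone run into this error, or does anyone know how to fix it?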
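One thing that may be worth ruling out first is whether the container actually exposes six GPUs to the process. This is only an assumption about a possible cause and is not confirmed by the error above. Below is a minimal sketch of that check: `count_visible_gpus` and `check_gpu_request` are hypothetical helper names made up for illustration, and the only PyTorch call used is `torch.cuda.device_count()`.

```python
import os


def count_visible_gpus():
    """Return the number of CUDA devices PyTorch can see, or None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.device_count()


def check_gpu_request(requested, available):
    """Compare the requested --num_gpus value with the visible GPU count.

    Returns a tuple (ok, message). Hypothetical helper, not part of DeepSpeed.
    """
    if available is None:
        return False, "PyTorch is not importable, so the GPU count is unknown"
    if requested > available:
        return False, f"requested {requested} GPUs but only {available} are visible"
    return True, f"requested {requested} GPUs and {available} are visible"


if __name__ == "__main__":
    requested = 6  # the --num_gpus value that triggers the error in this report
    ok, message = check_gpu_request(requested, count_visible_gpus())
    # CUDA_VISIBLE_DEVICES can restrict which GPUs a process sees inside a container.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    print("OK:" if ok else "PROBLEM:", message)
```

If this reports fewer than six visible devices inside the container, the mismatch with --num_gpus would be the first thing to fix. If all six are visible, the cause lies elsewhere.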
Environment
- OS:Ubuntu 18.04
- Python:3.10
- Transformers:4.28.0
- PyTorch:1.13
- deepspeed:0.9.0
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :
Anything else?
No response