[BUG/Help] <title>Offline full-parameter fine-tuning fails with NCCL error: Broken pipe

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

Error message: [4] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe

Expected Behavior

No response

Steps To Reproduce

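I am running full-parameter fine-tuning inside a container on a single machine with multiple GPUs. The only change I made was to --num_gpus in ds_train_fintune.sh; all other code is unchanged. With num_gpus=4 on four A100 (40G) cards the job runs normally, but there is not enough GPU memory. After changing num_gpus to 6, I get RuntimeError: [4] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe. Has anyone run into this error, or does anyone know how to fix it?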
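One thing that may be worth ruling out before launching is whether the container actually exposes six GPUs. The sketch below is only a pre-launch sanity check and is not a confirmed cause of the Broken pipe error. The helper names `parse_visible_devices` and `enough_gpus` are hypothetical, the value 6 is taken from the num_gpus setting above, and the parsing is simplified (it does not model how CUDA treats invalid entries in `CUDA_VISIBLE_DEVICES`).

```python
import os


def parse_visible_devices(value):
    """Split a CUDA_VISIBLE_DEVICES string into its entries.

    Returns None when the variable is unset (no restriction applied),
    and an empty list when it is set to an empty string (no GPUs visible).
    """
    if value is None:
        return None
    return [entry.strip() for entry in value.split(",") if entry.strip()]


def enough_gpus(requested, visible_count):
    """True if at least `requested` GPUs are visible."""
    return visible_count >= requested


if __name__ == "__main__":
    requested = 6  # assumption: the --num_gpus value that triggers the error

    devices = parse_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES"))
    if devices is not None:
        count = len(devices)
    else:
        # Fall back to PyTorch's own view of the machine, if it is installed.
        try:
            import torch
            count = torch.cuda.device_count()
        except ImportError:
            count = None

    if count is None:
        print("Could not determine the number of visible GPUs.")
    elif enough_gpus(requested, count):
        print(f"{count} GPUs visible, {requested} requested: OK")
    else:
        print(f"Only {count} GPUs visible, but {requested} requested.")
```

If the check reports fewer than six GPUs, the container's device configuration would be the first place to look; if it reports six or more, the cause lies elsewhere.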

Environment

- OS:Ubuntu 18.04
- Python:3.10
- Transformers:4.28.0
- PyTorch:1.13
- deepspeed:0.9.0
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?