Closed caopulan closed 8 months ago
Seems we're trying to use NVLS when we shouldn't. Setting NCCL_NVLS_ENABLE=0
should work around the problem.
Edit: it may be that the fabric manager was restarted but the GPUs weren't reset. You may want to reset your GPUs with nvidia-smi -r and try again.
Also see Section 2.2 of this document: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
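To make the suggested workaround concrete, a minimal sketch of the steps before relaunching the job (the reset step requires root and idle GPUs, so it is shown commented out):

```shell
# Disable NVLS (NVLink SHARP) support in NCCL for this run.
export NCCL_NVLS_ENABLE=0

# Sanity-check that the variable is set before launching the job.
echo "NCCL_NVLS_ENABLE=${NCCL_NVLS_ENABLE}"

# If the fabric manager was restarted, also reset the GPUs
# (requires root; no processes may be using the GPUs):
# sudo nvidia-smi -r
```

NCCL reads the environment variable at init time, so it must be set in the environment of the process that creates the communicator.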
It works! THANKS A LOT
ok thanks
7 x H100; this error is raised when testing on more than 2 GPUs: ./build/broadcast_perf -b 8 -e 256M -f 2 -g 3
when using 2 GPUs:
when using 3 GPUs:
And I found that transport/nvls.cc:169 is:
https://github.com/NVIDIA/nccl/blob/b6d7438d3145a619f924dbbca6c96db21fab716e/src/transport/nvls.cc#L169

CUCHECK(cuMulticastBindMem(resources->mcHandle, 0/*mcOffset*/, resources->ucHandle, 0/*memOffset*/, size, 0/*flags*/));