NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

How to monitor slow nodes in ringallreduce #349

Open Richie-yan opened 4 years ago

Richie-yan commented 4 years ago

I am currently training a model with Horovod, which uses NCCL for the underlying gradient synchronization. Slow nodes show up during training. Is there any way for NCCL to identify which node on the ring is the slow one? For example, by adding timing instrumentation to the NCCL ring allreduce source code?

kwen2501 commented 4 years ago

Hi, one way to check communication performance is to use nccl-tests. Perhaps you can run the tests in a bisection manner to see whether some nodes do indeed have lower communication performance?

Alternatively, or in addition, you can use Nsight or nvprof to see whether particular nodes are slow in computation, which makes all the other nodes wait.
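
Once per-rank compute times have been read off the profiler timelines, comparing them is straightforward. The following is only a minimal sketch: the `flag_slow_ranks` helper, the `tolerance` factor, and the sample numbers are all made up for illustration and are not part of NCCL, Nsight, or nvprof.

```python
from statistics import median


def flag_slow_ranks(compute_ms, tolerance=1.2):
    """Return ranks whose median compute time exceeds tolerance x the median across ranks.

    compute_ms: dict mapping rank -> list of per-step compute times in ms
                (hypothetical values, e.g. copied by hand from profiler timelines).
    tolerance:  assumed slack factor; 1.2 is an arbitrary choice for this sketch.
    """
    # Use the median per rank so a single noisy step does not dominate.
    per_rank = {rank: median(times) for rank, times in compute_ms.items()}
    baseline = median(per_rank.values())
    return sorted(rank for rank, t in per_rank.items() if t > tolerance * baseline)


# Made-up numbers for illustration only.
samples = {
    0: [101.0, 99.5, 100.2],
    1: [100.8, 100.1, 99.9],
    2: [151.3, 149.0, 150.4],
    3: [100.0, 100.5, 99.7],
}
print(flag_slow_ranks(samples))
```

With the sample data, only rank 2 is reported, since its median compute time is well above the others.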

Richie-yan commented 4 years ago

Does nccl-tests support multi-machine testing? I haven't used nccl-tests. Does the report it generates show the allreduce time of each node? Also, as I understand it, nvprof can only monitor the time spent in CUDA calls on a single machine.

Richie-yan commented 4 years ago

Hi @kwen2501, I have had a rough look at the nccl-tests you mentioned. As I understand it, nccl-tests measures the bandwidth and latency of NCCL operations across multiple GPUs in the current physical environment. If a particular GPU is slow in that environment, it will affect the overall bandwidth and latency.

So nccl-tests is really a tool for measuring the communication performance of a multi-GPU physical environment. Is that correct? I also need to correct what I said above: nvprof can profile each rank across multiple machines, for example:

$ mpirun -np 2 -host c0-0,c0-1 nvprof -o output.%h.%p.%q{OMPI_COMM_WORLD_RANK} ./my_mpi_app

kwen2501 commented 4 years ago

nccl-tests supports multi-node runs. And yes, if one node is slow in communication, the overall performance reported by nccl-tests will be lower. That's why I suggested running nccl-tests in a bisection manner to see whether there is indeed a slower node.

Your nvprof command looks good to me.
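
The bisection idea discussed in this thread can be sketched roughly as follows. This is an illustrative outline, not an NCCL or nccl-tests feature: `find_slow_node`, `measure_busbw`, and `min_busbw` are hypothetical names, and `measure_busbw` stands in for whatever wrapper launches a benchmark (for example an nccl-tests binary such as `all_reduce_perf`) on a given host list and returns the measured bandwidth. The sketch assumes there is at most one slow node and that any group containing it measures below `min_busbw`.

```python
def find_slow_node(hosts, measure_busbw, min_busbw):
    """Narrow a host list down to one suspect by repeated halving.

    hosts:         list of hostnames (at least 3, so a healthy host is
                   always available to pair with the last candidate)
    measure_busbw: hypothetical callable; takes a list of hosts, runs a
                   benchmark across them, and returns bandwidth in GB/s
    min_busbw:     bandwidth a healthy group is expected to reach
    Returns the suspect hostname, or None if the full set looks healthy.
    """
    if len(hosts) < 3:
        raise ValueError("need at least 3 hosts to isolate one")
    if measure_busbw(list(hosts)) >= min_busbw:
        return None  # the whole cluster meets the expected bandwidth

    candidates = list(hosts)
    cleared = []  # hosts already shown to be healthy
    while len(candidates) > 1:
        mid = len(candidates) // 2
        left, right = candidates[:mid], candidates[mid:]
        # A one-host group would not exercise the network, so pair it
        # with a host that has already been cleared.
        group = right if len(right) >= 2 else right + cleared[:1]
        if measure_busbw(group) < min_busbw:
            candidates, cleared = right, cleared + left
        else:
            candidates, cleared = left, cleared + right
    return candidates[0]


# Simulated cluster for illustration: "node5" drags every group it is in.
def fake_measure(group):
    return 5.0 if "node5" in group else 10.0


cluster = [f"node{i}" for i in range(8)]
print(find_slow_node(cluster, fake_measure, min_busbw=8.0))
```

On the simulated cluster this isolates `node5` after four benchmark runs (one on the full set, three on subsets). In a real cluster the measurements are noisy, so each subset would likely need to be run several times before a half is ruled out.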