Open chgdragon2023 opened 6 months ago
It's hard to see what could be wrong without seeing the whole log file. But can you check that any NCCL environment variables you are setting are being passed to all ranks, not just rank/node 0? With mpirun, that means passing them on the command line, for example -x NCCL_DEBUG=INFO.
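To make the point about forwarding variables concrete, here is a minimal sketch that builds an mpirun command line with one -x flag per NCCL_* variable found in the environment. It assumes Open MPI's mpirun (where -x exports a variable to every rank); the helper name `build_mpirun_command`, the hostfile name `hosts.txt`, and the binary path are hypothetical placeholders, not taken from this thread.

```python
import os
import shlex


def build_mpirun_command(base_cmd, env=None, prefix="NCCL_"):
    """Return an mpirun argv that forwards every NCCL_* variable with -x.

    base_cmd: the remaining mpirun arguments and the program to launch.
    env:      mapping to read variables from (defaults to os.environ).
    """
    env = os.environ if env is None else env
    argv = ["mpirun"]
    # Sort so the generated command is stable from run to run.
    for name in sorted(env):
        if name.startswith(prefix):
            argv += ["-x", f"{name}={env[name]}"]
    return argv + list(base_cmd)


if __name__ == "__main__":
    # Hypothetical environment and launch arguments, for illustration only.
    example_env = {"NCCL_DEBUG": "INFO", "PATH": "/usr/bin"}
    cmd = build_mpirun_command(
        ["-np", "18", "--hostfile", "hosts.txt", "./all_reduce_perf"],
        env=example_env,
    )
    print(shlex.join(cmd))
```

Printing the command before running it makes it easy to confirm that every variable you set on the launch node actually appears as a -x argument.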
Another technique is to generate one log file per rank with -x NCCL_DEBUG_FILE=nccl_log.%h.%p (%h expands to the hostname and %p to the process ID), and then look for a file that looks different from, or is shorter than, the rest.
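As a rough sketch of the "look for a shorter log" step, the script below counts lines in each per-rank file and reports those well below the median. The function name `find_short_logs` and the 0.8 threshold are assumptions chosen for illustration; line count is only a crude signal, so a flagged file still needs to be read by hand.

```python
import glob
import statistics


def find_short_logs(paths, ratio=0.8):
    """Return sorted (path, line_count) pairs for logs shorter than ratio * median."""
    counts = {}
    for path in paths:
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with open(path, errors="replace") as f:
            counts[path] = sum(1 for _ in f)
    if not counts:
        return []
    median = statistics.median(counts.values())
    return sorted((p, n) for p, n in counts.items() if n < ratio * median)


if __name__ == "__main__":
    # Matches files produced by NCCL_DEBUG_FILE=nccl_log.%h.%p
    for path, n in find_short_logs(glob.glob("nccl_log.*.*")):
        print(f"{path}: {n} lines")
```

Because the hostname is part of each file name, a flagged file points directly at the node worth inspecting first.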
It turns out there is one "bad" server. With that server in the host list, the nccl-test runs only with fewer than 18 nodes. After excluding this server, the nccl-test runs beyond 18 nodes.
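One way to track down such a server, sketched here under the assumption that exactly one host is at fault, is to rerun the test with each host left out in turn; the run that succeeds names the culprit. The helper `leave_one_out` and the host names are hypothetical, and this is not necessarily how the bad server was found in this case.

```python
def leave_one_out(hosts):
    """Yield (excluded_host, remaining_hosts) for each host in turn."""
    for i, excluded in enumerate(hosts):
        yield excluded, hosts[:i] + hosts[i + 1:]


if __name__ == "__main__":
    # Hypothetical host names; replace with the real cluster's host list.
    hosts = [f"node{n:02d}" for n in range(1, 20)]
    for excluded, remaining in leave_one_out(hosts):
        # Each remaining list could be written to a hostfile and passed to mpirun.
        print(f"without {excluded}: {','.join(remaining)}")
```

With many hosts, splitting the list in halves and recursing on the failing half would need fewer runs, at the cost of slightly more bookkeeping.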
I am able to run the NCCL test on up to 17 nodes. However, it always hangs when it goes to 18 nodes. I found that the CPU usage of the master server is very high when it hangs, as shown below (CPU usage goes up to 804.3%?):
With NCCL_DEBUG=INFO, the log shows that it hangs at this place: