NVIDIA / nccl-tests

NCCL Tests

Don't call MPI_Comm_split if NCCL_TESTS_SPLIT_MASK is not set #151

Open · tstruk opened this issue 1 year ago

tstruk commented 1 year ago

There is no need to call MPI_Comm_split if the NCCL_TESTS_SPLIT_MASK variable is not set. It only causes unnecessary traffic with no real effect, and in some cases it might cause the ranks to be invalid.
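For reference, the proposed change would amount to something like the sketch below. The variable names (proc, color, mpi_comm, ncclProc, ncclProcs) are taken from src/common.cu; the exact placement and the handling of mpi_comm at teardown are assumptions, not the actual patch.

/* Sketch: only create a split communicator when NCCL_TESTS_SPLIT_MASK is
 * set; otherwise reuse MPI_COMM_WORLD and skip the extra collective. */
if (getenv("NCCL_TESTS_SPLIT_MASK") != NULL) {
  MPI_Comm_split(MPI_COMM_WORLD, color, proc, &mpi_comm);
} else {
  /* No mask: all ranks stay together. MPI_Comm_dup(MPI_COMM_WORLD, &mpi_comm)
   * would also work if the communicator is freed later. */
  mpi_comm = MPI_COMM_WORLD;
}
MPI_Comm_size(mpi_comm, &ncclProcs);
MPI_Comm_rank(mpi_comm, &ncclProc);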

sjeaugey commented 1 year ago

in some cases it might cause the ranks to be invalid

Can you say more about this?

tstruk commented 1 year ago

Can you say more about this?

Hi, yes. This is running on MPICH 4.1.2 and libfabric 1.18 with a custom provider, NCCL version 2.18.3, across 4 nodes. Sometimes I see that after calling MPI_Comm_split() with the color parameter equal to 0, the rank on one of the nodes is incorrect. The color is 0 because I don't set the NCCL_TESTS_SPLIT_MASK environment variable. Here is the debug code I added:

diff --git a/src/common.h b/src/common.h
index 20fa461..d2359ea 100644
--- a/src/common.h
+++ b/src/common.h
@@ -282,5 +282,11 @@ static int ncclstringtoop (char *str) {
 extern int is_main_proc;
 extern thread_local int is_main_thread;
 #define PRINT if (is_main_thread) printf
+#define PRINT_ALL_NODES(STR, ...)      \
+({                                     \
+  char hn[256];                        \
+  gethostname(hn, 256);                \
+  printf("%s: " STR "\n", hn, ## __VA_ARGS__); \
+})
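One small assumption in the hunk above: gethostname() is declared in <unistd.h>, so if common.h does not already pull that header in (directly or transitively), the macro would also need:

#include <unistd.h>   /* for gethostname() used by PRINT_ALL_NODES */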

and

diff --git a/src/common.cu b/src/common.cu
index 48a629c..a7531f6 100644
--- a/src/common.cu
+++ b/src/common.cu
@@ -852,6 +852,8 @@ testResult_t run() {
 #ifdef MPI_SUPPORT
   MPI_Comm_size(MPI_COMM_WORLD, &totalProcs);
   MPI_Comm_rank(MPI_COMM_WORLD, &proc);
+PRINT_ALL_NODES("proc %d, totalProcs %d\n", proc, totalProcs);
+
   uint64_t hostHashs[totalProcs];
   hostHashs[proc] = getHostHash(hostname);
   MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, hostHashs, sizeof(uint64_t), MPI_BYTE, MPI_COMM_WORLD);
@@ -867,6 +869,8 @@ testResult_t run() {
   MPI_Comm_split(MPI_COMM_WORLD, color, proc, &mpi_comm);
   MPI_Comm_size(mpi_comm, &ncclProcs);
   MPI_Comm_rank(mpi_comm, &ncclProc);
+PRINT_ALL_NODES("ncclProc %d, ncclProcs %d\n", ncclProc, ncclProcs);
+
 #endif
   is_main_thread = is_main_proc = (proc == 0) ? 1 : 0;

and here is what I see:

Node1: proc 1, totalProcs 4 
Node3: proc 3, totalProcs 4 
Node2: proc 2, totalProcs 4 
Node0: proc 0, totalProcs 4 
Node1: ncclProc 1, ncclProcs 4 
Node3: ncclProc 3, ncclProcs 4 
Node2: ncclProc 0, ncclProcs 2 
Node0: ncclProc 0, ncclProcs 4 

The MPICH documentation [1] says the color should be a "non-negative integer", so 0 should be fine, but it is not: with color 0 on every rank, all four processes should end up in a single 4-rank communicator with the same rank numbering as MPI_COMM_WORLD, yet Node2 reports ncclProc 0 and ncclProcs 2. When this happens the test hangs after the NCCL init phase. With the fix applied everything works as expected.

[1] - https://www.mpich.org/static/docs/v4.1.2/www3/MPI_Comm_split.html
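For comparison, here is a minimal standalone check (not part of nccl-tests; the file name and output format are made up for illustration) of what MPI_Comm_split() with color 0 on every rank is expected to produce: one sub-communicator with the same size and rank numbering as MPI_COMM_WORLD on every node.

/* split_check.c
 * Build: mpicc split_check.c -o split_check
 * Run:   mpirun -np 4 ./split_check
 * Expected: on every node, the split rank/size match the world rank/size. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int proc, totalProcs;
  MPI_Comm_size(MPI_COMM_WORLD, &totalProcs);
  MPI_Comm_rank(MPI_COMM_WORLD, &proc);

  /* Same parameters as nccl-tests when NCCL_TESTS_SPLIT_MASK is unset:
   * color 0 on every rank, world rank as the key. */
  MPI_Comm split_comm;
  MPI_Comm_split(MPI_COMM_WORLD, 0, proc, &split_comm);

  int splitProc, splitProcs;
  MPI_Comm_size(split_comm, &splitProcs);
  MPI_Comm_rank(split_comm, &splitProc);

  char hn[256];
  gethostname(hn, sizeof(hn));
  printf("%s: world %d/%d  split %d/%d\n", hn, proc, totalProcs, splitProc, splitProcs);

  MPI_Comm_free(&split_comm);
  MPI_Finalize();
  return 0;
}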