tstruk opened 1 year ago
in some cases might cause the ranks to be invalid
Can you say more about this?
Hi, yes. This is running on MPICH 4.1.2 and libfabric 1.18 with a custom provider, NCCL version 2.18.3, across 4 nodes. Sometimes, after calling MPI_Comm_split() with the color parameter set to 0, the rank on one of the nodes is incorrect. The color is 0 because I don't set the NCCL_TESTS_SPLIT_MASK environment variable. Here is the debug code I added:
diff --git a/src/common.h b/src/common.h
index 20fa461..d2359ea 100644
--- a/src/common.h
+++ b/src/common.h
@@ -282,5 +282,11 @@ static int ncclstringtoop (char *str) {
extern int is_main_proc;
extern thread_local int is_main_thread;
#define PRINT if (is_main_thread) printf
+#define PRINT_ALL_NODES(STR, ...) \
+({ \
+ char hn[256]; \
+ gethostname(hn, 256); \
+ printf("%s: " STR "\n", hn, ## __VA_ARGS__); \
+})
and
diff --git a/src/common.cu b/src/common.cu
index 48a629c..a7531f6 100644
--- a/src/common.cu
+++ b/src/common.cu
@@ -852,6 +852,8 @@ testResult_t run() {
#ifdef MPI_SUPPORT
MPI_Comm_size(MPI_COMM_WORLD, &totalProcs);
MPI_Comm_rank(MPI_COMM_WORLD, &proc);
+PRINT_ALL_NODES("proc %d, totalProcs %d\n", proc, totalProcs);
+
uint64_t hostHashs[totalProcs];
hostHashs[proc] = getHostHash(hostname);
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, hostHashs, sizeof(uint64_t), MPI_BYTE, MPI_COMM_WORLD);
@@ -867,6 +869,8 @@ testResult_t run() {
MPI_Comm_split(MPI_COMM_WORLD, color, proc, &mpi_comm);
MPI_Comm_size(mpi_comm, &ncclProcs);
MPI_Comm_rank(mpi_comm, &ncclProc);
+PRINT_ALL_NODES("ncclProc %d, ncclProcs %d\n", ncclProc, ncclProcs);
+
#endif
is_main_thread = is_main_proc = (proc == 0) ? 1 : 0;
and here is what I see:
Node1: proc 1, totalProcs 4
Node3: proc 3, totalProcs 4
Node2: proc 2, totalProcs 4
Node0: proc 0, totalProcs 4
Node1: ncclProc 1, ncclProcs 4
Node3: ncclProc 3, ncclProcs 4
Node2: ncclProc 0, ncclProcs 2
Node0: ncclProc 0, ncclProcs 4
The MPICH documentation [1] says the color should be a "non-negative integer", so 0 should be OK, but here the result is wrong: Node2 gets ncclProc 0 and ncclProcs 2 while every other node reports ncclProcs 4, so the split communicator is inconsistent across ranks. When this happens the test hangs after the NCCL init phase. With the fix applied, everything works as expected.
[1] - https://www.mpich.org/static/docs/v4.1.2/www3/MPI_Comm_split.html
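For completeness, here is a minimal standalone reproducer along the same lines; this is a sketch I wrote for illustration, not code from the PR, and it assumes the same 4-rank, one-process-per-node launch:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int proc, totalProcs;
  MPI_Comm_size(MPI_COMM_WORLD, &totalProcs);
  MPI_Comm_rank(MPI_COMM_WORLD, &proc);

  /* Same call pattern as the test: every rank passes color 0,
   * so the resulting communicator should match MPI_COMM_WORLD
   * in both size and rank ordering. */
  MPI_Comm split_comm;
  MPI_Comm_split(MPI_COMM_WORLD, /*color=*/0, /*key=*/proc, &split_comm);

  int splitProc, splitProcs;
  MPI_Comm_size(split_comm, &splitProcs);
  MPI_Comm_rank(split_comm, &splitProc);

  char hn[256];
  gethostname(hn, sizeof(hn));
  printf("%s: world %d/%d, split %d/%d\n",
         hn, proc, totalProcs, splitProc, splitProcs);

  MPI_Comm_free(&split_comm);
  MPI_Finalize();
  return 0;
}

On a correctly behaving stack every line should show identical world and split numbers; any mismatch is the same symptom as the Node2 line above.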
There is no need to call MPI_Comm_split() if the NCCL_TESTS_SPLIT_MASK variable is not set. It only causes unnecessary traffic with no real effect, and in some cases it might cause the ranks to be invalid.
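To make that concrete, here is a minimal sketch of the guard (variable names taken from the diff above; maskStr is a hypothetical local, <stdlib.h> is assumed to be included, and the actual patch may differ): only call MPI_Comm_split() when NCCL_TESTS_SPLIT_MASK is set, and otherwise duplicate MPI_COMM_WORLD so mpi_comm is still a valid communicator that the rest of the code can use and free.

#ifdef MPI_SUPPORT
  char* maskStr = getenv("NCCL_TESTS_SPLIT_MASK"); /* hypothetical local name */
  if (maskStr) {
    /* A split was requested: derive the color from the mask,
     * along the lines of the existing mask handling. */
    int color = (int)(proc & strtoul(maskStr, NULL, 16));
    MPI_Comm_split(MPI_COMM_WORLD, color, proc, &mpi_comm);
  } else {
    /* No split requested: skip MPI_Comm_split entirely and duplicate
     * MPI_COMM_WORLD, which keeps ranks and sizes identical. */
    MPI_Comm_dup(MPI_COMM_WORLD, &mpi_comm);
  }
  MPI_Comm_size(mpi_comm, &ncclProcs);
  MPI_Comm_rank(mpi_comm, &ncclProc);
#endif

MPI_Comm_dup() is still collective but cheap compared to a split, and it sidesteps the code path where the inconsistent ranks were observed.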