chenyu-jiang opened this issue 1 year ago
I think you wanted to use i * 2 and i * 2 + 1 here:
for (int i = 0; i < world_size; i++) {
  recv_counts.push_back(send_counts[rank * 2]);
  recv_counts.push_back(send_counts[rank * 2 + 1]);
}
Also you may want to check return codes of all CUDA and NCCL calls (using e.g. NCCLCHECK or CUDACHECK macros). Not doing so can make things hard to debug.
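In case it helps, a typical definition of these macros (a sketch along the lines of the NCCL example code; aborting on error is just one possible policy) looks like this:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <nccl.h>

// Abort with file/line and the decoded error string on any CUDA failure.
#define CUDACHECK(cmd) do {                                   \
  cudaError_t err = (cmd);                                    \
  if (err != cudaSuccess) {                                   \
    fprintf(stderr, "CUDA error %s:%d '%s'\n",                \
            __FILE__, __LINE__, cudaGetErrorString(err));     \
    exit(EXIT_FAILURE);                                       \
  }                                                           \
} while (0)

// Same idea for NCCL calls, using ncclGetErrorString.
#define NCCLCHECK(cmd) do {                                   \
  ncclResult_t res = (cmd);                                   \
  if (res != ncclSuccess) {                                   \
    fprintf(stderr, "NCCL error %s:%d '%s'\n",                \
            __FILE__, __LINE__, ncclGetErrorString(res));     \
    exit(EXIT_FAILURE);                                       \
  }                                                           \
} while (0)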
Hi @sjeaugey, thanks for the prompt response!
I feel it is correct to use rank * 2 and rank * 2 + 1 here.

The goal is to make every rank send send_counts[0]*4 and send_counts[1]*4 bytes of data to rank 0, send_counts[2]*4 and send_counts[3]*4 bytes of data to rank 1, and so on. So rank r will receive send_counts[2*r]*4 and send_counts[2*r + 1]*4 bytes of data from every rank.
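In other words, the grouped calls look roughly like this (a simplified sketch, not the exact example code; send_buf, recv_buf and the offset bookkeeping are just illustrative names):

NCCLCHECK(ncclGroupStart());
size_t send_off = 0, recv_off = 0;
for (int i = 0; i < world_size; i++) {
  // Two chunks destined for rank i (counts are in floats, 4 bytes each).
  for (int c = 0; c < 2; c++) {
    size_t n = send_counts[2*i + c];
    NCCLCHECK(ncclSend(send_buf + send_off, n, ncclFloat, i, comm, stream));
    send_off += n;
  }
  // Two chunks coming back from rank i, both sized from our own rank's entries.
  for (int c = 0; c < 2; c++) {
    size_t n = recv_counts[2*i + c];
    NCCLCHECK(ncclRecv(recv_buf + recv_off, n, ncclFloat, i, comm, stream));
    recv_off += n;
  }
}
NCCLCHECK(ncclGroupEnd());

Note that when i == rank this issues a self send/recv inside the group, which is the case that triggers the error.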
As suggested, I have added NCCLCHECK and CUDACHECK around all CUDA and NCCL calls (as reflected in the updated example code), but none of them catches any error and the output is the same.
Ah. My bad, indeed rank is correct. Sorry I misinterpreted how you were using this array.
Thanks for the confirmation.
I can repro and I do see the local copy copying with NULL as the recvBuff. I should be able to figure this out quickly.
Ok so indeed it's a bug with how we aggregate operations; in particular for self-sendrecv, we need to ensure the send and recv are next to each other.
Setting NCCL_NCHANNELS_PER_NET_PEER=1 should work as a workaround until this is resolved.
Thanks! Will setting NCCL_NCHANNELS_PER_NET_PEER=1 affect communication speed? I am currently getting around the problem by skipping self-sendrecvs and manually adding cudaMemcpyAsync calls. Of course it would be easier if self-sendrecvs were correctly handled by NCCL.
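Roughly, that workaround looks like this (a sketch, not my exact code; the offset and size arrays are illustrative):

// Leave the self send/recv out of the NCCL group and copy the local
// chunk with cudaMemcpyAsync instead.
NCCLCHECK(ncclGroupStart());
for (int i = 0; i < world_size; i++) {
  if (i != rank) {
    NCCLCHECK(ncclSend(send_buf + send_offs[i], send_sizes[i], ncclFloat, i, comm, stream));
    NCCLCHECK(ncclRecv(recv_buf + recv_offs[i], recv_sizes[i], ncclFloat, i, comm, stream));
  }
}
NCCLCHECK(ncclGroupEnd());
// Local chunk: device-to-device copy issued on the same stream, so it is
// ordered with the NCCL kernels.
CUDACHECK(cudaMemcpyAsync(recv_buf + recv_offs[rank], send_buf + send_offs[rank],
                          send_sizes[rank] * sizeof(float),
                          cudaMemcpyDeviceToDevice, stream));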
Hi @sjeaugey, can you provide some more context on "local copy copying with NULL as the recvBuff"? Where can we identify this in the NCCL code?
It's pretty complex, and I'm trying to find a way to fix this without changing too much code.
It's tied to how we pack operations into ncclWorkElem structures. Self send/recv are supposed to be next to each other for the code to work, but in that precise case they aren't.
Here is a patch which should fix the bug.
diff --git a/src/enqueue.cc b/src/enqueue.cc
index dbb9865bc..71bf45a60 100644
--- a/src/enqueue.cc
+++ b/src/enqueue.cc
@@ -633,7 +633,6 @@ static ncclResult_t scheduleP2pTasksToPlan(
   for (int i=0; i < tasks->p2pOrderSteps; i++) {
     int sendPeer = sendOrder[i];
     int recvPeer = recvOrder[i];
-    if ((i % (NCCL_MAX_WORK_ELEMENTS_P2P/2)) == 0) fuseOk = false;
     struct ncclTaskP2p* send = sendPeer != -1 ? ncclIntruQueueHead(&peers[sendPeer].sendQueue) : NULL;
     struct ncclTaskP2p* recv = recvPeer != -1 ? ncclIntruQueueHead(&peers[recvPeer].recvQueue) : NULL;
     if (sendPeer == comm->rank) {
@@ -669,6 +668,7 @@
       if (send) sendBytes -= send->chunk*sendChunkBytesMax;
       do {
+        if ((i % (NCCL_MAX_WORK_ELEMENTS_P2P/2)) == 0) fuseOk = false;
         ssize_t recvChunkBytes = std::min(recvBytes, recvChunkBytesMax); // -1 preserved
         ssize_t sendChunkBytes = std::min(sendBytes, sendChunkBytesMax);
         if (recvChunkBytes != 0) {
Please confirm this is fixing the issue.
The goal of fuseOk is to avoid fusing operations from different nodes, and to make sure self-communication lands at the beginning of the workElem. But fuseOk was only set right for the first chunk; if we split the operation over multiple channels, the second channel may experience unwanted fusion, potentially causing hangs and breaking self-communication.
Thanks! The error is indeed gone after applying the patch.
Hi,
I am trying to implement a special AllToAllv where each rank has multiple data chunks to send to every other rank (each chunk can be of a different size) using grouped ncclSend and ncclRecv calls. However, I am encountering the error:
an illegal memory access was encountered
with some input sizes when running on multiple nodes. The following self-contained example code reproduces the error:

I am running the code on two AWS p4de instances, with NCCL version 2.18.5+cuda11.0. The compiled executable is launched through MPI on 4 GPUs in each node. Several observations:

1. If the self send/recv is skipped (by guarding the grouped calls with if (i != rank)), the error is gone.
2. If the values in send_counts are changed (e.g., to 3072), the problem disappears.

After many attempts, I am still unable to identify the error. Any help would be much appreciated!
Below is the log from the failed rank (when setting NCCL_DEBUG=INFO), for your information. (Note: EFA is disabled on the instance as I try to isolate the cause of the error. The error still exists with EFA enabled.)