Open kwen2501 opened 1 year ago
Note that in the call site, we are running multiple threads, so it is possible that it's a multithreading disaster on our end, but at the time of the segfault only one thread is making progress (everyone else is waiting on locks).
That stack trace shows ncclMemoryStackPop within ncclCommGroupLeave as failing. Those operations happen for every kernel launch, so the only way I could envision it being a bug there is if we are returning an error. Erroneous paths are not as well tested as success paths, and confidence in the correctness of the success path is very high.
A race by pytorch of multiple threads on the same nccl comm could definitely cause this. To ensure there is no race, you could wrap every nccl call with a lock specific to the comm:
ncclComm_t comm;
mutex comm_mu;
assert(comm_mu.try_lock()); // Only fails if there is a race
ncclFoo(comm);
comm_mu.unlock();
Does nccl have any facilities for helping identify a race? I was hoping the tracing would show me the push/pops on this stack, but I must be holding it wrong, NCCL_DEBUG=TRACE didn't print anything all that detailed for me.
NCCL doesn't have any race detection. Logs won't be helpful. You can either roll your own from the outside as I described, or I can push a feature branch with some race detection, but that will have a few days of delay before it would be available.
I'm not in a rush!
Its ready. See branch v2.17-racecheck.
It uses C style assert()
to catch concurrent access.
I gave this a try and it seems to deadlock now.
The racecheck code doesn't add any blocking logic (locks, or spin loops, etc). So this deadlock must have already been present just not triggered. Can you determine where the threads are stuck?
Stack trace here: https://github.com/pytorch/pytorch/issues/101160#issuecomment-1569385549
me->topFrame.below
is NULL at the segfault.