2.17.1 segfault at ncclMemoryStackPop

NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Other

3.27k stars 829 forks source link

2.17.1 segfault at ncclMemoryStackPop #858

Open kwen2501 opened 1 year ago

kwen2501 commented 1 year ago

Stack trace here: https://github.com/pytorch/pytorch/issues/101160#issuecomment-1569385549

me->topFrame.below is NULL at the segfault.

ezyang commented 1 year ago

Note that in the call site, we are running multiple threads, so it is possible that it's a multithreading disaster on our end, but at the time of the segfault only one thread is making progress (everyone else is waiting on locks).

jbachan commented 1 year ago

That stack trace shows ncclMemoryStackPop within ncclCommGroupLeave as failing. Those operations happen for every kernel launch, so the only way I could envision it being a bug there is if we are returning an error. Erroneous paths are not as well tested as success paths, and confidence in the correctness of the success path is very high.

A race by pytorch of multiple threads on the same nccl comm could definitely cause this. To ensure there is no race, you could wrap every nccl call with a lock specific to the comm:

ncclComm_t comm;
mutex comm_mu;

assert(comm_mu.try_lock()); // Only fails if there is a race
ncclFoo(comm);
comm_mu.unlock();

ezyang commented 1 year ago

Does nccl have any facilities for helping identify a race? I was hoping the tracing would show me the push/pops on this stack, but I must be holding it wrong, NCCL_DEBUG=TRACE didn't print anything all that detailed for me.

jbachan commented 1 year ago

NCCL doesn't have any race detection. Logs won't be helpful. You can either roll your own from the outside as I described, or I can push a feature branch with some race detection, but that will have a few days of delay before it would be available.

ezyang commented 1 year ago

I'm not in a rush!

jbachan commented 1 year ago

Its ready. See branch v2.17-racecheck.

It uses C style assert() to catch concurrent access.

ezyang commented 1 year ago

I gave this a try and it seems to deadlock now.

jbachan commented 1 year ago

The racecheck code doesn't add any blocking logic (locks, or spin loops, etc). So this deadlock must have already been present just not triggered. Can you determine where the threads are stuck?