Open juncgu opened 2 years ago
I've seen the same issue. Did you find a solution?
It looks like PyTorch was linked against the wrong version of NCCL (the one installed on the system rather than the one compiled locally?), and every symbol exported by libnccl.so ended up referenced somewhere inside PyTorch, even though PyTorch does not use those symbols directly. Weird.
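One way to check the wrong-library hypothesis is to compare the NCCL symbols a PyTorch library leaves undefined against the symbols a given libnccl.so actually exports. The sketch below only parses the text output of `nm -D`. The helper names (`parse_nm`, `missing_nccl_symbols`, `nm_dynamic`) and the library paths in the usage comment are hypothetical placeholders, and it assumes the plain three-column `nm -D` layout from GNU binutils.

```python
import subprocess


def parse_nm(output):
    """Split `nm -D` text output into (undefined, defined) symbol name sets."""
    undefined, defined = set(), set()
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "U":
            # Undefined symbols have no address column: "U name".
            undefined.add(parts[1].split("@")[0])
        elif len(parts) == 3 and parts[1] in ("T", "W"):
            # Exported text symbols: "address T name" (W = weak).
            defined.add(parts[2].split("@")[0])
    return undefined, defined


def missing_nccl_symbols(consumer_nm, nccl_nm):
    """Return nccl* symbols the consumer needs that libnccl does not export."""
    needed, _ = parse_nm(consumer_nm)
    _, exported = parse_nm(nccl_nm)
    return sorted(s for s in needed if s.startswith("nccl") and s not in exported)


def nm_dynamic(path):
    """Run `nm -D` on a shared library and return its text output."""
    result = subprocess.run(
        ["nm", "-D", path], capture_output=True, text=True, check=True
    )
    return result.stdout


# Usage (paths are placeholders; adjust them to your build tree):
# print(missing_nccl_symbols(nm_dynamic("/path/to/libtorch_cuda.so"),
#                            nm_dynamic("/path/to/libnccl.so")))
```

An empty result for the locally built libnccl.so but a non-empty one for the system copy (or the reverse) would suggest the two were mixed up at link time. It would not explain why the symbols are referenced in the first place.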
I've seen the same issue. Did you find a solution?
No, I switched back to an older version. Is https://github.com/pytorch/pytorch/pull/79132 ready for testing?
It seems that removing the "library slimming" fixes the issue, though I'm not sure why: https://github.com/pytorch/pytorch/pull/79132/commits/764b7e6b161971217531ab39c666842aaa5e6b0b
Environment:
I ran into the following errors when compiling PyTorch 1.10.0 with NCCL v2.12:
Earlier in the log, the output shows that the NCCL library was built successfully:
If I use an older version of NCCL (i.e., v2.11.4) in the same environment, these errors do not occur.