NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

undefined references when compiling pytorch with nccl v2.12 #639

Open juncgu opened 2 years ago

juncgu commented 2 years ago

Environment:

I ran into the following errors when compiling PyTorch 1.10.0 with NCCL v2.12:

/xxxxx/pytorch/build/lib/libtorch_cuda_cpp.so: undefined reference to `ncclNetVersion()'
/xxxxx/pytorch/build/lib/libtorch_cuda_cpp.so: undefined reference to `ncclCollNet'
/xxxxx/pytorch/build/lib/libtorch_cuda_cpp.so: undefined reference to `ncclNet'
/xxxxx/pytorch/build/lib/libtorch_cuda_cpp.so: undefined reference to `ncclGpuGdrSupport(int*)'
/xxxxx/pytorch/build/lib/libtorch_cuda_cpp.so: undefined reference to `ncclNetInit()'
collect2: error: ld returned 1 exit status
[6476/6723] Linking CXX shared library lib/libtorch.so

Earlier in the log, the NCCL library is reported as having been built:

Linking    libnccl.so.2.12.6                   > /xxxxx/pytorch/build/nccl/lib/libnccl.so.2.12.6
Archiving  libnccl_static.a                    > /xxxxx/pytorch/build/nccl/lib/libnccl_static.a

If I use an older version of NCCL (i.e., v2.11.4) in the same environment, the build completes without these errors.
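One way to narrow this down is to check whether the symbols named in the linker errors are actually defined in the NCCL library that the build produced. The sketch below is a hypothetical helper, not part of PyTorch or NCCL. It pulls the symbol names out of `undefined reference` lines and compares them with the output of GNU `nm -C --defined-only`, which is assumed to be installed on the build machine. The function names are made up for illustration.

```python
import re
import subprocess

# Matches GNU ld messages such as:
#   ...libtorch_cuda_cpp.so: undefined reference to `ncclNetInit()'
_UNDEF_RE = re.compile(r"undefined reference to [`'](?P<sym>[^']+)'")


def undefined_symbols(link_log: str) -> list:
    """Return the symbols named in 'undefined reference' lines, in order, without repeats."""
    seen = []
    for match in _UNDEF_RE.finditer(link_log):
        sym = match.group("sym")
        if sym not in seen:
            seen.append(sym)
    return seen


def defined_symbols(nm_output: str) -> set:
    """Parse `nm -C --defined-only` output into a set of demangled symbol names."""
    names = set()
    for line in nm_output.splitlines():
        # Symbol lines look like "<address> <type letter> <name>"; archive member
        # headers ("net.o:") and blank lines have fewer fields and are skipped.
        parts = line.split(None, 2)
        if len(parts) == 3 and len(parts[1]) == 1:
            names.add(parts[2].strip())
    return names


def missing_from_library(link_log: str, lib_path: str) -> list:
    """List the undefined symbols from the log that lib_path does not define.

    Assumption: lib_path is an unstripped library or static archive. For a
    stripped shared object, `nm -D` would be needed instead.
    """
    result = subprocess.run(
        ["nm", "-C", "--defined-only", lib_path],
        capture_output=True, text=True, check=True,
    )
    defined = defined_symbols(result.stdout)
    return [sym for sym in undefined_symbols(link_log) if sym not in defined]
```

Running `missing_from_library` against the `libnccl_static.a` under the build directory would show whether the archive the link step consumes still contains those symbols. An empty result would point at the link line rather than at the NCCL build itself.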

changlan commented 2 years ago

I've seen the same issue. Did you find out the solution?

sjeaugey commented 2 years ago

It looks like PyTorch was linked against the wrong version of NCCL (the one installed on the system rather than the one compiled locally?), and all exported symbols in libnccl.so were somehow referenced inside PyTorch, even though PyTorch does not use those symbols directly. Weird.
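A quick way to test the "wrong version" hypothesis is to ask each candidate libnccl.so which version it is. The sketch below is a minimal, hypothetical helper that loads a given library with `ctypes` and calls NCCL's `ncclGetVersion`. It assumes the integer encoding used by the `NCCL_VERSION` macro in nccl.h (major*10000 + minor*100 + patch from 2.9 onward, major*1000 + minor*100 + patch before that), and it assumes the library's CUDA dependencies can be resolved when it is loaded.

```python
import ctypes


def decode_nccl_version(code: int) -> tuple:
    """Split NCCL's integer version code into (major, minor, patch).

    Assumed encoding: major*10000 + minor*100 + patch for 2.9 and later,
    major*1000 + minor*100 + patch for earlier releases.
    """
    if code >= 10000:
        return code // 10000, (code % 10000) // 100, code % 100
    return code // 1000, (code % 1000) // 100, code % 100


def nccl_version_of(lib_path: str) -> tuple:
    """Load one specific libnccl.so and query its version via ncclGetVersion."""
    lib = ctypes.CDLL(lib_path)
    version = ctypes.c_int(0)
    # ncclGetVersion(int*) returns an ncclResult_t; 0 means ncclSuccess.
    status = lib.ncclGetVersion(ctypes.byref(version))
    if status != 0:
        raise RuntimeError(f"ncclGetVersion failed with ncclResult_t={status}")
    return decode_nccl_version(version.value)
```

Calling `nccl_version_of` once on the system-wide libnccl.so and once on the copy under `build/nccl/lib` would show whether the two differ. For reference, `decode_nccl_version(21206)` yields `(2, 12, 6)`.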

juncgu commented 2 years ago

> I've seen the same issue. Did you find out the solution?

No, I switched back to the older version. Is https://github.com/pytorch/pytorch/pull/79132 ready for testing?

changlan commented 2 years ago

It seems that removing the "library slimming" fixes the issue, though I'm not sure why: https://github.com/pytorch/pytorch/pull/79132/commits/764b7e6b161971217531ab39c666842aaa5e6b0b