SHI-Labs / NATTEN

Neighborhood Attention Extension. Bringing attention to a neighborhood near you!
https://shi-labs.com/natten/

CMake v3.30.2 cudart link error #154

Open Birch-san opened 1 month ago

Birch-san commented 1 month ago

As there wasn't a torch 2.4.0 wheel, I tried building NATTEN myself. It didn't go as smoothly as usual.

Most problems were due to cmake giving misleading/incomplete error messages. These are the various errors I hit along the way:
https://github.com/Birch-san/sdxl-play/pull/3#issuecomment-2267965250

Ultimately I think most of the problems here were just "my gcc and g++ alternatives didn't point anywhere after the Ubuntu upgrade", but there is one change I had to make to setup.py to get it to build, and I'm not sure why cmake wasn't able to figure this out automatically, or at least try it as a guess:

setup.py

  f"-DNATTEN_CUDA_ARCH_LIST={cuda_arch_list_str}",
+ f"-DCUDA_CUDART_LIBRARY=/usr/local/cuda/lib64/libcudart.so",

Perhaps the reason things have changed is that the newer CMake removes FindCUDA?

CMake Warning (dev) at CMakeLists.txt:11 (find_package):
  Policy CMP0146 is not set: The FindCUDA module is removed.  Run "cmake
  --help-policy CMP0146" for policy details.  Use the cmake_policy command to
  set the policy and suppress this warning.

This warning is for project developers.  Use -Wno-dev to suppress it.
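
For reference, the usual replacement is FindCUDAToolkit (CMake >= 3.17), whose imported targets carry the full library paths themselves, so nothing like CUDA_CUDART_LIBRARY has to be guessed. A minimal sketch of that style, with illustrative project/target/file names; this is not NATTEN's actual config:

cmake_minimum_required(VERSION 3.18)
project(cudart_link_example LANGUAGES CXX)

# Modern replacement for the removed FindCUDA module.
find_package(CUDAToolkit REQUIRED)

add_library(example_ext SHARED example.cpp)
# CUDA::cudart is an imported target that already knows the absolute path
# to libcudart, so no -L/-lcudart flags need to be spelled out by hand.
target_link_libraries(example_ext PRIVATE CUDA::cudart)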

Anyway, passing in the CUDA_CUDART_LIBRARY option persuaded it to try compiling.

Unfortunately it looks like that wasn't what it wanted… linking failed at the end of all of that.

/home/birch/git/sdxl-play/venv-311/lib/python3.11/site-packages/cmake/data/bin/cmake -E cmake_link_script CMakeFiles/natten.dir/link.txt --verbose=1
/usr/bin/c++ -fPIC  -std=c++17 -shared -Wl,-soname,natten/libnatten.cpython-311-x86_64-linux-gnu.so -o natten/libnatten.cpython-311-x86_64-linux-gnu.so … -lcudart /usr/local/cuda/lib64/libcudart.so /usr/local/cuda/lib64/libnvToolsExt.so -lcudadevrt -lcudart_static -lrt -lpthread -ldl
/usr/bin/ld: cannot find -lcudart: No such file or directory
/usr/bin/ld: cannot find -lcudadevrt: No such file or directory
/usr/bin/ld: cannot find -lcudart_static: No such file or directory

Seems like a perfectly typical value for CUDA_CUDART_LIBRARY though, and the library certainly exists:

ls /usr/local/cuda/lib64/ | grep cudart
libcudart.so
libcudart.so.12
libcudart.so.12.2.53
libcudart_static.a

Any idea what I'm doing wrong? The errors don't seem rational…

Birch-san commented 1 month ago

I guess the reason CUDA_CUDART_LIBRARY was ineffective is that -lcudart appears in the libraries list in addition to /usr/local/cuda/lib64/libcudart.so.

Probably what I really need to do is tell it to add the library dir /usr/local/cuda/lib64 to the link path, so that it can find -lcudart -lcudadevrt -lcudart_static in that dir.

Just need to remember which cmake convention to use for that…
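
For the record, the two usual conventions are the old directory-scoped command and the newer per-target one; a sketch of both, with the toolkit path hard-coded purely for illustration:

# Directory-scoped: affects every target defined after this call.
link_directories(/usr/local/cuda/lib64)

# Per-target (CMake >= 3.13): only affects the named target.
target_link_directories(natten PUBLIC /usr/local/cuda/lib64)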

alihassanijr commented 1 month ago

Apologies for this; I dropped the ball on the 2.4 release; I'll build those wheels tonight.

I've always had a bad experience with FindCUDA, and unfortunately it's difficult to link with libtorch through cmake without including theirs, and that's when everything goes wrong. Every time I've figured out a way around it, it's been a hack, but somehow torch's docker images and NGC images aren't affected. So I don't think it's anything wrong with your environment, rather just FindCUDA being annoying as usual.

Also, if you know which version of the CUDA toolkit your local torch was compiled with, I can just build that binary first and post the link here -- building wheels takes a while now that 2.4 supports 3 different CTK versions and 5 Python versions (together that's 15 CUDA wheels and 5 CPU-only wheels).

Birch-san commented 1 month ago

no worries, there's always too much to be done!

I'm pretty much done for the night but I think my last idea might get it building locally.

For some reason the CXXFLAGS='-L/usr/local/cuda/lib64' env var didn't work, as in:

CXXFLAGS='-L/usr/local/cuda/lib64' CUDACXX=/usr/local/cuda/bin/nvcc NATTEN_CUDA_ARCH=8.9 NATTEN_VERBOSE=1 NATTEN_IS_BUILDING_DIST=1 NATTEN_WITH_CUDA=1 NATTEN_N_WORKERS=8 python setup.py bdist_wheel -d out/wheels/cu121/torch/240

and by "didn't work" I mean that it didn't introduce any -L/usr/local/cuda/lib64 option into:
build/lib.linux-x86_64-cpython-311/CMakeFiles/natten.dir/link.txt

so I modified csrc/CMakeLists.txt:

  if(${NATTEN_WITH_CUDA})
    target_link_libraries(natten PUBLIC c10 torch torch_cpu torch_python cudart c10_cuda torch_cuda)
+   message("Adding to target 'natten', link directory: ${CUDA_TOOLKIT_ROOT_DIR}/lib64")
+   target_link_directories(natten PUBLIC ${CUDA_TOOLKIT_ROOT_DIR}/lib64)

And this seems to have succeeded in adding a -L/usr/local/cuda/lib64 to natten.dir/link.txt.
Will see how it goes.
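
In hindsight, my guess about why the CXXFLAGS env var had no effect: CMake only uses CXXFLAGS to seed the compile flags (CMAKE_CXX_FLAGS), while link.txt is driven by the linker-flags variables (or the LDFLAGS env var on a fresh configure). So the knob to poke from the config side would presumably have been something like this untested sketch, rather than CXXFLAGS:

# Append the CUDA lib dir to the flags used when linking shared libraries;
# use CMAKE_MODULE_LINKER_FLAGS instead if the target is built as a MODULE library.
set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -L/usr/local/cuda/lib64")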

=====

if you know which version of CUDA toolkit your local torch was compiled with I can just build that binary first

Thanks! Is this it?

print(torch._C._cuda_getCompiledVersion())
12010

torch.version.cuda
'12.1'

torch.__version__
'2.4.0+cu121'

alihassanijr commented 1 month ago

Yeah, the FindCUDA module is a big pain; I've sometimes been successful in going around it but never wrote it down 😅.

Thanks! Is this it?

Yes perfect! I'll post that wheel here when it builds.

Birch-san commented 1 month ago

Ah! My local build succeeded. NATTEN is now working with torch 2.4.0. In the end, all I needed was that target_link_directories() patch. Wonder why.

alihassanijr commented 1 month ago

Oh nice; feel free to drop the diff here or even open a PR; I wouldn't rule out NATTEN's cmake config doing something wrong.

I guess if the actual issue was a linking error in the end, it makes sense; I originally thought FindCUDA was just blocking everything. Anyway, I'll try to redo the cmake config soon; I hacked it together one time last year when we made the switch and haven't looked at it since.