bazel-contrib / rules_cuda

Starlark implementation of bazel rules for CUDA.
https://bazel-contrib.github.io/rules_cuda/
MIT License

[BUG] linking failed using custom cc_toolchain with platform_cpu, cannot detect multiplatform cuda library. #265

Open ZhenshengLee opened 1 month ago

ZhenshengLee commented 1 month ago

brief

NOTE: On the default platform, which uses the x86_64 (k8) toolchain, both compiling and linking work. Is this a bug, or a misconfiguration on my side when using this repo?

environment

bazel: version 7.0.2
cc_toolchain: //bazel/toolchains/v5l (a custom cc toolchain for cross-compiling to aarch64, similar to https://github.com/f0rmiga/gcc-toolchain/blob/main/toolchain/cc_toolchain_config.bzl)

├── toolchains
│   └── v5l
│       ├── BUILD
│       ├── v5l.BUILD
│       └── v5l_cc_toolchain_config.bzl

repro steps

Simply build the basic example with cu_library; the build reports the following errors. NOTE: on the default platform, which uses the x86_64 (k8) toolchain, the build works.

cc_binary(
    name = "module_cuda_main",
    srcs = ["tool/module_cuda_main.cpp"],
    includes = ["include"],
    tags = ["tool"],
    visibility = ["//main:__pkg__"],
    deps = [
        ":module_cuda",
    ],
)

(03:11:14) INFO: Current date is 2024-08-08
(03:11:14) INFO: Analyzed 323 targets (0 packages loaded, 21 targets configured).
(03:11:14) ERROR: /gw_demo/modules/team_demo/module_demo/BUILD:57:15: Linking modules/team_demo/module_demo/module_cuda_main failed: (Exit 1): aarch64-buildroot-linux-gnu-gcc failed: error executing CppLink command (from target //modules/team_demo/module_demo:module_cuda_main) 
  (cd /home/zs/.cache/bazel/_bazel_zs/2c098eac6c684e1fabebb74f5f4483bd/execroot/gaos && \
  exec env - \
    LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/lib:/usr/lib/x86_64-linux-gnu:/opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0:/opt/ros/humble/opt/rviz_ogre_vendor/lib:/opt/ros/humble/lib/x86_64-linux-gnu:/opt/ros/humble/lib \
    PATH=/usr/local/cuda/bin:/opt/rti.com/rti_connext_dds-6.0.1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/ros/humble/bin \
    PWD=/proc/self/cwd \
  external/v5l_cc_toolchain/bin/aarch64-buildroot-linux-gnu-gcc -o bazel-out/aarch64-dbg/bin/modules/team_demo/module_demo/module_cuda_main -Xlinker -rpath -Xlinker '$ORIGIN/../../../_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so___Ucuda_Slib64' -Xlinker -rpath -Xlinker '$ORIGIN/module_cuda_main.runfiles/gaos/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so___Ucuda_Slib64' -Xlinker -rpath -Xlinker '$ORIGIN/../../../_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.0___Ucuda_Slib64' -Xlinker -rpath -Xlinker '$ORIGIN/module_cuda_main.runfiles/gaos/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.0___Ucuda_Slib64' -Xlinker -rpath -Xlinker '$ORIGIN/../../../_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.4.409___Ucuda_Slib64' -Xlinker -rpath -Xlinker '$ORIGIN/module_cuda_main.runfiles/gaos/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.4.409___Ucuda_Slib64' -Xlinker -rpath -Xlinker '$ORIGIN/../../../_solib_aarch64-buildroot-linux-gnu/_U@@cuda_Uaarch64_Ulinux_S_S_Ccudart___Ulib' -Xlinker -rpath -Xlinker '$ORIGIN/module_cuda_main.runfiles/gaos/_solib_aarch64-buildroot-linux-gnu/_U@@cuda_Uaarch64_Ulinux_S_S_Ccudart___Ulib' -Xlinker -rpath -Xlinker '$ORIGIN/../../../_solib_aarch64-buildroot-linux-gnu/_U@@cuda_Uaarch64_Ulinux_S_S_Ccuda___Ulib_Sstubs' -Xlinker -rpath -Xlinker '$ORIGIN/module_cuda_main.runfiles/gaos/_solib_aarch64-buildroot-linux-gnu/_U@@cuda_Uaarch64_Ulinux_S_S_Ccuda___Ulib_Sstubs' -Lbazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so___Ucuda_Slib64 -Lbazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.0___Ucuda_Slib64 -Lbazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.4.409___Ucuda_Slib64 
-Lbazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@cuda_Uaarch64_Ulinux_S_S_Ccudart___Ulib -Lbazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@cuda_Uaarch64_Ulinux_S_S_Ccuda___Ulib_Sstubs bazel-out/aarch64-dbg/bin/modules/team_demo/module_demo/_objs/module_cuda_main/module_cuda_main.pic.o bazel-out/aarch64-dbg/bin/modules/team_demo/module_demo/libmodule_cuda.a external/local_cuda/cuda/lib64/libcudadevrt.a -lcudart -l:libcudart.so.11.0 -l:libcudart.so.11.4.409 -lcudart -lcuda -pie -ldl -lpthread -lrt -Wl,-rpath,lib/ -L/drive/drive-linux/lib-target/ -L/drive/drive-linux/filesystem/targetfs/usr/lib/aarch64-linux-gnu/ -Wl,-rpath-link,/drive/drive-linux/filesystem/targetfs/usr/lib/aarch64-linux-gnu/ -Wl,-rpath-link,/drive/drive-linux/lib-target/ -Wl,-rpath-link,/usr/lib/aarch64-linux-gnu -Wl,-rpath-link,/usr/aarch64-linux-gnu -Wl,-rpath-link,/drive/drive-linux/filesystem/targetfs/lib/aarch64-linux-gnu -lgcov -lstdc++ -no-canonical-prefixes)
# Configuration: 93bfd7653555f545157f5fbb9812135069a379b953233ab0eef19c8f88c3340d
# Execution platform: @@local_config_platform//:host
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: skipping incompatible bazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so___Ucuda_Slib64/libcudart.so when searching for -lcudart
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: skipping incompatible bazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.0___Ucuda_Slib64/libcudart.so.11.0 when searching for -l:libcudart.so.11.0
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: skipping incompatible bazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.0___Ucuda_Slib64/libcudart.so.11.0 when searching for -l:libcudart.so.11.0
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: cannot find -l:libcudart.so.11.0
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: skipping incompatible bazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.4.409___Ucuda_Slib64/libcudart.so.11.4.409 when searching for -l:libcudart.so.11.4.409
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: skipping incompatible bazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.4.409___Ucuda_Slib64/libcudart.so.11.4.409 when searching for -l:libcudart.so.11.4.409
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: cannot find -l:libcudart.so.11.4.409
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: skipping incompatible bazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so___Ucuda_Slib64/libcudart.so when searching for -lcudart
collect2: error: ld returned 1 exit status
(03:11:14) INFO: Elapsed time: 0.602s, Critical Path: 0.08s
(03:11:14) INFO: 2 processes: 2 internal.
(03:11:14) ERROR: Build did NOT complete successfully

considerations

skipping incompatible bazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.0___Ucuda_Slib64/libcudart.so.11.0 when searching for -l:libcudart.so.11.0

This means the linker is still being given the x86_64 CUDA libraries from /usr/local/cuda/lib64.

In practice, the CUDA libraries may be installed in other directories and may come in several architecture variants, especially on NVIDIA AGX machines:

  • CUDA: Should be installed at /usr/local, CUDA for various platforms should be in the target directory of /usr/local/cuda-X
    • e.g. aarch64-linux CUDA 10.1 should be located at /usr/local/cuda-10.1/targets/aarch64-linux
  • CUDA-X DL Libs (i.e. TensorRT and cuDNN): Should be located at /usr/local/cuda-X/dl/targets/<PLATFORM>/{include, lib}
  • Other system dependencies: Dependencies should be located in /usr/local/{include, lib} for x86_64, /usr/aarch64-linux-gnu/ for aarch64-linux and /usr/aarch64-unknown-nto-qnx/aarch64le for aarch64-qnx

https://github.com/NVIDIA/DL4AGX/blob/9a4f60c2847d32e81372b9a2165299a3b65eabf1/CONTRIBUTING.md?plain=1#L201-L205
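
As a rough illustration of the layout quoted above, the per-target directory can be derived from the CUDA version and the target name. This is only a sketch: the file name and both function names are hypothetical, and the `/lib` subdirectory is an assumption taken from the workaround path rather than from the quoted guide. The code is written as a Starlark `.bzl` helper and is also valid Python.

```python
# cuda_paths.bzl (hypothetical file name)

def cuda_target_dir(cuda_version, target):
    """Per-target toolkit directory, following the /usr/local/cuda-X/targets/<target> layout."""
    return "/usr/local/cuda-{}/targets/{}".format(cuda_version, target)

def cuda_target_linkopts(cuda_version, target):
    """Linker search-path flag for the target's library directory (assumed to be <target dir>/lib)."""
    return ["-L" + cuda_target_dir(cuda_version, target) + "/lib"]
```

For example, `cuda_target_dir("10.1", "aarch64-linux")` gives `/usr/local/cuda-10.1/targets/aarch64-linux`, matching the example in the quoted guide.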

related info

There is an older CUDA toolchain config that supports multiplatform_cpu compilation in Bazel, but it relies on CROSSTOOL, which is outdated and not available in the latest version of Bazel.

https://github.com/NVIDIA/DL4AGX/tree/master

EDIT: There is already an issue about resolving multiple versions of the CUDA libraries, but I don't think that issue is resolved by design: https://github.com/bazel-contrib/rules_cuda/issues/113

workaround (works)

Adding the library path manually lets the binary link successfully.

linkopts = [
        "-L/usr/local/cuda/targets/aarch64-linux/lib",
    ],
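
To keep the default x86_64 build unaffected, the workaround above could be guarded with select(). The following is a minimal sketch, not something rules_cuda provides: it assumes the aarch64 build is chosen with --platforms and that the target platform carries the @platforms//cpu:aarch64 constraint, and the config_setting name is hypothetical.

```starlark
# BUILD (sketch). Assumes the target platform is selected with --platforms
# and includes the @platforms//cpu:aarch64 constraint value.
config_setting(
    name = "target_aarch64",  # hypothetical name
    constraint_values = ["@platforms//cpu:aarch64"],
)

cc_binary(
    name = "module_cuda_main",
    srcs = ["tool/module_cuda_main.cpp"],
    includes = ["include"],
    # Add the aarch64 CUDA library directory only for aarch64 targets.
    linkopts = select({
        ":target_aarch64": ["-L/usr/local/cuda/targets/aarch64-linux/lib"],
        "//conditions:default": [],
    }),
    deps = [":module_cuda"],
)
```

If the build is switched with --cpu instead of --platforms, a config_setting using `values = {"cpu": "aarch64"}` would be the analogous condition.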
ZhenshengLee commented 1 month ago

I've found that in the doc page

rules_cuda_dependencies(toolkit_path)

Populate the dependencies for rules_cuda. This will setup workspace dependencies (other bazel rules) and local toolchains.

  • toolkit_path: Optionally specify the path to CUDA toolkit. If not specified, it will be detected automatically.

Is there an example to show how to use it correctly?
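
For reference, a call with an explicit toolkit_path might look like the fragment below. This is a sketch under assumptions: the load path is taken from the rules_cuda repository layout and should be checked against the version in use, and the toolkit location is a hypothetical example.

```starlark
# WORKSPACE (fragment). The load path is an assumption; verify it against
# the rules_cuda version you depend on.
load("@rules_cuda//cuda:repositories.bzl", "rules_cuda_dependencies")

# Hypothetical toolkit root; replace with your own installation path.
rules_cuda_dependencies(toolkit_path = "/usr/local/cuda-11.4")
```

This only sets the toolkit root. It would not by itself be expected to pick a per-target library directory such as targets/aarch64-linux/lib.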

cloudhan commented 1 month ago

I think you are configuring it correctly. The root problem is the cross compiling is not addressed in this rule at the moment. So exec_compatible_with for the tools and target_compatible_with for the runtime are assumed to be the same. They are not enforced, though, so it can be worked around.

ZhenshengLee commented 1 month ago

I think you are configuring it correctly. The root problem is the cross compiling is not addressed in this rule at the moment.

OK, I will keep the issue open.