bazel-contrib / rules_cuda

Starlark implementation of bazel rules for CUDA.
https://bazel-contrib.github.io/rules_cuda/
MIT License
90 stars 43 forks source link

Using custom thrust repo #105

Open garymm opened 1 year ago

garymm commented 1 year ago

I'm trying to use the latest Thrust release and running into issues. I can reproduce this in the examples of this repo. See here: https://github.com/garymm/rules_cuda/tree/0b47488/examples

It seems that for some reason the CUDA toolkit's thrust is used, even though the -isystem for my custom thrust comes first on the nvcc command.

To reproduce:

# First, install any CUDA toolkit at or less than version 12.1
# Then
git clone https://github.com/garymm/rules_cuda.git
cd rules_cuda
git switch garymm/custom-thrust
cd examples
bazel build -s //thrust:version_test

produces:

➜  bazel build -s //thrust:version_test                
SUBCOMMAND: # //thrust:version_test_cu [action 'Compiling thrust/version_test.cu', configuration: c92f7800966ee88d4403d82e9237650892518949abe87c60f2a9a688842ef3b8, execution platform: @local_config_platform//:host]
(cd /home/garymm/.cache/bazel/_bazel_garymm/40b8e895884343ff3f99cfc29b0e57bd/execroot/rules_cuda_examples && \
  exec env - \
    PATH=/usr/bin \
  /usr/local/cuda/bin/nvcc -x cu -Xcompiler -fPIC -ccbin /usr/bin/gcc -I . -I bazel-out/k8-fastbuild/bin -I external/thrust -I bazel-out/k8-fastbuild/bin/external/thrust -I external/cub -I bazel-out/k8-fastbuild/bin/external/cub -I external/local_cuda -I bazel-out/k8-fastbuild/bin/external/local_cuda -isystem external/thrust -isystem bazel-out/k8-fastbuild/bin/external/thrust -isystem external/cub -isystem bazel-out/k8-fastbuild/bin/external/cub -isystem external/local_cuda/cuda/include -isystem bazel-out/k8-fastbuild/bin/external/local_cuda/cuda/include -Xcompiler -g1 -c thrust/version_test.cu -o bazel-out/k8-fastbuild/bin/thrust/_objs/version_test_cu/version_test.pic.o --expt-relaxed-constexpr --expt-extended-lambda)
# Configuration: c92f7800966ee88d4403d82e9237650892518949abe87c60f2a9a688842ef3b8
# Execution platform: @local_config_platform//:host
ERROR: /home/garymm/src/bazel-contrib/rules_cuda/examples/thrust/BUILD.bazel:17:13: Compiling thrust/version_test.cu failed: (Exit 2): nvcc failed: error executing command (from target //thrust:version_test_cu) /usr/local/cuda/bin/nvcc -x cu -Xcompiler -fPIC -ccbin /usr/bin/gcc -I . -I bazel-out/k8-fastbuild/bin -I external/thrust -I bazel-out/k8-fastbuild/bin/external/thrust -I external/cub -I ... (remaining 25 arguments skipped)

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
thrust/version_test.cu(4): error: static assertion failed
  static_assert((200001 / 100 % 1000) == 1);```

This is on Ubuntu, with CUDA 12.1 installed here:

➜  ls /usr/local/cuda*
/usr/local/cuda@  /usr/local/cuda-12@

/usr/local/cuda-12.1
cloudhan commented 1 year ago

See https://github.com/bazel-contrib/rules_cuda/pull/38#issuecomment-1303892768.

Won't fix if you want to upgrade the bundled version to newer version. In that case, you need to roll out your own thrust target with repository rule.

cloudhan commented 1 year ago

You can refer to https://github.com/bazel-contrib/rules_cuda/blob/cf17ea98fc73314d18cd051a848e7c9df0a97325/third_party/thrust.BUILD as an example, without forking this repo.

cloudhan commented 1 year ago

The CHAR_MIN issue seems to come from missing climits header. https://en.cppreference.com/w/cpp/types/climits

cloudhan commented 1 year ago

The root cause is https://github.com/garymm/rules_cuda/blob/d66063672a7e3ea397f40a841b2411e0e4d1ef11/examples/WORKSPACE.bazel#LL33C11-L33C11 in your code. And this file screwed everything https://github.com/NVIDIA/thrust/blob/main/thrust/limits.h.

You owe me a cup of coffee for debugging this. 😈

garymm commented 1 year ago

Removing includes does indeed fix it. I'm a little confused as to why users of the library are able to use system-include syntax without it though. I.e. #include <thrust/device_vector.h>. I thought the angle-bracket includes required -isystem and -isystem is only set if the cc_library has includes set?

cloudhan commented 1 year ago

Because there is <cuda_tookit_root>/include/thrust. The system include path of <cuda_tookit_root>/include is transitively added due to deps chain @thrust -> @cub -> @rules_cuda//:cuda -> @rules_cuda//:cuda_headers

garymm commented 1 year ago

If I change it to includes = [""], which I think is correct, the toolchain's local_cuda includes clobber my custom thrust's includes.

You can see the behavior here: https://github.com/garymm/rules_cuda/tree/3b973fa/examples

So I think @rules_cuda//:cuda_headers should not include the thrust and cub headers. Would you be open to a PR removing them from the cuda_headers target?

cloudhan commented 1 year ago

Reasonable. Would you like to contribute a PR for this? All you need to change is exclude thrust and cub from the glob and add two seperate libraries for cub and thrust.

https://github.com/bazel-contrib/rules_cuda/blob/00a1c59013464bcb3f643a4751b3859428c29659/cuda/runtime/BUILD.local_cuda#L20-L30

garymm commented 1 year ago

I just tried that and it doesn't work for some reason... will investigate.

cloudhan commented 1 year ago

OK, this will be tricky because the cuda toolkit include will be symlinked by bazel into the sandbox. Then all sub directory will be accessible...

garymm commented 1 year ago

I updated the issue description with the latest status. I'm really not sure what's going wrong. I tried using the -M and -H options for the compiler, and I also tried passing --generate-dependencies-with-compile to nvcc, and in all cases it seems to suggest that my custom thrust's header should be the on that's included, but the actually compiled code uses the CUDA toolkit bundled version.

When I run the exact command printed out by bazel build -s, it seems to work. So maybe there is some environment variable or something that is set that is causing the include directory of the toolkit to be searched in a different order than what is suggested on the command line?

garymm commented 1 year ago

Hmm, indeed there's something about the sandbox that changes the behavior. Building with --sandbox_strategy=local works.

garymm commented 1 year ago

And for some reason removing includes = ["."] from the thrust and cub rules works, even though the resulting change in the nvcc command seems like the opposite of what I would expect to want.

garymm commented 1 year ago

I can contribute an example that uses custom thrust and cub with no includes = line and then close this, but I'm a bit at a loss as to why the includes breaks things. If you have some idea please LMK. Or LMK if you want me to just contribute the example and then close this.