NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[BUG]: Build errors from tensormap_replace.h due to `NV_IF_ELSE_TARGET(NV_HAS_FEATURE_SM_90a, ...` #1571

Closed by lw 5 months ago

lw commented 6 months ago

Type of Bug

Compile-time Error

Component

libcu++

Describe the bug

When using cccl from the main branch, I'm hitting compile errors such as these:

    /home/lcw/micromamba/envs/myenv/include/cuda/std/detail/libcxx/include/__cuda/ptx/instructions/tensormap_replace.h(56): error: expected an expression
        { _NV_IF__NV_TARGET_BOOL_NV_HAS_FEATURE_SM_90a(( asm ( "tensormap.replace.tile.global_address.global.b1024.b64    [%0], %1;" : : "l"(__as_ptr_gmem(__tm_addr)), "l"(__as_b64(__new_val)) : "memory" ); ), ( __cuda_ptx_tensormap_replace_global_address_is_not_supported_before_SM_90a__(); )) }
                                                         ^

    /home/lcw/micromamba/envs/myenv/include/cuda/std/detail/libcxx/include/__cuda/ptx/instructions/tensormap_replace.h(56): error: expected a ")"
        { _NV_IF__NV_TARGET_BOOL_NV_HAS_FEATURE_SM_90a(( asm ( "tensormap.replace.tile.global_address.global.b1024.b64    [%0], %1;" : : "l"(__as_ptr_gmem(__tm_addr)), "l"(__as_b64(__new_val)) : "memory" ); ), ( __cuda_ptx_tensormap_replace_global_address_is_not_supported_before_SM_90a__(); )) }
                                                                                                                                                                                                             ^

    /home/lcw/micromamba/envs/myenv/include/cuda/std/detail/libcxx/include/__cuda/ptx/instructions/tensormap_replace.h(56): error: identifier "_NV_IF__NV_TARGET_BOOL_NV_HAS_FEATURE_SM_90a" is undefined
        { _NV_IF__NV_TARGET_BOOL_NV_HAS_FEATURE_SM_90a(( asm ( "tensormap.replace.tile.global_address.global.b1024.b64    [%0], %1;" : : "l"(__as_ptr_gmem(__tm_addr)), "l"(__as_b64(__new_val)) : "memory" ); ), ( __cuda_ptx_tensormap_replace_global_address_is_not_supported_before_SM_90a__(); )) }
          ^

    /home/lcw/micromamba/envs/myenv/include/cuda/std/detail/libcxx/include/__cuda/ptx/instructions/tensormap_replace.h(56): error: expected an expression
        { _NV_IF__NV_TARGET_BOOL_NV_HAS_FEATURE_SM_90a(( asm ( "tensormap.replace.tile.global_address.global.b1024.b64    [%0], %1;" : : "l"(__as_ptr_gmem(__tm_addr)), "l"(__as_b64(__new_val)) : "memory" ); ), ( __cuda_ptx_tensormap_replace_global_address_is_not_supported_before_SM_90a__(); )) }
                                                                                                                                                                                                               ^

    /home/lcw/micromamba/envs/myenv/include/cuda/std/detail/libcxx/include/__cuda/ptx/instructions/tensormap_replace.h(56): error: expected a ";"
        { _NV_IF__NV_TARGET_BOOL_NV_HAS_FEATURE_SM_90a(( asm ( "tensormap.replace.tile.global_address.global.b1024.b64    [%0], %1;" : : "l"(__as_ptr_gmem(__tm_addr)), "l"(__as_b64(__new_val)) : "memory" ); ), ( __cuda_ptx_tensormap_replace_global_address_is_not_supported_before_SM_90a__(); )) }
                                                                                                                                                                                                                                                                                                       ^

How to Reproduce

  1. Copy the include directory of libcudacxx (main branch) into my conda env
  2. Write a file that has #include <cuda/ptx> (a minimal sketch follows this list)
  3. Build it using a command like:
    nvcc --generate-dependencies-with-compile --dependency-output build/my_kernel.o.d -I/home/lcw/micromamba/envs/myenv/lib/python3.10/site-packages/torch/include -I/home/lcw/micromamba/envs/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/lcw/micromamba/envs/myenv/lib/python3.10/site-packages/torch/include/TH -I/home/lcw/micromamba/envs/myenv/lib/python3.10/site-packages/torch/include/THC -I/home/lcw/micromamba/envs/myenv/include -I/data/home/lcw/micromamba/envs/dino2/include/python3.10 -c -c my_kernel.cu -o build/my_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -lineinfo -std=c++20 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --ptxas-options=-v -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=my_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_90a,code=sm_90a -ccbin /home/lcw/micromamba/envs/myenv/bin/x86_64-conda-linux-gnu-cc
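
For reference, a minimal file for step 2 might look like the sketch below. The file name and kernel are placeholders; since the reported errors are parse errors inside the header itself, including <cuda/ptx> should be enough to trigger them when compiling for sm_90a.

    // my_kernel.cu -- hypothetical minimal reproducer.
    // Including <cuda/ptx> parses tensormap_replace.h, which is where the
    // errors above originate; no PTX function needs to be called.
    #include <cuda/ptx>

    __global__ void my_kernel() {}

    int main()
    {
        my_kernel<<<1, 1>>>();
        return cudaDeviceSynchronize() != cudaSuccess;
    }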

Expected behavior

It builds.

Reproduction link

No response

Operating System

Ubuntu 20.04.1

nvidia-smi output

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:53:00.0 Off |                    0 |
| N/A   25C    P0              66W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:64:00.0 Off |                    0 |
| N/A   26C    P0              66W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:75:00.0 Off |                    0 |
| N/A   26C    P0              65W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:86:00.0 Off |                    0 |
| N/A   27C    P0              65W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:97:00.0 Off |                    0 |
| N/A   27C    P0              66W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:A8:00.0 Off |                    0 |
| N/A   25C    P0              68W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:B9:00.0 Off |                    0 |
| N/A   25C    P0              64W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:CA:00.0 Off |                    0 |
| N/A   25C    P0              65W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

NVCC version

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

$ x86_64-conda-linux-gnu-cc --version
x86_64-conda-linux-gnu-cc (conda-forge gcc 12.3.0-5) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

miscco commented 6 months ago

Oh, I believe you are seeing a mismatch between the installed driver and the toolkit you are using.

Looking at your nvidia-smi output, you are running with:

CUDA Version: 12.2

However, you are compiling with nvcc 12.3:

Cuda compilation tools, release 12.3, V12.3.107

Now the issue is that we enable features based on what the compiler supports, which is the only information we have at compile time. In this case, the 12.2 driver only supports PTX ISA 8.2, which does not include those instructions.

However, your toolkit provides PTX ISA 8.3, which does include them, and you are compiling for a target that could also use them.

Long story short, I believe you need to update the driver on that machine.
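
To illustrate the kind of compile-time gating involved (a hedged sketch, not the library's actual code, and the MY_* macro is hypothetical): the headers can only consult compiler-defined macros such as __CUDACC_VER_MAJOR__ and __CUDACC_VER_MINOR__, because the driver's PTX ISA support is unknowable at compile time.

    // Hypothetical sketch of gating on the toolkit's PTX ISA level.
    // nvcc 12.3 understands PTX ISA 8.3, so this guard passes even though
    // an installed 12.2 driver can only load PTX ISA 8.2 at run time.
    #if (__CUDACC_VER_MAJOR__ > 12) || \
        (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 3)
    #  define MY_PTX_ISA_830_AVAILABLE 1
    #else
    #  define MY_PTX_ISA_830_AVAILABLE 0
    #endif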

miscco commented 6 months ago

@lw Also, to be sure: can you verify that you properly added the nv subfolder from the libcudacxx folder to the include path?

We did add changes to the <nv/target> header that need to be included as well.
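
For context, the _NV_IF__NV_TARGET_BOOL_NV_HAS_FEATURE_SM_90a identifier in the error output appears to be what NV_IF_ELSE_TARGET pastes together when the NV_HAS_FEATURE_SM_90a query macro is missing from a stale <nv/target>. With matching headers on the include path, the dispatch pattern expands cleanly; a minimal sketch (the function and variable are illustrative, not library code):

    // Sketch of the <nv/target> dispatch used by tensormap_replace.h.
    // NV_IF_ELSE_TARGET keeps exactly one branch, chosen at compile time
    // from the target architecture nvcc is currently compiling for.
    #include <nv/target>

    __device__ int is_hopper_a_target()
    {
        int supported;
        NV_IF_ELSE_TARGET(NV_HAS_FEATURE_SM_90a,
            ( supported = 1; ),   // compiled only for sm_90a
            ( supported = 0; ));  // compiled for every other target
        return supported;
    }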

jrhemstad commented 6 months ago

Driver version isn't the issue here. It's as @miscco said: there are likely mismatched versions of the nv/ and cuda/ headers, because when I try a simpler reproducer with the entire CCCL library, it works just fine: https://godbolt.org/z/sd9PehKWG

@lw CCCL components are not independent, so vendoring just the libcudacxx/include/cuda headers into your conda environment won't work.

ahendriksen commented 5 months ago

@lw Can we close this bug?

lw commented 5 months ago

I tried again, making sure to copy both the cuda and nv subdirectories, and everything seems to work. Closing.