__hmin/__hmax already defined on compute_cap 75 with newer driver version

opfromthestart commented 1 year ago

When I try to compile dfdx with CUDA enabled, I get the following error:

  --- stderr
  thread 'main' panicked at 'nvcc error while compiling "src/optim/adam/adam.cu":

  # stdout

  # stderr
  src/tensor_ops/utilities/compatibility.cuh(9): error: function "__hmax" has already been defined
    __attribute__((device)) __inline__ __attribute__((always_inline)) __half __hmax(__half a, __half b) {
                                                                             ^

  src/tensor_ops/utilities/compatibility.cuh(12): error: function "__hmin" has already been defined
    __attribute__((device)) __inline__ __attribute__((always_inline)) __half __hmin(__half a, __half b) {
                                                                             ^

  2 errors detected in the compilation of "src/optim/adam/adam.cu".

My guess is that it's related to the fix for compute_cap 75 compatibility, which I think I had, but I updated my drivers, so now the functions are already defined.

opfromthestart commented 1 year ago

nvidia-smi --query-gpu compute_cap --format=csv gives compute_cap 7.5.

nvcc --version gives

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

nvcc --list-gpu-code gives

sm_50
sm_52
sm_53
sm_60
sm_61
sm_62
sm_70
sm_72
sm_75
sm_80
sm_86
sm_87
sm_89
sm_90
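
The two outputs above can be cross-checked mechanically. A minimal sketch, assuming the compute capability has already been reduced to an integer such as 75; the function name `nvcc_supports` is hypothetical and is not part of dfdx's build.rs:

```rust
/// Return true if `compute_cap` (e.g. 75) appears as `sm_75` in the
/// text printed by `nvcc --list-gpu-code`.
/// Hypothetical helper for illustration only.
fn nvcc_supports(compute_cap: u32, list_gpu_code: &str) -> bool {
    let wanted = format!("sm_{}", compute_cap);
    // Each supported target is printed on its own line.
    list_gpu_code.lines().any(|line| line.trim() == wanted)
}
```

For the list above, `nvcc_supports(75, ...)` would be true, so the toolkit can target this GPU and the failure lies elsewhere.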

coreylowman commented 1 year ago

Can you expand on what you mean by updated your drivers? I guess I had assumed all compute_caps of the same number would have similar issues, but you're still compiling with 75 and getting this error?

opfromthestart commented 1 year ago

My drivers were on version 525 and I had versions 11.6 and 12.1 of all cuda-related libraries. I installed version 530 of the drivers and removed the 11.6 versions of the libraries, and that made the llama-dfdx example work.

opfromthestart commented 1 year ago

When I try to use version 525 of the drivers, I get the following error:

Caused by:
  process didn't exit successfully: `/home/opfromthestart/rust/game/touhou-diff/target/release/build/dfdx-5455800ceba8656f/build-script-build` (exit status: 101)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rustc-cfg=feature="nightly"
  cargo:rustc-env=CUDA_INCLUDE_DIR=/usr/local/cuda/include
  cargo:rerun-if-changed=src/tensor_ops/utilities/binary_op_macros.cuh
  cargo:rerun-if-changed=src/tensor_ops/utilities/compatibility.cuh
  cargo:rerun-if-changed=src/tensor_ops/utilities/cuda_utils.cuh
  cargo:rerun-if-changed=src/tensor_ops/utilities/unary_op_macros.cuh

  --- stderr
  thread 'main' panicked at 'assertion failed: `(left == right)`
    left: `"Failed to initialize NVML: Driver/library version mismatch"`,
   right: `"compute_cap"`', /home/opfromthestart/.cargo/git/checkouts/dfdx-318e6e5ad83eea79/5e2b93d/build.rs:132:17

That is why I upgraded to 530.
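
The panic above comes from an assertion that the first line of the nvidia-smi output equals "compute_cap". A minimal sketch of a parser that returns a descriptive error instead, so that an NVML failure is reported as such; `parse_compute_cap` is a hypothetical name, and this is not the actual code in dfdx's build.rs:

```rust
/// Parse the CSV text printed by the nvidia-smi compute_cap query.
/// Returns e.g. 75 for a "7.5" row, or an error message that includes
/// whatever nvidia-smi printed instead of the expected header.
/// Hypothetical helper for illustration only.
fn parse_compute_cap(output: &str) -> Result<u32, String> {
    let mut lines = output.lines().map(str::trim).filter(|l| !l.is_empty());

    // The first non-empty line should be the CSV header.
    match lines.next() {
        Some("compute_cap") => {}
        Some(other) => {
            return Err(format!("nvidia-smi did not report compute_cap: {}", other))
        }
        None => return Err("nvidia-smi produced no output".to_string()),
    }

    // The next line holds the value for the first GPU, e.g. "7.5".
    let cap = lines
        .next()
        .ok_or_else(|| "missing compute_cap value".to_string())?;
    cap.replace('.', "")
        .parse::<u32>()
        .map_err(|e| format!("cannot parse compute_cap {:?}: {}", cap, e))
}
```

With this shape, the "Driver/library version mismatch" text would surface in the error message rather than as the left-hand side of a failed assertion.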

coreylowman commented 1 year ago

Ahh okay, so are you still having the original error then about hmin/hmax?

Maybe we should be hooking into driver versions instead of GPU_ARCH for the ifdefs? I wonder if that's available...

coreylowman commented 1 year ago

Hmm it seems like getting driver version is limited to runtime. 🤔
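
Although the driver version is only visible at runtime, the toolkit version that nvcc reports is available when build.rs runs. A minimal sketch, assuming the duplicate definitions come from the CUDA toolkit headers rather than from the driver (the thread does not confirm this). The names `parse_nvcc_release` and `needs_half_minmax_fallback` are hypothetical, and both thresholds are assumptions: the thread only shows that toolkit 12.1 already defines __hmin/__hmax.

```rust
/// Extract the (major, minor) toolkit version from `nvcc --version`
/// output by looking for the text "release X.Y".
/// Hypothetical helper for illustration only.
fn parse_nvcc_release(output: &str) -> Option<(u32, u32)> {
    // Text right after "release ", e.g. "12.1, V12.1.105".
    let rest = output.split("release ").nth(1)?;
    // Keep only the leading "12.1" part.
    let version = rest
        .split(|c: char| c == ',' || c.is_whitespace())
        .next()?;
    let mut parts = version.split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    let minor: u32 = parts.next()?.parse().ok()?;
    Some((major, minor))
}

/// Decide whether the fallback __hmin/__hmax definitions should be
/// compiled. ASSUMPTION: toolkits before 12.0 lack them on compute
/// capabilities below 8.0; neither threshold is confirmed in this thread.
fn needs_half_minmax_fallback(toolkit: (u32, u32), compute_cap: u32) -> bool {
    toolkit.0 < 12 && compute_cap < 80
}
```

A build script could use the result to decide whether to pass a preprocessor define to nvcc, which the ifdefs in compatibility.cuh would then check instead of the GPU architecture alone.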

9876691 commented 1 year ago

I also get this error. I set up a VS Code dev container with the following .devcontainer/devcontainer.json:

{
    "name": "Rust",
    "image": "nvidia/cuda:12.1.1-devel-ubuntu20.04", 

    "runArgs": [
        "--gpus",
        "all"
    ],
    "features": {
        "ghcr.io/devcontainers/features/rust:1": {}
    }
}

Running nvidia-smi in the container gives:

root@e5d2279e80a7:/workspaces/dfdx# nvidia-smi
Sat May 13 11:23:01 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:1D:00.0  On |                  N/A |
|  0%   39C    P0    N/A /  90W |   1273MiB /  4096MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Running cargo test --features "cuda" gives:

root@e5d2279e80a7:/workspaces/dfdx# cargo test --features "cuda"
   Compiling dfdx v0.11.2 (/workspaces/dfdx)
error: failed to run custom build command for `dfdx v0.11.2 (/workspaces/dfdx)`

Caused by:
  process didn't exit successfully: `/workspaces/dfdx/target/debug/build/dfdx-30e6be024c8b3335/build-script-build` (exit status: 101)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rustc-env=CUDA_INCLUDE_DIR=/usr/local/cuda/include
  cargo:rerun-if-changed=src/tensor_ops/utilities/binary_op_macros.cuh
  cargo:rerun-if-changed=src/tensor_ops/utilities/compatibility.cuh
  cargo:rerun-if-changed=src/tensor_ops/utilities/cuda_utils.cuh
  cargo:rerun-if-changed=src/tensor_ops/utilities/unary_op_macros.cuh
  cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP
  cargo:rustc-env=CUDA_COMPUTE_CAP=sm_61
  cargo:rerun-if-changed=src/optim/adam/adam.cu
  cargo:rerun-if-changed=src/optim/rmsprop/rmsprop.cu
  cargo:rerun-if-changed=src/optim/sgd/sgd.cu
  cargo:rerun-if-changed=src/tensor_ops/abs/abs.cu
  cargo:rerun-if-changed=src/tensor_ops/add/binary_add.cu
  cargo:rerun-if-changed=src/tensor_ops/add/scalar_add.cu
  cargo:rerun-if-changed=src/tensor_ops/attention_reshape/attention_reshape.cu
  cargo:rerun-if-changed=src/tensor_ops/axpy/axpy.cu
  cargo:rerun-if-changed=src/tensor_ops/bce/bce.cu
  cargo:rerun-if-changed=src/tensor_ops/boolean/boolean.cu
  cargo:rerun-if-changed=src/tensor_ops/choose/choose.cu
  cargo:rerun-if-changed=src/tensor_ops/clamp/clamp.cu
  cargo:rerun-if-changed=src/tensor_ops/cmp/cmp.cu
  cargo:rerun-if-changed=src/tensor_ops/conv2d/conv2d.cu
  cargo:rerun-if-changed=src/tensor_ops/convtrans2d/convtrans2d.cu
  cargo:rerun-if-changed=src/tensor_ops/cos/cos.cu
  cargo:rerun-if-changed=src/tensor_ops/div/binary_div.cu
  cargo:rerun-if-changed=src/tensor_ops/div/scalar_div.cu
  cargo:rerun-if-changed=src/tensor_ops/dropout/dropout.cu
  cargo:rerun-if-changed=src/tensor_ops/exp/exp.cu
  cargo:rerun-if-changed=src/tensor_ops/gelu/gelu.cu
  cargo:rerun-if-changed=src/tensor_ops/huber_error/huber_error.cu
  cargo:rerun-if-changed=src/tensor_ops/ln/ln.cu
  cargo:rerun-if-changed=src/tensor_ops/max_to/max_to.cu
  cargo:rerun-if-changed=src/tensor_ops/maximum/maximum.cu
  cargo:rerun-if-changed=src/tensor_ops/min_to/min_to.cu
  cargo:rerun-if-changed=src/tensor_ops/minimum/minimum.cu
  cargo:rerun-if-changed=src/tensor_ops/mul/binary_mul.cu
  cargo:rerun-if-changed=src/tensor_ops/mul/scalar_mul.cu
  cargo:rerun-if-changed=src/tensor_ops/nans_to/nans_to.cu
  cargo:rerun-if-changed=src/tensor_ops/negate/negate.cu
  cargo:rerun-if-changed=src/tensor_ops/pool2d/pool2d.cu
  cargo:rerun-if-changed=src/tensor_ops/pow/pow.cu
  cargo:rerun-if-changed=src/tensor_ops/recip/recip.cu
  cargo:rerun-if-changed=src/tensor_ops/relu/relu.cu
  cargo:rerun-if-changed=src/tensor_ops/roll/roll.cu
  cargo:rerun-if-changed=src/tensor_ops/select_and_gather/gather.cu
  cargo:rerun-if-changed=src/tensor_ops/select_and_gather/select.cu
  cargo:rerun-if-changed=src/tensor_ops/sigmoid/sigmoid.cu
  cargo:rerun-if-changed=src/tensor_ops/sin/sin.cu
  cargo:rerun-if-changed=src/tensor_ops/slice/slice.cu
  cargo:rerun-if-changed=src/tensor_ops/sqrt/sqrt.cu
  cargo:rerun-if-changed=src/tensor_ops/square/square.cu
  cargo:rerun-if-changed=src/tensor_ops/sub/binary_sub.cu
  cargo:rerun-if-changed=src/tensor_ops/sub/scalar_sub.cu
  cargo:rerun-if-changed=src/tensor_ops/sum_to/sum_to.cu
  cargo:rerun-if-changed=src/tensor_ops/tanh/tanh.cu
  cargo:rerun-if-changed=src/tensor_ops/upscale2d/upscale2d.cu

  --- stderr
  thread 'main' panicked at 'nvcc error while compiling "src/optim/adam/adam.cu":

  # stdout

  # stderr
  src/tensor_ops/utilities/compatibility.cuh(9): error: function "__hmax" has already been defined
    __attribute__((device)) __inline__ __attribute__((always_inline)) __half __hmax(__half a, __half b) {
                                                                             ^

  src/tensor_ops/utilities/compatibility.cuh(12): error: function "__hmin" has already been defined
    __attribute__((device)) __inline__ __attribute__((always_inline)) __half __hmin(__half a, __half b) {
                                                                             ^

  2 errors detected in the compilation of "src/optim/adam/adam.cu".
  ', build.rs:197:17
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

nvcc --version

root@e5d2279e80a7:/workspaces/dfdx# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

coreylowman / dfdx

hmin/hmax already defined on compute_cap 75 with newer driver version #762
