opfromthestart closed this 1 year ago
nvidia-smi --query-gpu compute_cap --format=csv
gives compute_cap 7.5
nvcc --version
gives
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
nvcc --list-gpu-code
gives
sm_50
sm_52
sm_53
sm_60
sm_61
sm_62
sm_70
sm_72
sm_75
sm_80
sm_86
sm_87
sm_89
sm_90
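The two outputs above can be cross-checked: the compute capability reported by nvidia-smi (7.5) should map to a code name (sm_75) that appears in the nvcc --list-gpu-code list. Here is a minimal Rust sketch of that check. The function names sm_code and nvcc_supports are hypothetical and are not taken from dfdx's build.rs.

```rust
/// Turns a compute capability such as "7.5" into the matching
/// nvcc code name "sm_75". Returns None if the input is not "<major>.<minor>".
/// (Hypothetical helper, for illustration only.)
fn sm_code(compute_cap: &str) -> Option<String> {
    let (major, minor) = compute_cap.trim().split_once('.')?;
    let major: u32 = major.parse().ok()?;
    let minor: u32 = minor.parse().ok()?;
    Some(format!("sm_{major}{minor}"))
}

/// Returns true if `code` appears as its own line in the text printed by
/// `nvcc --list-gpu-code`. (Hypothetical helper, for illustration only.)
fn nvcc_supports(list_gpu_code_output: &str, code: &str) -> bool {
    list_gpu_code_output.lines().any(|line| line.trim() == code)
}
```

With the values posted above, sm_code("7.5") yields sm_75, which is present in the list, so the toolkit itself can target this GPU.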
Can you expand on what you mean by updated your drivers? I guess I had assumed all compute_caps of the same number would have similar issues, but you're still compiling with 75 and getting this error?
My drivers were on version 525, and I had both the 11.6 and 12.1 versions of all CUDA-related libraries installed. I installed version 530 of the drivers and removed the 11.6 libraries, and that made the llama-dfdx example work.
When I try to use the 525 version of the drivers, I get the following error:
Caused by:
process didn't exit successfully: `/home/opfromthestart/rust/game/touhou-diff/target/release/build/dfdx-5455800ceba8656f/build-script-build` (exit status: 101)
--- stdout
cargo:rerun-if-changed=build.rs
cargo:rustc-cfg=feature="nightly"
cargo:rustc-env=CUDA_INCLUDE_DIR=/usr/local/cuda/include
cargo:rerun-if-changed=src/tensor_ops/utilities/binary_op_macros.cuh
cargo:rerun-if-changed=src/tensor_ops/utilities/compatibility.cuh
cargo:rerun-if-changed=src/tensor_ops/utilities/cuda_utils.cuh
cargo:rerun-if-changed=src/tensor_ops/utilities/unary_op_macros.cuh
--- stderr
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `"Failed to initialize NVML: Driver/library version mismatch"`,
right: `"compute_cap"`', /home/opfromthestart/.cargo/git/checkouts/dfdx-318e6e5ad83eea79/5e2b93d/build.rs:132:17
That is why I upgraded to 530.
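The panic above comes from an assertion that expects the first line of the nvidia-smi output to be the compute_cap header, so an NVML failure surfaces as a confusing left/right mismatch. As a rough sketch only (this is not dfdx's actual build.rs, and parse_compute_cap is a hypothetical name), the same parsing could return an error that carries the nvidia-smi message:

```rust
/// Parses the text printed by `nvidia-smi --query-gpu=compute_cap --format=csv`.
/// The expected shape is a header line "compute_cap" followed by a value
/// line such as "7.5". Returns the capability as an integer (75), or an
/// error containing whatever nvidia-smi printed instead.
/// (Hypothetical helper, for illustration only.)
fn parse_compute_cap(output: &str) -> Result<u32, String> {
    // Skip blank lines and surrounding whitespace.
    let mut lines = output.lines().map(str::trim).filter(|l| !l.is_empty());

    // The first non-empty line must be the CSV header.
    match lines.next() {
        Some("compute_cap") => {}
        Some(other) => return Err(format!("unexpected nvidia-smi output: {other}")),
        None => return Err("nvidia-smi printed nothing".to_string()),
    }

    // The next line holds the value for the first GPU, e.g. "7.5".
    let value = lines
        .next()
        .ok_or_else(|| "missing compute_cap value".to_string())?;

    value
        .replace('.', "")
        .parse::<u32>()
        .map_err(|e| format!("cannot parse compute_cap {value:?}: {e}"))
}
```

With this shape, the driver/library mismatch would be reported as a message containing "Failed to initialize NVML: Driver/library version mismatch" rather than as a failed equality assertion.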
Ahh okay, so are you still having the original error then about hmin/hmax?
Maybe we should be hooking into driver versions instead of GPU_ARCH for the ifdefs? I wonder if that's available...
Hmm, it seems like getting the driver version is limited to runtime. 🤔
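One thing that is available at build time is the CUDA toolkit release, since nvcc --version prints it. Whether the toolkit release is the right thing to key the ifdefs on has not been established in this thread; the sketch below only shows how a build script could extract it. The name parse_nvcc_release is hypothetical.

```rust
/// Extracts (major, minor) from the "release X.Y" part of the text printed
/// by `nvcc --version`, e.g. "Cuda compilation tools, release 12.1, V12.1.105".
/// Returns None if no such part is found.
/// (Hypothetical helper, for illustration only.)
fn parse_nvcc_release(version_output: &str) -> Option<(u32, u32)> {
    // Take everything after the first "release " marker.
    let rest = version_output.split("release ").nth(1)?;
    // The version token ends at the next comma or whitespace.
    let token = rest
        .split(|c: char| c == ',' || c.is_whitespace())
        .next()?;
    let (major, minor) = token.split_once('.')?;
    Some((major.parse().ok()?, minor.parse().ok()?))
}
```

For the nvcc output posted in this thread, this gives (12, 1). nvcc also exposes its own version to the code it compiles through the predefined macros __CUDACC_VER_MAJOR__ and __CUDACC_VER_MINOR__, which may be another option for the guards.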
I also get this error. I set up a VS Code dev container with the following .devcontainer/devcontainer.json:
{
    "name": "Rust",
    "image": "nvidia/cuda:12.1.1-devel-ubuntu20.04",
    "runArgs": [
        "--gpus",
        "all"
    ],
    "features": {
        "ghcr.io/devcontainers/features/rust:1": {}
    }
}
Running nvidia-smi in the container gives:
root@e5d2279e80a7:/workspaces/dfdx# nvidia-smi
Sat May 13 11:23:01 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:1D:00.0  On |                  N/A |
|  0%   39C    P0    N/A /  90W |   1273MiB /  4096MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Running cargo test --features "cuda"
root@e5d2279e80a7:/workspaces/dfdx# cargo test --features "cuda"
Compiling dfdx v0.11.2 (/workspaces/dfdx)
error: failed to run custom build command for `dfdx v0.11.2 (/workspaces/dfdx)`
Caused by:
process didn't exit successfully: `/workspaces/dfdx/target/debug/build/dfdx-30e6be024c8b3335/build-script-build` (exit status: 101)
--- stdout
cargo:rerun-if-changed=build.rs
cargo:rustc-env=CUDA_INCLUDE_DIR=/usr/local/cuda/include
cargo:rerun-if-changed=src/tensor_ops/utilities/binary_op_macros.cuh
cargo:rerun-if-changed=src/tensor_ops/utilities/compatibility.cuh
cargo:rerun-if-changed=src/tensor_ops/utilities/cuda_utils.cuh
cargo:rerun-if-changed=src/tensor_ops/utilities/unary_op_macros.cuh
cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP
cargo:rustc-env=CUDA_COMPUTE_CAP=sm_61
cargo:rerun-if-changed=src/optim/adam/adam.cu
cargo:rerun-if-changed=src/optim/rmsprop/rmsprop.cu
cargo:rerun-if-changed=src/optim/sgd/sgd.cu
cargo:rerun-if-changed=src/tensor_ops/abs/abs.cu
cargo:rerun-if-changed=src/tensor_ops/add/binary_add.cu
cargo:rerun-if-changed=src/tensor_ops/add/scalar_add.cu
cargo:rerun-if-changed=src/tensor_ops/attention_reshape/attention_reshape.cu
cargo:rerun-if-changed=src/tensor_ops/axpy/axpy.cu
cargo:rerun-if-changed=src/tensor_ops/bce/bce.cu
cargo:rerun-if-changed=src/tensor_ops/boolean/boolean.cu
cargo:rerun-if-changed=src/tensor_ops/choose/choose.cu
cargo:rerun-if-changed=src/tensor_ops/clamp/clamp.cu
cargo:rerun-if-changed=src/tensor_ops/cmp/cmp.cu
cargo:rerun-if-changed=src/tensor_ops/conv2d/conv2d.cu
cargo:rerun-if-changed=src/tensor_ops/convtrans2d/convtrans2d.cu
cargo:rerun-if-changed=src/tensor_ops/cos/cos.cu
cargo:rerun-if-changed=src/tensor_ops/div/binary_div.cu
cargo:rerun-if-changed=src/tensor_ops/div/scalar_div.cu
cargo:rerun-if-changed=src/tensor_ops/dropout/dropout.cu
cargo:rerun-if-changed=src/tensor_ops/exp/exp.cu
cargo:rerun-if-changed=src/tensor_ops/gelu/gelu.cu
cargo:rerun-if-changed=src/tensor_ops/huber_error/huber_error.cu
cargo:rerun-if-changed=src/tensor_ops/ln/ln.cu
cargo:rerun-if-changed=src/tensor_ops/max_to/max_to.cu
cargo:rerun-if-changed=src/tensor_ops/maximum/maximum.cu
cargo:rerun-if-changed=src/tensor_ops/min_to/min_to.cu
cargo:rerun-if-changed=src/tensor_ops/minimum/minimum.cu
cargo:rerun-if-changed=src/tensor_ops/mul/binary_mul.cu
cargo:rerun-if-changed=src/tensor_ops/mul/scalar_mul.cu
cargo:rerun-if-changed=src/tensor_ops/nans_to/nans_to.cu
cargo:rerun-if-changed=src/tensor_ops/negate/negate.cu
cargo:rerun-if-changed=src/tensor_ops/pool2d/pool2d.cu
cargo:rerun-if-changed=src/tensor_ops/pow/pow.cu
cargo:rerun-if-changed=src/tensor_ops/recip/recip.cu
cargo:rerun-if-changed=src/tensor_ops/relu/relu.cu
cargo:rerun-if-changed=src/tensor_ops/roll/roll.cu
cargo:rerun-if-changed=src/tensor_ops/select_and_gather/gather.cu
cargo:rerun-if-changed=src/tensor_ops/select_and_gather/select.cu
cargo:rerun-if-changed=src/tensor_ops/sigmoid/sigmoid.cu
cargo:rerun-if-changed=src/tensor_ops/sin/sin.cu
cargo:rerun-if-changed=src/tensor_ops/slice/slice.cu
cargo:rerun-if-changed=src/tensor_ops/sqrt/sqrt.cu
cargo:rerun-if-changed=src/tensor_ops/square/square.cu
cargo:rerun-if-changed=src/tensor_ops/sub/binary_sub.cu
cargo:rerun-if-changed=src/tensor_ops/sub/scalar_sub.cu
cargo:rerun-if-changed=src/tensor_ops/sum_to/sum_to.cu
cargo:rerun-if-changed=src/tensor_ops/tanh/tanh.cu
cargo:rerun-if-changed=src/tensor_ops/upscale2d/upscale2d.cu
--- stderr
thread 'main' panicked at 'nvcc error while compiling "src/optim/adam/adam.cu":
# stdout
# stderr
src/tensor_ops/utilities/compatibility.cuh(9): error: function "__hmax" has already been defined
__attribute__((device)) __inline__ __attribute__((always_inline)) __half __hmax(__half a, __half b) {
^
src/tensor_ops/utilities/compatibility.cuh(12): error: function "__hmin" has already been defined
__attribute__((device)) __inline__ __attribute__((always_inline)) __half __hmin(__half a, __half b) {
^
2 errors detected in the compilation of "src/optim/adam/adam.cu".
', build.rs:197:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
nvcc --version
root@e5d2279e80a7:/workspaces/dfdx# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
When I try to compile dfdx while using cuda, I get the following error
My guess is that it's related to the compatibility fix for compute_cap 75, which I think I had, but I have since updated my drivers, so now the function is already defined.