Open quietlychris opened 1 year ago
Edited because I'm occasionally Very Dumb(TM) and forgot to actually run this with the GPU passthrough. The following comment is now accurate.
Edit 2: Except this doesn't work with the PyTorch example in `image-classification` either, so I guess maybe it's something about the NVIDIA Docker image itself :upside_down_face: I'll update whenever I happen to regain the willpower to continue exploring this.
I basically had to remove the `nvidia-smi` section of the build script in favor of `nvcc`. That allows `dfdx` to compile with `cuda` enabled, but it can't actually run the test suite, which exits with this error:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE, "forward compatibility was attempted on non supported HW")', /usr/local/cargo/git/checkouts/cudarc-2602ad613d9c0487/cc9a8d3/src/driver/safe/core.rs:50:24
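For anyone wanting to try the same `nvcc`-based detection, here is a minimal sketch of what such a check could look like. It is not the actual `dfdx` or `cudarc` build script: the function names `parse_nvcc_release` and `detect_cuda_version` are made up for illustration, and it assumes `nvcc --version` prints a line of the form `Cuda compilation tools, release 11.2, V11.2.152`.

```rust
use std::process::Command;

/// Hypothetical helper: extract (major, minor) from `nvcc --version` output,
/// e.g. from the line "Cuda compilation tools, release 11.2, V11.2.152".
fn parse_nvcc_release(output: &str) -> Option<(u32, u32)> {
    // Take the text right after "release " and stop at the next comma.
    let rest = output.split("release ").nth(1)?;
    let version = rest.split(',').next()?.trim();
    let mut parts = version.split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    let minor: u32 = parts.next()?.parse().ok()?;
    Some((major, minor))
}

/// Hypothetical helper: run `nvcc --version` and parse the result.
/// Returns None if nvcc is missing, fails, or prints something unexpected.
fn detect_cuda_version() -> Option<(u32, u32)> {
    let out = Command::new("nvcc").arg("--version").output().ok()?;
    if !out.status.success() {
        return None;
    }
    parse_nvcc_release(&String::from_utf8_lossy(&out.stdout))
}

fn main() {
    match detect_cuda_version() {
        Some((major, minor)) => println!("found CUDA toolkit {major}.{minor}"),
        None => println!("nvcc not found on PATH, or its output was not understood"),
    }
}
```

Note that this only reports the toolkit version used at compile time. The runtime error above comes from the driver side, so a check like this would not catch it.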
In addition, it's recommended to add `pkg-config` and `libssl-dev` to the `apt-get install` list.
I'm not sure if you got it working, but I'm trying to learn ML while using this lib, and this is my Dockerfile dev env:
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04
# basic tools
RUN apt update \
    && apt install -y --no-install-recommends \
    git vim openssh-client gnupg curl wget ca-certificates unzip zip less zlib1g sudo coreutils sed grep
#
# cargo/rust
ENV RUSTUP_HOME=/usr/local/rustup
ENV CARGO_HOME=/usr/local/cargo
ENV PATH=/usr/local/cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# https://blog.rust-lang.org/2022/06/22/sparse-registry-testing.html
ENV CARGO_UNSTABLE_SPARSE_REGISTRY=true
RUN set -eux; \
    apt update \
    && apt install -y --no-install-recommends \
    ca-certificates gcc build-essential; \
    url="https://static.rust-lang.org/rustup/dist/x86_64-unknown-linux-gnu/rustup-init"; \
    wget "$url"; \
    chmod +x rustup-init; \
    ./rustup-init -y --no-modify-path --default-toolchain nightly; \
    rm rustup-init; \
    chmod -R a+w $RUSTUP_HOME $CARGO_HOME; \
    rustup --version; \
    cargo --version; \
    rustc --version;
#
# https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#environment-setup
RUN echo "export PATH=/usr/local/cuda-12.1/bin${PATH:+:${PATH}}" >> ~/.bashrc
That's for:
[dependencies.dfdx]
version = "0.13.0"
default-features = false
features = [
    "std",
    "fast-alloc",
    "cpu",
    "cuda",
    "cudnn",
    "safetensors",
    "numpy",
    "nightly",
]
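One thing that may be worth checking with this setup: the Dockerfile appends the CUDA bin directory to `~/.bashrc`, which is only read by interactive bash shells, so a non-interactive `cargo build` might not see `nvcc`. Below is a small sketch for verifying whether `nvcc` is visible on the current `PATH`. The helper name `find_in_path` is hypothetical, and it uses only the Rust standard library.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Hypothetical helper: return the first directory entry in `path_var`
/// that contains a regular file called `name`.
fn find_in_path(name: &str, path_var: &OsStr) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(name))
        .find(|candidate| candidate.is_file())
}

fn main() {
    // Fall back to an empty PATH if the variable is unset.
    let path_var = env::var_os("PATH").unwrap_or_default();
    match find_in_path("nvcc", &path_var) {
        Some(p) => println!("nvcc found at {}", p.display()),
        None => println!(
            "nvcc is not on PATH; the Dockerfile above expects it in /usr/local/cuda-12.1/bin"
        ),
    }
}
```

Running this inside the container, both from an interactive shell and via a plain `docker run <image> <binary>`, should show whether the two environments differ.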
Hello,
I spent part of this afternoon banging my head against a wall trying to get `dfdx` with the `cuda` feature enabled up and running on my computer. It turns out a big part of this appeared to be that my CUDA version (11.2) doesn't really work well with the `build.rs` script, with errors appearing in multiple steps. As I think I may have mentioned in previous issues, my set-up isn't particularly exotic (just the recent Pop!_OS release with the default NVIDIA drivers), so I suspect that other folks may run into the same issue.

According to System76's docs, the recommended way of dealing with a CUDA version mismatch is just to use Docker. While this isn't ideal (I don't love having to rely on Docker), I can confirm that this solved most of my build issues, by first following the GPU-enabled container instructions in those docs, then building a `dfdx`-specific container using the Dockerfile below (which takes a hot minute to build).

I was just thinking that it might be worth considering adding this kind of process into the crate's documentation to help other people that may run into the same issue, at least until it becomes clear that the base NVIDIA-enabled system configurations being shipped with distros like Pop!_OS/Ubuntu are able to support `dfdx`'s build script.