EricLBuehler / mistral.rs

Blazingly fast LLM inference.
MIT License

Bug: Container image fails to start due to CUDA version mismatch #596

Open sammcj opened 1 month ago

sammcj commented 1 month ago

Describe the bug

When running the latest official cuda-86 image (ghcr.io/ericlbuehler/mistral.rs:cuda-86-latest), the mistralrs server fails to start with an error saying it can't find the CUDA library via LD_LIBRARY_PATH:

Unable to dynamically load the "cuda" shared library - searched for library names: ["cuda", "nvcuda"]. Ensure that `LD_LIBRARY_PATH` has the correct path to the installed library. If the shared library is present on the system under a different name than one of those listed above, please open a GitHub issue.
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: cudarc::panic_no_lib_found
   3: std::sys::sync::once::futex::Once::call
   4: std::sync::once_lock::OnceLock<T>::initialize
   5: cudarc::driver::safe::core::CudaDevice::new
   6: <candle_core::cuda_backend::device::CudaDevice as candle_core::backend::BackendDevice>::new
   7: candle_core::device::Device::cuda_if_available
   8: mistralrs_server::main::{{closure}}
   9: tokio::runtime::park::CachedParkThread::block_on
  10: tokio::runtime::context::runtime::enter_runtime
  11: tokio::runtime::runtime::Runtime::block_on
  12: mistralrs_server::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
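
To check whether any directory on the search path actually contains the driver library the panic complains about, something like the following can be run inside the container. This is a minimal sketch: `find_libcuda` is a hypothetical helper name, and it only inspects the directories you pass in, while the dynamic loader also consults the ld.so cache and default system directories (`ldconfig -p | grep libcuda` covers the cache).

```shell
# Hypothetical helper: print every libcuda.so* file found in the
# colon-separated directory list given as $1 (e.g. "$LD_LIBRARY_PATH").
# The body runs in a subshell, so the IFS change does not leak out.
find_libcuda() (
    IFS=:
    for dir in $1; do
        for f in "$dir"/libcuda.so*; do
            # An unmatched glob stays literal, so test that the file exists.
            if [ -e "$f" ]; then
                printf '%s\n' "$f"
            fi
        done
    done
)

# Search the current library path; prints nothing if libcuda is absent.
find_libcuda "${LD_LIBRARY_PATH:-}"
```

No output means none of the listed directories provides `libcuda.so`, which would match the panic above.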

Overriding LD_LIBRARY_PATH to include the path to libcuda.so inside the container (/usr/local/cuda-12.4/compat) then reveals that the problem is a CUDA version mismatch:

LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-12.4/compat


Error: DriverError(CUDA_ERROR_SYSTEM_DRIVER_MISMATCH, "system has unsupported display driver / cuda driver combination")
   0: candle_core::error::Error::bt
   1: <candle_core::cuda_backend::device::CudaDevice as candle_core::backend::BackendDevice>::new
   2: candle_core::device::Device::cuda_if_available
   3: mistralrs_server::main::{{closure}}
   4: tokio::runtime::park::CachedParkThread::block_on
   5: tokio::runtime::context::runtime::enter_runtime
   6: tokio::runtime::runtime::Runtime::block_on
   7: mistralrs_server::main
   8: std::sys::backtrace::__rust_begin_short_backtrace
   9: std::rt::lang_start::{{closure}}
  10: std::rt::lang_start_internal
  11: main
  12: <unknown>
  13: __libc_start_main
  14: _start

Stack backtrace:
   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
   1: mistralrs_server::main::{{closure}}
   2: tokio::runtime::park::CachedParkThread::block_on
   3: tokio::runtime::context::runtime::enter_runtime
   4: tokio::runtime::runtime::Runtime::block_on
   5: mistralrs_server::main
   6: std::sys::backtrace::__rust_begin_short_backtrace
   7: std::rt::lang_start::{{closure}}
   8: std::rt::lang_start_internal
   9: main
  10: <unknown>
  11: __libc_start_main
  12: _start

docker inspect shows the env to be:

            "Env": [
                "KEEP_ALIVE_INTERVAL=100",
                "RUST_BACKTRACE=1",
                "LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-12.4",
                "PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "NVARCH=x86_64",
                "NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536",
                "NV_CUDA_CUDART_VERSION=12.4.127-1",
                "NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-4",
                "CUDA_VERSION=12.4.1",
                "NVIDIA_VISIBLE_DEVICES=all",
                "NVIDIA_DRIVER_CAPABILITIES=compute,utility",
                "NV_CUDA_LIB_VERSION=12.4.1-1",
                "NV_NVTX_VERSION=12.4.127-1",
                "NV_LIBNPP_VERSION=12.2.5.30-1",
                "NV_LIBNPP_PACKAGE=libnpp-12-4=12.2.5.30-1",
                "NV_LIBCUSPARSE_VERSION=12.3.1.170-1",
                "NV_LIBCUBLAS_PACKAGE_NAME=libcublas-12-4",
                "NV_LIBCUBLAS_VERSION=12.4.5.8-1",
                "NV_LIBCUBLAS_PACKAGE=libcublas-12-4=12.4.5.8-1",
                "NV_LIBNCCL_PACKAGE_NAME=libnccl2",
                "NV_LIBNCCL_PACKAGE_VERSION=2.21.5-1",
                "NCCL_VERSION=2.21.5-1",
                "NV_LIBNCCL_PACKAGE=libnccl2=2.21.5-1+cuda12.4",
                "NVIDIA_PRODUCT_NAME=CUDA",
                "NV_CUDNN_VERSION=9.1.0.70-1",
                "NV_CUDNN_PACKAGE_NAME=libcudnn9-cuda-12",
                "NV_CUDNN_PACKAGE=libcudnn9-cuda-12=9.1.0.70-1",
                "HUGGINGFACE_HUB_CACHE=/data",
                "PORT=80",
                "RAYON_NUM_THREADS=8"

Latest commit

Host system:

sammcj commented 1 month ago

FYI I did a custom build with CUDA 12.5.1 as the base image and it had the same issue.

JackCloudman commented 6 days ago

Same issue here with 1x A6000 + 1x 4090:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

CUDA Version: 12.2
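
A note on reading the two version numbers above: `nvcc --version` reports the installed CUDA toolkit (11.5 here), while a `CUDA Version: 12.2` line looks like the header of `nvidia-smi` output, which would make it the newest CUDA version the installed driver supports (an assumption; the comment does not say where the line came from). If so, an image built on CUDA 12.4 is newer than what that driver supports. A minimal shell sketch of the comparison, using the hypothetical helper `cuda_version_le` and version strings copied from this thread:

```shell
# Hypothetical helper: succeeds if version $1 <= version $2.
# Only major.minor are compared; both arguments need at least one dot.
cuda_version_le() (
    a_major=${1%%.*}
    a_rest=${1#*.}
    a_minor=${a_rest%%.*}
    b_major=${2%%.*}
    b_rest=${2#*.}
    b_minor=${b_rest%%.*}
    if [ "$a_major" -ne "$b_major" ]; then
        [ "$a_major" -lt "$b_major" ]
    else
        [ "$a_minor" -le "$b_minor" ]
    fi
)

image_cuda=12.4.1   # CUDA_VERSION from the docker inspect output
driver_cuda=12.2    # "CUDA Version" line, assumed to come from nvidia-smi

if cuda_version_le "$image_cuda" "$driver_cuda"; then
    echo "driver supports the image's CUDA version"
else
    echo "image CUDA $image_cuda is newer than driver-supported $driver_cuda"
fi
```

The toolkit version from `nvcc` matters when compiling on the host. For running a prebuilt image, the driver-supported version is the one to compare against.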

JackCloudman commented 6 days ago

I fixed it as follows:

  1. Uninstall CUDA:
    sudo apt-get --purge remove "*cuda*"
    sudo apt-get autoremove
  2. Follow the steps at https://developer.nvidia.com/cuda-downloads. I used the following commands:
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt-get update
    sudo apt-get -y install cuda-12-4
  3. Restart.

After this, the code compiled correctly. Remember to add the CUDA library directory to your LD_LIBRARY_PATH.
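
The reminder about the library path can be sketched as follows. This is a minimal example, not the commenter's exact setup: `append_path` is a hypothetical helper, and `/usr/local/cuda-12.4/lib64` is an assumed install location that should be adjusted to wherever the toolkit actually lives on your machine.

```shell
# Hypothetical helper: print the colon-separated list $1 with directory $2
# appended, unless $2 is already one of its entries.
append_path() (
    list=$1
    dir=$2
    case ":$list:" in
        *":$dir:"*)
            printf '%s\n' "$list"
            ;;
        *)
            if [ -n "$list" ]; then
                printf '%s:%s\n' "$list" "$dir"
            else
                printf '%s\n' "$dir"
            fi
            ;;
    esac
)

# Assumed toolkit location; change this to match your install.
CUDA_LIB_DIR=/usr/local/cuda-12.4/lib64

LD_LIBRARY_PATH=$(append_path "${LD_LIBRARY_PATH:-}" "$CUDA_LIB_DIR")
export LD_LIBRARY_PATH
```

Putting these lines in a shell profile makes the setting persist across sessions. The duplicate check keeps the variable from growing each time the profile is sourced.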