EricLBuehler / mistral.rs

Blazingly fast LLM inference.
MIT License

0.3.1 #862 new build failure, stop at mistralrs-quant #866

Open misureaudio opened 1 week ago

misureaudio commented 1 week ago

Minimum reproducible example

cargo build --release --features cuda

Error

error: failed to run custom build command for mistralrs-quant v0.3.1 (C:\Users\misur\Desktop\rustsrc\mistral.rs.0.3.1.0862\mistralrs-quant)

Caused by:
  process didn't exit successfully: C:\Users\misur\Desktop\rustsrc\mistral.rs.0.3.1.0862\target\release\build\mistralrs-quant-a5b0a5658b3f8319\build-script-build (exit code: 101)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rerun-if-changed=kernels/gptq/q_gemm.cu
  cargo:rerun-if-changed=kernels/hqq/hqq.cu
  cargo:rerun-if-changed=kernels/ops/ops.cu
  cargo:rerun-if-changed=kernels/marlin/marlin_kernel.cu
  cargo:info=["/usr", "/usr/local/cuda", "/opt/cuda", "/usr/lib/cuda", "C:/Program Files/NVIDIA GPU Computing Toolkit", "C:/CUDA"]
  cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP
  cargo:rustc-env=CUDA_COMPUTE_CAP=75

Other information

Operating system: Windows 11

  Sat Oct 19 14:49:11 2024
  +-----------------------------------------------------------------------------------------+
  | NVIDIA-SMI 565.90                 Driver Version: 565.90         CUDA Version: 12.7     |
  |-----------------------------------------+------------------------+----------------------+
  | GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
  |                                         |                        |               MIG M. |
  |=========================================+========================+======================|
  |   0  NVIDIA GeForce GTX 1650 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
  | N/A   46C    P8              3W /   40W |       0MiB /   4096MiB |      0%      Default |
  |                                         |                        |                  N/A |
  +-----------------------------------------+------------------------+----------------------+

  +-----------------------------------------------------------------------------------------+
  | Processes:                                                                              |
  |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
  |        ID   ID                                                               Usage      |
  |=========================================================================================|
  |  No running processes found                                                             |
  +-----------------------------------------------------------------------------------------+

Latest commit or version

0.3.1 #862

misureaudio commented 1 week ago

No problem in building #820

#862 builds on one of my Win11 laptops, with all relevant libraries and software on disk C:.

#862 doesn't build on a second laptop, which has the CUDA toolkit on D:; however, all preceding #xxx built fine there.

#862 doesn't build on a third laptop either, with all relevant libraries and software on C:.

misureaudio commented 1 week ago

BTW, #859 (pre metal fix) compiled with no issue.

gfxenjoyer commented 1 week ago

kernels/marlin/marlin_kernel.cu fails to compile on --gpu-architecture=sm_75. Seems to work fine for 80, 86, 89, and 90. I manually tested by setting $env:CUDA_COMPUTE_CAP="75".

misureaudio commented 1 week ago

Fine on a Quadro A2000, CI 86

misureaudio commented 1 week ago

OK on 4070 laptop, CI 89.
No go on GTX1650, CI 75.
No go on GTX1070, CI 61.

DenisBobrovskiy commented 1 week ago

Pretty sure it is caused by the Marlin kernel support that was added in #856. Try falling back to #848. Marlin kernels are built for Compute Capability 8+.

misureaudio commented 1 week ago

Would it be feasible to keep backward compatibility? Even a GTX 1080 with CI 6.1 and 8 GB of VRAM could be a useful asset. Slower execution would be fine, as long as one can follow future developments (essentially, support for new models).

EricLBuehler commented 1 week ago

@misureaudio that makes sense. I'll merge a nice solution!

EricLBuehler commented 6 days ago

@misureaudio @DenisBobrovskiy I just merged #878 which only compiles & runs the Marlin kernels if the compute cap is appropriate, can you please confirm if it works?
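For readers following along, the idea of compiling the Marlin kernels only when the compute capability allows it can be sketched roughly as below. This is not the code merged in #878, only a minimal illustration: the function `kernels_to_build`, the constants, and the exact threshold of 800 (compute capability 8.0 scaled by 100, following the "8+" mentioned earlier in the thread) are assumptions. The kernel paths are the ones listed in the build output above.

```rust
// Hypothetical sketch of compute-capability gating in a build script.
// All names and the threshold are illustrative assumptions, not the
// actual mistralrs-quant build.rs.

/// Assumed minimum compute capability for Marlin, scaled by 100 (8.0 -> 800).
const MARLIN_MIN_CAP_X100: usize = 800;

/// Kernels that are built regardless of compute capability.
const BASE_KERNELS: [&str; 3] = [
    "kernels/gptq/q_gemm.cu",
    "kernels/hqq/hqq.cu",
    "kernels/ops/ops.cu",
];

/// The kernel that fails to compile on older architectures.
const MARLIN_KERNEL: &str = "kernels/marlin/marlin_kernel.cu";

/// Returns the kernel sources to compile for a given scaled compute capability.
fn kernels_to_build(cap_x100: usize) -> Vec<&'static str> {
    let mut kernels = BASE_KERNELS.to_vec();
    if cap_x100 >= MARLIN_MIN_CAP_X100 {
        kernels.push(MARLIN_KERNEL);
    }
    kernels
}

fn main() {
    // 610 = GTX 1070/1080, 750 = GTX 1650, 860 = Quadro A2000 (from the thread).
    for cap in [610, 750, 860] {
        let kernels = kernels_to_build(cap);
        println!("cap {cap}: {} kernel file(s)", kernels.len());
    }
}
```

With this shape, the cards reported as failing in this thread (CI 61 and 75) would simply skip the Marlin source file, while 80 and above would still get it.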

DenisBobrovskiy commented 6 days ago

@EricLBuehler fails at

  thread 'main' panicked at mistralrs-quant\build.rs:19:64:
  called `Result::unwrap()` on an `Err` value: ParseFloatError { kind: Invalid }

in this code: (output.split('\n').nth(1).unwrap().parse::<f32>().unwrap() * 100.) as usize. I think it is because the output is not trimmed; this fixed it for me: (output.split('\n').nth(1).unwrap().trim().parse::<f32>().unwrap() * 100.) as usize
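A small standalone sketch of why the `.trim()` matters, under the assumption that the command output uses Windows line endings (`\r\n`): splitting on `'\n'` leaves a trailing `\r` on the value, and Rust's `f32` parser rejects surrounding whitespace. The sample string, including its `compute_cap` header line, and the two helper functions are hypothetical and only stand in for whatever the build script actually reads.

```rust
use std::num::ParseFloatError;

/// Mirrors the failing expression: takes the second line and parses it as-is.
fn parse_cap_untrimmed(output: &str) -> Result<usize, ParseFloatError> {
    let line = output.split('\n').nth(1).unwrap_or("");
    Ok((line.parse::<f32>()? * 100.) as usize)
}

/// Same expression with the suggested `.trim()` before parsing.
fn parse_cap_trimmed(output: &str) -> Result<usize, ParseFloatError> {
    let line = output.split('\n').nth(1).unwrap_or("");
    Ok((line.trim().parse::<f32>()? * 100.) as usize)
}

fn main() {
    // Assumed output shape: a header line, then the value, with CRLF endings.
    let windows_output = "compute_cap\r\n7.5\r\n";

    // The second line is "7.5\r", which is not a valid float literal.
    assert!(parse_cap_untrimmed(windows_output).is_err());

    // After trimming, "7.5" parses and scales to 750.
    assert_eq!(parse_cap_trimmed(windows_output), Ok(750));

    println!("untrimmed: {:?}", parse_cap_untrimmed(windows_output));
    println!("trimmed:   {:?}", parse_cap_trimmed(windows_output));
}
```

Returning a `Result` instead of calling `unwrap()` is a choice made here only so the failure can be shown without panicking; the expression quoted above panics instead.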

misureaudio commented 6 days ago

@EricLBuehler @DenisBobrovskiy, both mods are needed, and then everything works: build, install, and mistralrs-server.exe running on a GTX1080, CI 6.1:

  2024-10-24T08:36:26.399093Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 6.1
  2024-10-24T08:36:26.399315Z INFO mistralrs_core::utils::normal: Skipping BF16 because CC < 8.0
  2024-10-24T08:36:26.510733Z INFO mistralrs_core::utils::normal: DType selected is F16.
  100%|████████████████████████████████████████████████████████████████████████████████████| 85/85 [02:56<00:00, 0.54it/s]
  100%|████████████████████████████████████████████████████████████████████████████████| 507/507 [03:24<00:00, 190.28it/s]
  2024-10-24T08:39:56.301282Z INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Some(Q4K) to 129 tensors.
  2024-10-24T08:39:56.302138Z INFO mistralrs_core::pipeline::isq: Applying ISQ on 12 threads.
  [00:00:15] [###################################>----] 113/129 (2s)

Confirmed!

Thank you very much!

DenisBobrovskiy commented 6 days ago

@EricLBuehler #880 should fix the issue I mentioned.