rplescia opened 1 month ago
I am having the same issue.
The fact that it complains the kernel wasn't compiled for an arch (700) that then appears in the very list of compiled arches is rather funny.
```
mmq.cuh:2422: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 700. ggml-cuda.cu was compiled for: 500,520,530,600,610,620,700,720,750,800,860,870,890,900
```
Environment: Rocky Linux 9.3, Python 3.11, NVIDIA driver 545, CUDA 12.3, V100 32GB (CC 7.0 → arch 700)
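For anyone puzzled by the arch numbers: the "arch 700" in the error is just the GPU's compute capability with the dot dropped (CC 7.0 → 700, CC 8.6 → 860). A small helper can make the mapping explicit; `cc_to_arch` and the `nvidia-smi` query below are my own illustration, not part of llama.cpp:

```python
import subprocess

def cc_to_arch(cc: str) -> int:
    """Map a compute capability string like "7.0" to the arch number (700)
    that appears in ggml-cuda's error messages."""
    major, minor = cc.strip().split(".")
    return int(major) * 100 + int(minor) * 10

def detected_archs() -> list[int]:
    # Recent drivers let nvidia-smi report compute capability per GPU
    # directly; verify the flag is available on your driver version.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        text=True,
    )
    return [cc_to_arch(line) for line in out.splitlines() if line.strip()]
```

If the number returned for your GPU is missing from the "compiled for" list in the error, the wheel genuinely lacks device code for your card; here it is present, which is what makes the failure confusing.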
nvidia smi - """+---------------------------------------------------------------------------------------+ | NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 | |-----------------------------------------+----------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+======================+======================| | 0 Tesla V100-PCIE-16GB Off | 00000001:00:00.0 Off | 0 | | N/A 34C P0 37W / 250W | 3096MiB / 16384MiB | 0% Default | | | | N/A | +-----------------------------------------+----------------------+----------------------+ | 1 Tesla V100-PCIE-16GB Off | 00000002:00:00.0 Off | Off | | N/A 34C P0 37W / 250W | 3316MiB / 16384MiB | 0% Default | | | | N/A | +-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+""" . llamacpp python version >0.2.85 to leverage llama 3.1. Llama quant version initialization: """ llm = Llama.from_pretrained(
repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
filename="Meta-Llama-3.1-8B-Instruct-Q4_K_L.gguf",
n_ctx=4096,
n_gpu_layers=-1
)
""". """ ggml_cuda_compute_forward: ROPE failed
CUDA error: unspecified launch failure
current device: 0, in function ggml_cuda_compute_forward at /home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml-cuda.cu:2313
err
/home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 700. ggml-cuda.cu was compiled for: 500,520,530,600,610,620,700,720,750,800,860,870,890,900
/home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 700. ggml-cuda.cu was compiled for: 500,520,530,600,610,620,700,720,750,800,860,870,890,900""". ROPE is failing. For large token count inputs, the above error occurs and inferencing occurs on CPU only. Please advice on resolution
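One thing worth ruling out when a prebuilt wheel hits this kind of arch mismatch is a mismatched binary: forcing a source build with the target arch pinned explicitly. This is a sketch only; the exact CMake flag names (`GGML_CUDA`, `CMAKE_CUDA_ARCHITECTURES`) should be verified against the llama-cpp-python version in use:

```shell
# Force a from-source rebuild of llama-cpp-python targeting compute
# capability 7.0 (CMake arch "70"), skipping any cached prebuilt wheel.
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=70" \
  pip install --force-reinstall --no-cache-dir llama-cpp-python
```

This needs the CUDA toolkit and a compatible host compiler available at build time.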
Expected Behavior
I'm running the llama-cpp-python OpenAI-compatible API servers on my VM that has 1x Nvidia V100 16GB GPU allocated to it. The server can start, but once a request is sent to the server, it falls over.
Current Behavior
When the server receives a request, it fails with a CUDA error. The error is identical to one already reported on the Ollama GitHub page: https://github.com/ollama/ollama/issues/5571
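For context, an OpenAI-style chat request of the shape below is what triggers the failure. This sketch only builds the JSON body; the endpoint path and port are llama-cpp-python server defaults and are assumptions about this particular deployment:

```python
import json

# The llama-cpp-python server exposes an OpenAI-compatible endpoint,
# typically POST http://localhost:8000/v1/chat/completions.
payload = {
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 128,
}
body = json.dumps(payload)
```

Any client that sends such a request (curl, the openai Python package pointed at the local base URL, etc.) reproduces the crash on this setup.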
Environment and Context
```
Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           46 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  6
On-line CPU(s) list:     0-5
Vendor ID:               GenuineIntel
Model name:              Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
CPU family:              6
Model:                   79
Thread(s) per core:      1
Core(s) per socket:      6
Socket(s):               1
Stepping:                1
BogoMIPS:                5187.98
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt md_clear
Virtualization features:
  Hypervisor vendor:     Microsoft
  Virtualization type:   full
Caches (sum of all):
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    1.5 MiB (6 instances)
  L3:                    35 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-5
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         KVM: Mitigation: VMX unsupported
  L1tf:                  Mitigation; PTE Inversion
  Mds:                   Mitigation; Clear CPU buffers; SMT Host state unknown
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
  Srbds:                 Not affected
  Tsx async abort:       Mitigation; Clear CPU buffers; SMT Host state unknown
```

Kernel (`uname -a`, partial): `22.04.1-Ubuntu SMP Mon Jun 17 18:38:57 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux`
Failure Information (for bugs)