abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

CUDA error: unspecified launch failure on inference on Nvidia V100 GPUs #1624

Open rplescia opened 1 month ago

rplescia commented 1 month ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

I'm running the llama-cpp-python OpenAI-compatible API server on a VM with a single Nvidia V100 16GB GPU allocated to it. The server starts up fine, but as soon as it receives a request it crashes.
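For reference, any completion request is enough to trigger the crash once the server is up. Below is a minimal client-side reproduction sketch (it assumes the server's default host/port and that the requests package is installed; the exact payload is not important):

import requests

# Hit the OpenAI-compatible chat completions endpoint exposed by the server.
# With GPU offload enabled, this single request is enough to trigger the abort.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32,
    },
    timeout=120,
)
print(resp.status_code, resp.text)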

Current Behavior

When the server receives a request, it aborts with a CUDA error. The error is identical to one already reported on the Ollama issue tracker: https://github.com/ollama/ollama/issues/5571

/home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2422: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 700. ggml-cuda.cu was compiled for: 500,520,530,600,610,620,700,720,750,800,860,870,890,900
ggml_cuda_compute_forward: ROPE failed
CUDA error: unspecified launch failure
  current device: 0, in function ggml_cuda_compute_forward at /home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml-cuda.cu:2287
  err
GGML_ASSERT: /home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml-cuda.cu:101: !"CUDA error"
Aborted (core dumped)
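Since the assert specifically names CUDA arch 700, it is worth confirming what the installed wheel and the driver report. A small diagnostic sketch (assuming a driver new enough to support nvidia-smi's compute_cap query field):

import subprocess
import llama_cpp

# Version of the installed binding (the prebuilt CUDA wheels differ per release).
print("llama-cpp-python:", llama_cpp.__version__)

# Compute capability of the visible GPUs; a V100 should report 7.0 (arch 700).
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())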

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           46 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  6
On-line CPU(s) list:     0-5
Vendor ID:               GenuineIntel
Model name:              Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
CPU family:              6
Model:                   79
Thread(s) per core:      1
Core(s) per socket:      6
Socket(s):               1
Stepping:                1
BogoMIPS:                5187.98
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt md_clear
Virtualization features:
  Hypervisor vendor:     Microsoft
  Virtualization type:   full
Caches (sum of all):
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    1.5 MiB (6 instances)
  L3:                    35 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-5
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         KVM: Mitigation: VMX unsupported
  L1tf:                  Mitigation; PTE Inversion
  Mds:                   Mitigation; Clear CPU buffers; SMT Host state unknown
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
  Srbds:                 Not affected
  Tsx async abort:       Mitigation; Clear CPU buffers; SMT Host state unknown

22.04.1-Ubuntu SMP Mon Jun 17 18:38:57 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Python 3.10.12
GNU Make 4.3
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Failure Information (for bugs)

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /home/...../models/mistral-7b-instruct-v0.2.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q5_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.78 GiB (5.67 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100-PCIE-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 85.94 MiB
llm_load_tensors: CUDA0 buffer size = 4807.05 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 296.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'llama.context_length': '32768', 'general.name': 'mistralai_mistral-7b-instruct-v0.2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}
Available chat formats from metadata: chat_template.default
Guessed chat format: mistral-instruct
INFO: Started server process [3357]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
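A possible avenue, untested here, is rebuilding llama-cpp-python from source so the mul_mat_q kernels are compiled with real sm_70 device code instead of relying on the prebuilt wheel. The sketch below uses the documented CMAKE_ARGS/FORCE_CMAKE source-build flow; pinning CMAKE_CUDA_ARCHITECTURES=70 is an assumption about how to target the V100:

import os
import subprocess
import sys

# Rebuild the binding from source, targeting compute capability 7.0 (V100).
# CMAKE_ARGS / FORCE_CMAKE are the documented source-build knobs for llama-cpp-python;
# CMAKE_CUDA_ARCHITECTURES=70 is an assumption about the right way to pin the arch.
env = dict(os.environ)
env["CMAKE_ARGS"] = "-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=70"
env["FORCE_CMAKE"] = "1"

subprocess.run(
    [sys.executable, "-m", "pip", "install",
     "--upgrade", "--force-reinstall", "--no-cache-dir", "llama-cpp-python"],
    env=env,
    check=True,
)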

canoalberto commented 1 month ago

I am having the same issue.

The funny part is that it complains the kernel has no device code for arch 700, even though 700 appears right there in the list of architectures it says the binary was compiled for.

mmq.cuh:2422: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 700. ggml-cuda.cu was compiled for: 500,520,530,600,610,620,700,720,750,800,860,870,890,900

Environment: Rocky Linux 9.3, Python 3.11, NVIDIA driver 545, CUDA 12.3, V100 32GB (CC 7.0 -> arch 700)
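One way to verify whether the shipped binary actually embeds sm_70 device code for these kernels (rather than just listing 700 in the compile-time arch string) is to inspect the bundled shared libraries with cuobjdump. A sketch, assuming the CUDA toolkit is on PATH; the library names and layout inside the installed llama_cpp package vary by release:

import glob
import os
import subprocess
import llama_cpp

# Locate the shared libraries bundled with the installed wheel
# (names like libggml.so / libllama.so; exact layout differs between releases).
pkg_dir = os.path.dirname(llama_cpp.__file__)
libs = glob.glob(os.path.join(pkg_dir, "**", "*.so"), recursive=True)

for lib in libs:
    print("==>", lib)
    # List embedded cubins and their target architectures (e.g. sm_70).
    # check=False: libraries without device code make cuobjdump exit non-zero.
    subprocess.run(["cuobjdump", "--list-elf", lib], check=False)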

shaunck96 commented 2 weeks ago

nvidia smi - """+---------------------------------------------------------------------------------------+ | NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 | |-----------------------------------------+----------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+======================+======================| | 0 Tesla V100-PCIE-16GB Off | 00000001:00:00.0 Off | 0 | | N/A 34C P0 37W / 250W | 3096MiB / 16384MiB | 0% Default | | | | N/A | +-----------------------------------------+----------------------+----------------------+ | 1 Tesla V100-PCIE-16GB Off | 00000002:00:00.0 Off | Off | | N/A 34C P0 37W / 250W | 3316MiB / 16384MiB | 0% Default | | | | N/A | +-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=======================================================================================| +---------------------------------------------------------------------------------------+""" . llamacpp python version >0.2.85 to leverage llama 3.1. Llama quant version initialization: """ llm = Llama.from_pretrained( repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF", filename="Meta-Llama-3.1-8B-Instruct-Q4_K_L.gguf", n_ctx=4096, n_gpu_layers=-1
) """. """ ggml_cuda_compute_forward: ROPE failed CUDA error: unspecified launch failure current device: 0, in function ggml_cuda_compute_forward at /home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml-cuda.cu:2313 err /home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 700. ggml-cuda.cu was compiled for: 500,520,530,600,610,620,700,720,750,800,860,870,890,900 /home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 700. ggml-cuda.cu was compiled for: 500,520,530,600,610,620,700,720,750,800,860,870,890,900""". ROPE is failing. For large token count inputs, the above error occurs and inferencing occurs on CPU only. Please advice on resolution