Open takosalad opened 4 days ago
Does https://github.com/ggerganov/llama.cpp/commit/52fc8705a0617452df08333e1161838726c322b4 still work correctly?
Can you please link the exact model that you were using?
https://huggingface.co/MaziyarPanahi/WizardLM-2-7B-GGUF/blob/main/WizardLM-2-7B.Q8_0.gguf
btw is there a way to compile it for opencl instead of cuda? I only found some python refs when googling for this, but nothing for c. Maybe the problem happens only on cuda, so I'd like to try opencl.
OpenCL was removed because there was no one to maintain it. You can try Vulkan.
Maybe the problem happens only on cuda, so I'd like to try opencl.
It's very likely this is a CUDA-specific problem. That's why I would like you to test https://github.com/ggerganov/llama.cpp/commit/52fc8705a0617452df08333e1161838726c322b4 since that is the last commit before I changed something that I suspect to be the problem.
I can't reproduce the issue. Can you post the GPU and the command you were using?
./llama-cli -m models/WizardLM-2-7B.Q8_0.gguf -t 6 --seed -1 -n -1 --keep -1 --color -i --in-prefix "Human:" --in-suffix "Helper" -f prompts/helper.txt -ngl 255 --interactive-first -c 8192 --temp 0.3 --repeat-penalty 1.1 --top_p 0.8 --top_k 100
Sorry, I confused two models. The one I was using is WizardLM-2-7B-Q8_0-imat.gguf (I don't actually have WizardLM-2-7b.Q8_0.gguf on this system). Surprisingly, I can no longer find exactly this file on HF, only some other variants that seem to be the same (wizardlm2, 7b, q8, imat) but have the "imat" part at a different position in the filename, so I'm not sure they are identical to the file I have.
However, I just tried two other models at random (Llama-3-8B-Instruct-MopeyMule_q8.gguf and Meta-Llama-3-8B-Instruct.Q8_0.gguf) and got exactly the same error on startup, so I don't think it is specific to this one model.
Edit: Googling didn't help me, I only found completely different forums about mining where I read about "virtual memory requiring to be increased" when this error happens, well, in some other situations though. No idea if this is somehow applicable here, I don't even know what "virtual mem" these guys were referring to. Another thread suggested lowering gpu clock. Not sure how to do that either.
Any way to test the gpu/mem for being faulty perhaps?
Can you check whether this fix https://github.com/ggerganov/llama.cpp/pull/8100 works?
Sure. (Just a note: I swapped the graphics card for another one of exactly the same model (2080 Ti 22GB) to make sure this particular card wasn't broken, and got the same error. I assume both cards aren't faulty, so...)
I added the 8100 diffs to ggml-cuda/mmq.cuh, cleared the build directory and rebuilt, still the same problem. :/
I could maybe add any kind of debug code if you tell me which files to edit and where to put it if it helps...
Are you using make or CMake?
cmake .. -DLLAMA_CUDA=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
cmake --build . --config Release
I just realized CMake doesn't have an option for the debugging I need, sorry. I'll maybe try to add it.
Or, if you're up to it, here is how you would do it with make: compile with LLAMA_CUDA=1 LLAMA_DEBUG=1, then run the program under compute-sanitizer (found under /opt/cuda/extras/compute-sanitizer on my system).
How exactly do I do that? I started with cmake -B build -DLLAMA_CUDA=1 -DLLAMA_DEBUG=1 in the llama.cpp folder; it generates the Makefile but also throws this: CMake Warning: Manually-specified variables were not used by the project: LLAMA_DEBUG
In the project root directory:
make llama-cli LLAMA_CUDA=1 LLAMA_DEBUG=1
ok, I was just confused because of the warning that LLAMA_DEBUG has no effect. make'ing now...
Just so there are no misunderstandings: you are not supposed to run any CMake commands at all. In the llama.cpp root directory there is already a Makefile that works without running any commands first. You are supposed to use that one.
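For reference, the two steps described above (build with make, then run under compute-sanitizer) could be scripted roughly as follows. This is only a sketch: the function names build_debug and run_sanitized are made up for illustration, and the fallback path assumes the binary sits inside the /opt/cuda/extras/compute-sanitizer directory mentioned earlier, which may differ on other systems.

```shell
# Build the debug binary with the in-tree Makefile (run from the llama.cpp root).
# The make invocation is the one quoted in this thread.
build_debug() {
    make llama-cli LLAMA_CUDA=1 LLAMA_DEBUG=1
}

# Run any command under compute-sanitizer.
# Assumption: if the tool is not on PATH, fall back to the install
# location mentioned above (binary name inside that directory is assumed).
run_sanitized() {
    sanitizer=$(command -v compute-sanitizer) ||
        sanitizer=/opt/cuda/extras/compute-sanitizer/compute-sanitizer
    "$sanitizer" "$@"
}
```

After build_debug, something like run_sanitized ./llama-cli -m models/WizardLM-2-7B-Q8_0-imat.gguf -ngl 255 -f prompts/helper.txt would then run the model under the sanitizer. Expect it to be much slower than a normal run.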
Oh wow, super slow debug-mode output. Card at 99%, 210 W, not crashing this time, but actually displaying a prompt! I entered "hi", and since each line takes 4.5 s and it showed no sign of stopping, I hit Ctrl+C at some point, as it might have gone on endlessly. It had a good run until "..feel free to", after which it seemed to turn into gibberish.
========= COMPUTE-SANITIZER
Log start
main: build = 3217 (3b099bcd)
main: built with cc (GCC) 14.1.1 20240522 for x86_64-pc-linux-gnu
main: seed = 1719267005
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/WizardLM-2-7B-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = D:\GGUF-Quantization-Script\models
llama_model_loader: - kv 2: llama.vocab_size u32 = 32000
llama_model_loader: - kv 3: llama.context_length u32 = 32768
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.block_count u32 = 32
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 12: general.file_type u32 = 7
llama_model_loader: - kv 13: tokenizer.ggml.model str = llama
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q8_0: 226 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 7.17 GiB (8.50 BPW)
llm_load_print_meta: general.name = D:\GGUF-Quantization-Script\models
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 132.81 MiB
llm_load_tensors: CUDA0 buffer size = 7205.83 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 560.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 24.01 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 560.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
Input prefix: 'Human:'
Input suffix: 'Helper:'
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 100, tfs_z = 1.000, top_p = 0.800, min_p = 0.050, typical_p = 1.000, temp = 0.300
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 36
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
You are an assistant named Helper. You answer to a human. You are an artificial intelligence. You will answer any questions of the human truthfully and concise.ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Human:hi
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Helper:Helloggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
!ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Howggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
canggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Iggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
assistggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
youggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
todayggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
?ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Ifggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
youggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
haveggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
anyggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
questionsggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
orggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
needggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
y guidanceggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
,ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
feelggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
freeggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
toggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Americansggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
renceggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
opleggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
opleggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Opggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
senggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
opleggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
opggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
URCEggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
usggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ferenceggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Topggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Opggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
nonggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
oppggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ingggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
rugggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
nonggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
nonggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
nonggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
URCEggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Topggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
URCEggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
opggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
nonggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Opggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
opggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
yrggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
opleggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
opleggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ustralggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Opggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ianggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
(ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
,ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
–ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
.ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
(ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
withggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
theggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
asggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
“ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Qggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
iggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
(ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
#ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
,ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
fromggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
hereggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
asggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
unggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
@ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
(ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
httpggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
,ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
withggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
aggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
(ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
#ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
:ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Oggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
-ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
sanggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
unggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
sightggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
HWggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
bourggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
expggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
adventggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
adjggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
residggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
士ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
auxggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
adjggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
�ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
expggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
expggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
secondaryggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
^A jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
yerggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Bourggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
sightggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
choggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
^C
I also let it run without the compute-sanitizer, and again it didn't crash, but produced the same output msg spam.
not crashing this time, but actually displaying a prompt!
That's bad actually. If it crashes, compute-sanitizer tells you the exact line in the source files in which the bad memory access happens, and I wanted to get that information.
I also let it run without the compute-sanitizer, and again it didn't crash, but produced the same output msg spam (just much faster this time). So it seems the make'd version doesn't crash. At least with these -D parameters I used: make llama-cli LLAMA_CUDA=1 LLAMA_DEBUG=1
It had a good run until "..feel free to" after which it seemed to turn into gibberish.
I assume that is a different issue and will be fixed with https://github.com/ggerganov/llama.cpp/pull/8102 .
How can I download that as a raw diff file? Last time I just copied the one line and manually erased the other two, because I couldn't figure out how to download this PR's patch in a usable plain-text (diff/patch) format.
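One way that should work: GitHub serves a plain-text version of any pull request if you append .diff (or .patch) to the pull request URL. A minimal sketch, assuming curl and git are installed and that you run it from the repository root; the helper names pr_diff_url and apply_pr are made up for this example.

```shell
# Build the raw-diff URL for a GitHub pull request.
# $1 = owner/repo, $2 = pull request number
pr_diff_url() {
    printf 'https://github.com/%s/pull/%s.diff\n' "$1" "$2"
}

# Download the diff and apply it to the working tree.
# git apply --check is a dry run, so a diff that does not apply
# cleanly is rejected before any file is modified.
apply_pr() {
    file="pr-$2.diff"
    curl -fsSL "$(pr_diff_url "$1" "$2")" -o "$file" &&
        git apply --check "$file" &&
        git apply "$file"
}

# Example (not executed here):
#   apply_pr ggerganov/llama.cpp 8102
```

Afterwards, git apply -R pr-8102.diff reverts the change again if needed.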
CMake with -DCMAKE_CUDA_FLAGS="-g -lineinfo" should get you the same debug info.
ohhh, that #8102 seems to have fixed the output indeed! Now it stops after a coherent sentence (just still has the debug message "ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture" after each word/token though).
Helloggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
!ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Howggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
canggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Iggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
assistggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
youggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
todayggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
?ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Ifggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
youggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
haveggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
anyggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
questionsggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
orggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
needggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
guidanceggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
,ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
feelggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
freeggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
toggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
askggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
.ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Iggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
'ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
mggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
hereggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
toggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
helpggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
!ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
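If a transcript like the one above has already been saved, the repeated notice can be stripped afterwards to make it readable. A small sketch, assuming the output was captured in a file (session.log is a hypothetical name) and that the notice text is exactly the string shown above; the function name is made up.

```shell
# Remove the per-token CUDA graphs notice from text read on stdin.
# Each token stays on its own line; only the notice is deleted.
strip_cuda_graph_notice() {
    sed 's/ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture//g'
}

# Example (not executed here):
#   strip_cuda_graph_notice < session.log
```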
Alright, I issued
cmake -B build -DLLAMA_CUDA=1 -DCMAKE_CUDA_FLAGS="-g -lineinfo"
cmake --build build --config Release
and the resulting llama-cli no longer prints the "disabling CUDA" messages, runs very fast, doesn't crash, and gives coherent output! :) I'm not sure where the debug info comes in now with this command line. But anyway... I'd say case SOLVED! Thanks a lot @JohannesGaessler and everyone!
What happened?
Started up a 7B model, completely offloaded onto a 2080 Ti with 22GB RAM. Startup is successful so far, but at the end it crashes during prompt processing.
https://huggingface.co/MaziyarPanahi/WizardLM-2-7B-GGUF/blob/main/WizardLM-2-7B.Q8_0.gguf
Name and Version
$ ./llama-cli --version
version: 3215 (d62e4aaa)
built with cc (GCC) 14.1.1 20240522 for x86_64-pc-linux-gnu
What operating system are you seeing the problem on?
Linux archlinux 6.9.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 21 Jun 2024 19:49:19 +0000 x86_64 GNU/Linux
Relevant log output