ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Crashes at the end of startup during first prompt processing #8096

Open takosalad opened 4 days ago

takosalad commented 4 days ago

What happened?

Started up a 7B model, completely offloaded onto a 2080 Ti with 22 GB of VRAM. Startup is successful so far, but at the end it crashes during the initial prompt processing.

https://huggingface.co/MaziyarPanahi/WizardLM-2-7B-GGUF/blob/main/WizardLM-2-7B.Q8_0.gguf

Name and Version

$ ./llama-cli --version
version: 3215 (d62e4aaa)
built with cc (GCC) 14.1.1 20240522 for x86_64-pc-linux-gnu

What operating system are you seeing the problem on?

Linux archlinux 6.9.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 21 Jun 2024 19:49:19 +0000 x86_64 GNU/Linux

Relevant log output

Log start
main: build = 3215 (d62e4aaa)
main: built with cc (GCC) 14.1.1 20240522 for x86_64-pc-linux-gnu
main: seed  = 1719233921
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/WizardLM-2-7B-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = D:\GGUF-Quantization-Script\models
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 7
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 7.17 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = D:\GGUF-Quantization-Script\models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   132.81 MiB
llm_load_tensors:      CUDA0 buffer size =  7205.83 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
main: interactive mode on.
Input prefix: 'Human:'
Input suffix: 'Helper:'
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 100, tfs_z = 1.000, top_p = 0.800, min_p = 0.050, typical_p = 1.000, temp = 0.300
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 48

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 You are an assistant named Helper. You answer to a human. You are an artificial intelligence. You will answer any questions of the human truthfully and concise.CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_cuda_op_mul_mat at /home/ai/llama.cpp/ggml-cuda.cu:1606
  cudaGetLastError()
GGML_ASSERT: /home/ai/llama.cpp/ggml-cuda.cu:100: !"CUDA error"
ptrace: Operation not permitted.
No stack.
The program is not being run.
JohannesGaessler commented 4 days ago

Does https://github.com/ggerganov/llama.cpp/commit/52fc8705a0617452df08333e1161838726c322b4 still work correctly?

JohannesGaessler commented 4 days ago

Can you please link the exact model that you were using?

takosalad commented 4 days ago

https://huggingface.co/MaziyarPanahi/WizardLM-2-7B-GGUF/blob/main/WizardLM-2-7B.Q8_0.gguf

By the way, is there a way to compile it for OpenCL instead of CUDA? I only found some Python references when googling for this, but nothing for C. Maybe the problem happens only on CUDA, so I'd like to try OpenCL.

JohannesGaessler commented 4 days ago

OpenCL was removed because there was no one to maintain it. You can try Vulkan.

Maybe the problem happens only on CUDA, so I'd like to try OpenCL.

It's very likely this is a CUDA-specific problem. That's why I would like you to test https://github.com/ggerganov/llama.cpp/commit/52fc8705a0617452df08333e1161838726c322b4 since that is the last commit before I changed something that I suspect to be the problem.
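For reference, testing that specific commit could look roughly like the following (a sketch assuming the Makefile-based CUDA build used later in this thread; on an older commit the CLI target may still be named main rather than llama-cli, and the model flags should match the command that crashes):

    git fetch origin
    git checkout 52fc8705a0617452df08333e1161838726c322b4
    make clean
    make llama-cli LLAMA_CUDA=1 -j
    # then re-run the exact command that triggered the crash, e.g.
    ./llama-cli -m models/WizardLM-2-7B-Q8_0-imat.gguf -ngl 255 -c 8192 -f prompts/helper.txt --interactive-first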

JohannesGaessler commented 4 days ago

I can't reproduce the issue. Can you post the GPU and the command you were using?

takosalad commented 4 days ago

./llama-cli -m models/WizardLM-2-7B.Q8_0.gguf -t 6 --seed -1 -n -1 --keep -1 --color -i --in-prefix "Human:" --in-suffix "Helper" -f prompts/helper.txt -ngl 255 --interactive-first -c 8192 --temp 0.3 --repeat-penalty 1.1 --top_p 0.8 --top_k 100

Sorry, I confused two models. The one I was actually using is WizardLM-2-7B-Q8_0-imat.gguf (I don't have WizardLM-2-7B.Q8_0.gguf on this system). Surprisingly, I cannot find exactly that file on HF anymore, only some other variants that seem to be the same (WizardLM-2, 7B, Q8, imat) but have the "imat" part at a different position in the filename, so I'm not sure whether they are identical to the file I got.

However, I just randomly tried two other models (Llama-3-8B-Instruct-MopeyMule_q8.gguf and Meta-Llama-3-8B-Instruct.Q8_0.gguf) and got exactly the same error on startup, so I don't think it's particular to this one specific model.

Edit: Googling didn't help me. I only found completely unrelated mining forums where this error came up and the advice was that "virtual memory needs to be increased", though in rather different situations. No idea whether that is applicable here; I don't even know which "virtual memory" they were referring to. Another thread suggested lowering the GPU clock. Not sure how to do that either.

Is there any way to test the GPU/memory for being faulty, perhaps?

JohannesGaessler commented 4 days ago

Can you check whether this fix https://github.com/ggerganov/llama.cpp/pull/8100 works?

takosalad commented 4 days ago

Sure. (Just a note: I swapped the graphics card for another one of exactly the same model (2080 Ti 22 GB) to make sure this particular card wasn't broken. Got the same error. I assume not both cards are faulty, so...)

I applied the #8100 diffs to ggml-cuda/mmq.cuh, cleared the build directory, and rebuilt; still the same problem. :/

If it helps, I could add any kind of debug code; just tell me which files to edit and where to put it...

JohannesGaessler commented 4 days ago

Are you using make or CMake?

takosalad commented 4 days ago

cmake .. -DLLAMA_CUDA=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
cmake --build . --config Release

JohannesGaessler commented 4 days ago

I just realized CMake doesn't have an option for the debugging I need, sorry. Maybe I'll try to add it.

Or, if you're up for it, here is how you would do it with make (a command sketch follows the list):

  1. git checkout latest master commit
  2. Build with make and LLAMA_CUDA=1 LLAMA_DEBUG=1
  3. Prepend your command that causes an illegal memory access with compute-sanitizer (found under /opt/cuda/extras/compute-sanitizer on my system).
  4. Post the last few hundred lines of command line output.
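Putting those steps together, the full sequence could look something like this (a sketch; the compute-sanitizer location and the model flags are examples and should match your installation and the exact command that crashes):

    git checkout master
    git pull
    make clean
    make llama-cli LLAMA_CUDA=1 LLAMA_DEBUG=1 -j
    /opt/cuda/extras/compute-sanitizer/compute-sanitizer \
        ./llama-cli -m models/WizardLM-2-7B-Q8_0-imat.gguf -ngl 255 -c 8192 -f prompts/helper.txt --interactive-first
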
takosalad commented 4 days ago

How exactly do I do that? I started with cmake -B build -DLLAMA_CUDA=1 -DLLAMA_DEBUG=1 in the llama.cpp folder; it generates a Makefile, but also threw this warning: CMake Warning: Manually-specified variables were not used by the project: LLAMA_DEBUG

JohannesGaessler commented 4 days ago

In the project root directory:

make llama-cli LLAMA_CUDA=1 LLAMA_DEBUG=1
takosalad commented 4 days ago

OK, I was just confused by the warning that LLAMA_DEBUG has no effect. Running make now...

JohannesGaessler commented 4 days ago

Just so there are no misunderstandings: you are not supposed to run any CMake commands at all. In the llama.cpp root directory there is already a Makefile that does not need to be generated by anything. You are supposed to use that one.

takosalad commented 4 days ago

Oh wow, the debug-mode output is super slow. Card at 99% / 210 W, not crashing this time, but actually displaying a prompt! I entered "hi", and since each line took about 4.5 s and it showed no signs of stopping, I hit Ctrl+C at some point as it might have been endless. It had a good run until "...feel free to", after which it seemed to turn into gibberish.

========= COMPUTE-SANITIZER
Log start
main: build = 3217 (3b099bcd)
main: built with cc (GCC) 14.1.1 20240522 for x86_64-pc-linux-gnu
main: seed  = 1719267005
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/WizardLM-2-7B-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = D:\GGUF-Quantization-Script\models
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 7
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 7.17 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = D:\GGUF-Quantization-Script\models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   132.81 MiB
llm_load_tensors:      CUDA0 buffer size =  7205.83 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 560.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 24.01 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture

system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
main: interactive mode on.
Input prefix: 'Human:'
Input suffix: 'Helper:'
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 100, tfs_z = 1.000, top_p = 0.800, min_p = 0.050, typical_p = 1.000, temp = 0.300
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 36

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 You are an assistant named Helper. You answer to a human. You are an artificial intelligence. You will answer any questions of the human truthfully and concise.ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Human:hi
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Helper:Helloggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
!ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Howggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 canggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Iggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 assistggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 youggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 todayggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
?ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Ifggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 youggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 haveggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 anyggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 questionsggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 orggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 needggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
y guidanceggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
,ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 feelggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 freeggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 toggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Americansggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
renceggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
opleggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
opleggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Opggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
senggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
opleggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
opggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
URCEggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 usggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ferenceggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Topggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Opggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 nonggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
oppggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 ingggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 rugggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 nonggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 nonggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 nonggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
URCEggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Topggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
URCEggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
opggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 nonggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Opggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 opggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
yrggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
opleggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
opleggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ustralggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Opggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ianggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 (ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
,ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 –ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
.ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 (ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 withggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 theggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 asggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 “ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
Qggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
iggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 (ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
#ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
,ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 fromggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 hereggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 asggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 unggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
@ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 (ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
httpggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
,ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 withggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 aggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 (ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
#ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
:ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Oggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
-ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
sanggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
unggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 sightggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
HWggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
bourggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 expggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 adventggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 adjggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 residggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
士ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
auxggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 adjggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
�ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 expggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 expggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 secondaryggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
^A jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
yerggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 jurggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Bourggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 sightggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 choggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
^C

I also let it run without compute-sanitizer, and again it didn't crash, but it produced the same output message spam.

JohannesGaessler commented 4 days ago

not crashing this time, but actually displaying a prompt!

That's bad, actually. If it crashes, compute-sanitizer tells you the exact line in the source files where the bad memory access happens, and I wanted to get that information.

takosalad commented 4 days ago

I also let it run without compute-sanitizer, and again it didn't crash, but it produced the same output message spam (just much faster this time). So it seems the make-built version doesn't crash, at least with the parameters I used: make llama-cli LLAMA_CUDA=1 LLAMA_DEBUG=1

JohannesGaessler commented 4 days ago

It had a good run until "...feel free to", after which it seemed to turn into gibberish.

I assume that is a different issue and will be fixed with https://github.com/ggerganov/llama.cpp/pull/8102.

takosalad commented 4 days ago

How can I download that as a raw diff file? Last time I just copied the one changed line and manually erased the other two, because I couldn't figure out how to download the patch from a PR in a usable plain-text (diff/patch) format.
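One common way to grab a PR as a plain-text patch (general GitHub behavior, not something spelled out in this thread) is to append .diff or .patch to its URL and apply it with git, e.g. for #8102:

    curl -L -o 8102.diff https://github.com/ggerganov/llama.cpp/pull/8102.diff
    git apply --stat 8102.diff   # preview which files it touches
    git apply 8102.diff          # apply it to the working tree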

slaren commented 4 days ago

cmake with -DCMAKE_CUDA_FLAGS="-g -lineinfo" should get you the same debug info.
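Concretely, that amounts to something like the following; with -lineinfo in the CUDA flags, a subsequent compute-sanitizer run can report source file and line numbers (the build directory, binary path, and sanitizer location are assumptions based on common defaults):

    cmake -B build -DLLAMA_CUDA=ON -DCMAKE_CUDA_FLAGS="-g -lineinfo"
    cmake --build build --config Release
    /opt/cuda/extras/compute-sanitizer/compute-sanitizer \
        ./build/bin/llama-cli -m models/WizardLM-2-7B-Q8_0-imat.gguf -ngl 255 -c 8192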

takosalad commented 4 days ago

Ohhh, #8102 indeed seems to have fixed the output! Now it stops after a coherent sentence (it still prints the debug message "ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture" after each word/token, though).

Helloggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
!ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Howggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 canggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Iggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 assistggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 youggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 todayggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
?ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Ifggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 youggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 haveggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 anyggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 questionsggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 orggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 needggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 guidanceggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
,ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 feelggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 freeggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 toggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 askggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
.ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 Iggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
'ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
mggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 hereggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 toggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
 helpggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
!ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
takosalad commented 4 days ago

Alright, I issued

cmake -B build -DLLAMA_CUDA=1 -DCMAKE_CUDA_FLAGS="-g -lineinfo"
cmake --build build --config Release

and the resulting llama-cli has no more "disabling CUDA" messages in it, runs very fast, doesn't crash, and gives coherent output! :) I'm not sure where the debug info ends up now with this command line, but anyway... I'd say case SOLVED! Thanks a lot @JohannesGaessler and everyone!