Also reproducible using the exe from the release page
Does the same issue happen with the server? Or is it just isolated to `main`?

> Does the same issue happen with the server? Or is it just isolated to `main`?

Same error when I run:

```
.\bin\server.exe -m ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf --embedding
```
Let me summarize the investigation so far.

`malloc` 0 size issue: With my OS and PC settings, the embedding computation always tries to first allocate a buffer with size 0 here:

Because of `size += TENSOR_ALIGNMENT`, size is always bigger than 0 for the CPU backend (not sure if this is the correct behaviour though), so the CPU backend can always allocate a buffer successfully.

For the Vulkan backend, `ptr` is still `nullptr` here after `ggml_vk_host_malloc` if size is 0. And because `ggml_vk_host_malloc` runs successfully, it doesn't throw an exception, which causes problems later on:
```cpp
ptr = ggml_vk_host_malloc(&vk_instance.contexts[0], size);
if (ptr == nullptr) {
    throw vk::InitializationFailedError("Null Pointer");
}
```
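To make the failure mode concrete, here is a minimal sketch of the kind of guard this suggests; `vk_host_malloc_checked` is a hypothetical illustrative wrapper, and only `ggml_vk_host_malloc`, `vk_instance.contexts`, and the null check above come from the actual source:

```cpp
// Hypothetical sketch, not the actual llama.cpp code: make the size == 0
// case explicit instead of relying on ggml_vk_host_malloc's return value.
static void * vk_host_malloc_checked(size_t size) {
    if (size == 0) {
        // The CPU backend avoids this case because it does
        // size += TENSOR_ALIGNMENT, so its requests are always > 0.
        throw vk::InitializationFailedError("zero-size host allocation");
    }
    void * ptr = ggml_vk_host_malloc(&vk_instance.contexts[0], size);
    if (ptr == nullptr) {
        throw vk::InitializationFailedError("Null Pointer");
    }
    return ptr;
}
```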
Embedding works for a short prompt:

```
.\bin\embedding.exe -m ..\..\..\models\mxbai-embed-large-v1.Q5_K_M.gguf --log-disable -p "Good weather`nI love cat"
```

But it doesn't work for a longer prompt:

```
.\bin\embedding.exe -m ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf -p "Antibiotics are a type of medication used to treat bacterial infections. They work by either killing the bacteria or preventing them from reproducing, allowing the body's immune system to fight off the infection. Antibiotics are usually taken orally in the form of pills, capsules, or liquid solutions, or sometimes administered intravenously. They are not effective against viral infections, and using them inappropriately can lead to antibiotic resistance.`nI love cat"
```
For a debug build, an MSVC runtime error shows up: "Expression: can't dereference invalidated vector iterator". I'm not sure this error is specific to this case though; I think I have seen it when running the llama.cpp `main` debug build too.
For a release build, here is the error on the terminal:

```
GGML_ASSERT: C:\Users\adriankhl\git\develop\llama.cpp\ggml-vulkan.cpp:1913: src1_type == GGML_TYPE_F32
```
Thank you for the detailed report and the investigation, and apologies for not getting back to you sooner. I'll look into it and let you know what I find.
@Adriankhl Can you check whether #7360 fixes your issues?
@0cc4m hi, if the prompt is long I still get a similar VC++ error in the debug build; in the release build the run finishes, but it gives a NaN vector:

```
main: build = 2923 (8dbde1f0)
main: built with Clang 18.1.4 for
main: seed = 1716037243
llama_model_loader: loaded meta data with 24 key-value pairs and 197 tensors from ..\..\..\models\all-MiniLM-L12-v2.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bert
llama_model_loader: - kv 1: general.name str = all-MiniLM-L12-v2
llama_model_loader: - kv 2: bert.block_count u32 = 12
llama_model_loader: - kv 3: bert.context_length u32 = 512
llama_model_loader: - kv 4: bert.embedding_length u32 = 384
llama_model_loader: - kv 5: bert.feed_forward_length u32 = 1536
llama_model_loader: - kv 6: bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 17
llama_model_loader: - kv 9: bert.attention.causal bool = false
llama_model_loader: - kv 10: bert.pooling_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 14: tokenizer.ggml.model str = bert
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 19: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.cls_token_id u32 = 101
llama_model_loader: - kv 22: tokenizer.ggml.mask_token_id u32 = 103
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 123 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q5_1: 54 tensors
llama_model_loader: - type q8_0: 7 tensors
llama_model_loader: - type q5_K: 6 tensors
llama_model_loader: - type q6_K: 6 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = bert
llm_load_print_meta: vocab type = WPM
llm_load_print_meta: n_vocab = 30522
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 512
llm_load_print_meta: n_embd = 384
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_layer = 12
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_embd_head_k = 32
llm_load_print_meta: n_embd_head_v = 32
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 384
llm_load_print_meta: n_embd_v_gqa = 384
llm_load_print_meta: f_norm_eps = 1.0e-12
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 1536
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 0
llm_load_print_meta: pooling type = 1
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 512
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 33M
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 33.21 M
llm_load_print_meta: model size = 27.96 MiB (7.06 BPW)
llm_load_print_meta: general.name = all-MiniLM-L12-v2
llm_load_print_meta: BOS token = 101 '[CLS]'
llm_load_print_meta: EOS token = 102 '[SEP]'
llm_load_print_meta: UNK token = 100 '[UNK]'
llm_load_print_meta: SEP token = 102 '[SEP]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: CLS token = 101 '[CLS]'
llm_load_print_meta: MASK token = 103 '[MASK]'
llm_load_print_meta: LF token = 0 '[PAD]'
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.09 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/13 layers to GPU
llm_load_tensors: CPU buffer size = 27.96 MiB
..............................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 9.00 MiB
llama_new_context_with_model: KV self size = 9.00 MiB, K (f16): 4.50 MiB, V (f16): 4.50 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.00 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 16.90 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 3.50 MiB
llama_new_context_with_model: graph nodes = 431
llama_new_context_with_model: graph splits = 196
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
batch_decode: n_tokens = 94, n_seq = 2
embedding 0: -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind)
embedding 1: -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind)
cosine similarity matrix:
-nan(ind) -nan(ind)
-nan(ind) -nan(ind)
llama_print_timings: load time = 109.19 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 175.69 ms / 94 tokens ( 1.87 ms per token, 535.03 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 178.55 ms / 95 tokens
```
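For reference, cosine similarity between two embeddings is computed along these lines (a minimal sketch, not the exact code the embedding example uses), which is why all-NaN embeddings necessarily yield an all-NaN similarity matrix:

```cpp
#include <cmath>
#include <cstddef>

// Minimal cosine-similarity sketch: any NaN component in either input
// propagates through the dot product and norms into the final result.
static float cosine_similarity(const float * a, const float * b, size_t n) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (size_t i = 0; i < n; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```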
Another interesting observation: if I set `-ngl` to a large value, like 30, I get a non-NaN vector, but the values look wrong:

```
.\bin\embedding.exe -m ..\..\..\models\all-MiniLM-L12-v2.Q5_K_M.gguf -p "Antibiotics are a type of medication used to treat bacterial infections. They work by either killing the bacteria or preventing them from reproducing, allowing the body's immune system to fight off the infection. Antibiotics are usually taken orally in the form of pills, capsules, or liquid solutions, or sometimes administered intravenously. They are not effective against viral infections, and using them inappropriately can lead to antibiotic resistance.`nI love cat" -ngl 15
```

```
main: build = 2923 (8dbde1f0)
main: built with Clang 18.1.4 for
main: seed = 1716037401
llama_model_loader: loaded meta data with 24 key-value pairs and 197 tensors from ..\..\..\models\all-MiniLM-L12-v2.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bert
llama_model_loader: - kv 1: general.name str = all-MiniLM-L12-v2
llama_model_loader: - kv 2: bert.block_count u32 = 12
llama_model_loader: - kv 3: bert.context_length u32 = 512
llama_model_loader: - kv 4: bert.embedding_length u32 = 384
llama_model_loader: - kv 5: bert.feed_forward_length u32 = 1536
llama_model_loader: - kv 6: bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 17
llama_model_loader: - kv 9: bert.attention.causal bool = false
llama_model_loader: - kv 10: bert.pooling_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 14: tokenizer.ggml.model str = bert
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 19: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.cls_token_id u32 = 101
llama_model_loader: - kv 22: tokenizer.ggml.mask_token_id u32 = 103
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 123 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q5_1: 54 tensors
llama_model_loader: - type q8_0: 7 tensors
llama_model_loader: - type q5_K: 6 tensors
llama_model_loader: - type q6_K: 6 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = bert
llm_load_print_meta: vocab type = WPM
llm_load_print_meta: n_vocab = 30522
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 512
llm_load_print_meta: n_embd = 384
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_layer = 12
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_embd_head_k = 32
llm_load_print_meta: n_embd_head_v = 32
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 384
llm_load_print_meta: n_embd_v_gqa = 384
llm_load_print_meta: f_norm_eps = 1.0e-12
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 1536
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 0
llm_load_print_meta: pooling type = 1
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 512
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 33M
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 33.21 M
llm_load_print_meta: model size = 27.96 MiB (7.06 BPW)
llm_load_print_meta: general.name = all-MiniLM-L12-v2
llm_load_print_meta: BOS token = 101 '[CLS]'
llm_load_print_meta: EOS token = 102 '[SEP]'
llm_load_print_meta: UNK token = 100 '[UNK]'
llm_load_print_meta: SEP token = 102 '[SEP]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: CLS token = 101 '[CLS]'
llm_load_print_meta: MASK token = 103 '[MASK]'
llm_load_print_meta: LF token = 0 '[PAD]'
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.18 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 13/13 layers to GPU
llm_load_tensors: CPU buffer size = 12.25 MiB
llm_load_tensors: Vulkan0 buffer size = 15.71 MiB
..............................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan0 KV buffer size = 9.00 MiB
llama_new_context_with_model: KV self size = 9.00 MiB, K (f16): 4.50 MiB, V (f16): 4.50 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.00 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 17.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 3.50 MiB
llama_new_context_with_model: graph nodes = 431
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
batch_decode: n_tokens = 94, n_seq = 2
embedding 0: -0.012196 -0.004382 -0.068307 -0.037080 -0.011837 -0.000040 0.017563 0.056701 0.020313 0.024539 0.021325 0.052445 -0.015451 0.103782 -0.079035 -0.015415
embedding 1: 0.007400 -0.090975 0.050916 -0.027982 -0.098207 -0.004653 0.129955 0.098967 0.052596 0.070817 -0.015492 -0.080207 0.057286 -0.007871 -0.026050 0.015976
cosine similarity matrix:
1.00 -0.09
-0.09 1.00
llama_print_timings: load time = 177.59 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 77.88 ms / 94 tokens ( 0.83 ms per token, 1207.05 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 81.94 ms / 95 tokens
```
I can see that NaN error; it only happens when no layers are offloaded. Otherwise it seems to work fine.
The NaNs only happen on certain hardware and are caused by some clean-up issue that shows up in the Vulkan validation layer. I'll try to fix that soon.
@Adriankhl I fixed the NaN issue on my end, can you try running #7360 again?
@0cc4m seems to be working fine 🎊 I will do a bit more testing later on.

One additional problem: I have figured out the cause of the debug build error. It happens here: https://github.com/ggerganov/llama.cpp/blob/e23b974f4cf9270d05062d446f406e3ff55d9451/ggml-vulkan.cpp#L625-L646

Because of an MSVC bug, the vector size is detected wrongly in a debug build: even when `ctx->seqs` has size 1, MSVC's iterator debugging feature gets the size wrong and thinks it has size 0, which throws an exception. Can you add `add_definitions(-D_ITERATOR_DEBUG_LEVEL=0)` for MSVC builds in the CMake file to fix this issue?
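For illustration, this is roughly where such a definition could go in the CMake file (my suggestion only; the project does not currently do this, and note that `_ITERATOR_DEBUG_LEVEL` must be consistent across every library linked into the same binary or MSVC will report link-time mismatch errors):

```cmake
# Hypothetical sketch: turn off MSVC's checked iterators so debug builds
# don't hit the iterator size mismatch described above.
if (MSVC)
    add_definitions(-D_ITERATOR_DEBUG_LEVEL=0)
endif()
```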
> @0cc4m seems to be working fine 🎊 I will do a bit more testing later on.
Thank you for checking!
> One additional problem: I have figured out the cause of the debug build error. It happens here:
>
> Because of an MSVC bug, the vector size is detected wrongly in a debug build: even when `ctx->seqs` has size 1, MSVC's iterator debugging feature gets the size wrong and thinks it has size 0, which throws an exception. Can you add `add_definitions(-D_ITERATOR_DEBUG_LEVEL=0)` for MSVC builds in the CMake file to fix this issue?
I can't, sorry. I don't use Windows, so I wouldn't be able to verify that, and it's outside the scope of my PR. If you think it's a useful addition you can open a separate PR for it.
Thanks for this, and it also fixes the gibberish problem I encountered when the generated text exceeds the context size.
System information: Windows 11, AMD 7840U CPU with 780M integrated GPU

Vulkan build:

```
cmake .. -GNinja -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DLLAMA_VULKAN=1 -DLLAMA_NATIVE=OFF -DCMAKE_BUILD_TYPE=Release
```

CPU build:

```
cmake .. -GNinja -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DLLAMA_NATIVE=OFF -DCMAKE_BUILD_TYPE=Release
```

Model: https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/tree/main
I think something is wrong with the support of embedding models.

Observations:

- `main` runs fine on the Vulkan backend with a normal LLM model such as Llama 3
- `embedding` works on the CPU backend with embedding models such as All-MiniLM
- `embedding` "works" on the Vulkan backend with a normal LLM model such as Llama 3, though the output is not meaningful
- `embedding` fails to run on the Vulkan backend with embedding models such as All-MiniLM, producing the following log