ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Embedding fails to run on vulkan backend #7130

Closed Adriankhl closed 4 months ago

Adriankhl commented 4 months ago

System information: Windows 11, AMD 7840U CPU with Radeon 780M integrated GPU

Vulkan build: cmake .. -GNinja -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DLLAMA_VULKAN=1 -DLLAMA_NATIVE=OFF -DCMAKE_BUILD_TYPE=Release

CPU build: cmake .. -GNinja -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DLLAMA_NATIVE=OFF -DCMAKE_BUILD_TYPE=Release

Model: https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/tree/main

I think something is wrong with the support of embedding models.

Observations:

  1. main runs fine on vulkan backend, with a normal LLM model such as llama 3
  2. embedding works on CPU backend with embedding models such as All-MiniLM
  3. embedding "works" on vulkan backend with a normal LLM model such as llama 3, though the output is not meaningful
  4. embedding fails to run on Vulkan backend with embedding models such as All-MiniLM, producing the following log:
    main: build = 2794 (628b2991)
    main: built with Clang 18.1.4 for
    main: seed  = 1715115389
    llama_model_loader: loaded meta data with 24 key-value pairs and 101 tensors from ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf (version GGUF V3 (latest))
    llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
    llama_model_loader: - kv   0:                       general.architecture str              = bert
    llama_model_loader: - kv   1:                               general.name str              = all-MiniLM-L6-v2
    llama_model_loader: - kv   2:                           bert.block_count u32              = 6
    llama_model_loader: - kv   3:                        bert.context_length u32              = 512
    llama_model_loader: - kv   4:                      bert.embedding_length u32              = 384
    llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 1536
    llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 12
    llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
    llama_model_loader: - kv   8:                          general.file_type u32              = 17
    llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
    llama_model_loader: - kv  10:                          bert.pooling_type u32              = 1
    llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
    llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
    llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
    llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
    llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
    llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
    llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
    llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
    llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
    llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
    llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
    llama_model_loader: - kv  23:               general.quantization_version u32              = 2
    llama_model_loader: - type  f32:   63 tensors
    llama_model_loader: - type  f16:    1 tensors
    llama_model_loader: - type q5_1:   28 tensors
    llama_model_loader: - type q8_0:    3 tensors
    llama_model_loader: - type q5_K:    4 tensors
    llama_model_loader: - type q6_K:    2 tensors
    llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
    llm_load_print_meta: format           = GGUF V3 (latest)
    llm_load_print_meta: arch             = bert
    llm_load_print_meta: vocab type       = WPM
    llm_load_print_meta: n_vocab          = 30522
    llm_load_print_meta: n_merges         = 0
    llm_load_print_meta: n_ctx_train      = 512
    llm_load_print_meta: n_embd           = 384
    llm_load_print_meta: n_head           = 12
    llm_load_print_meta: n_head_kv        = 12
    llm_load_print_meta: n_layer          = 6
    llm_load_print_meta: n_rot            = 32
    llm_load_print_meta: n_embd_head_k    = 32
    llm_load_print_meta: n_embd_head_v    = 32
    llm_load_print_meta: n_gqa            = 1
    llm_load_print_meta: n_embd_k_gqa     = 384
    llm_load_print_meta: n_embd_v_gqa     = 384
    llm_load_print_meta: f_norm_eps       = 1.0e-12
    llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
    llm_load_print_meta: f_clamp_kqv      = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_print_meta: f_logit_scale    = 0.0e+00
    llm_load_print_meta: n_ff             = 1536
    llm_load_print_meta: n_expert         = 0
    llm_load_print_meta: n_expert_used    = 0
    llm_load_print_meta: causal attn      = 0
    llm_load_print_meta: pooling type     = 1
    llm_load_print_meta: rope type        = 2
    llm_load_print_meta: rope scaling     = linear
    llm_load_print_meta: freq_base_train  = 10000.0
    llm_load_print_meta: freq_scale_train = 1
    llm_load_print_meta: n_yarn_orig_ctx  = 512
    llm_load_print_meta: rope_finetuned   = unknown
    llm_load_print_meta: ssm_d_conv       = 0
    llm_load_print_meta: ssm_d_inner      = 0
    llm_load_print_meta: ssm_d_state      = 0
    llm_load_print_meta: ssm_dt_rank      = 0
    llm_load_print_meta: model type       = 22M
    llm_load_print_meta: model ftype      = Q5_K - Medium
    llm_load_print_meta: model params     = 22.57 M
    llm_load_print_meta: model size       = 19.99 MiB (7.43 BPW)
    llm_load_print_meta: general.name     = all-MiniLM-L6-v2
    llm_load_print_meta: BOS token        = 101 '[CLS]'
    llm_load_print_meta: EOS token        = 102 '[SEP]'
    llm_load_print_meta: UNK token        = 100 '[UNK]'
    llm_load_print_meta: SEP token        = 102 '[SEP]'
    llm_load_print_meta: PAD token        = 0 '[PAD]'
    llm_load_print_meta: CLS token        = 101 '[CLS]'
    llm_load_print_meta: MASK token       = 103 '[MASK]'
    llm_load_print_meta: LF token         = 0 '[PAD]'
    ggml_vulkan: Found 1 Vulkan devices:
    Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64
    llm_load_tensors: ggml ctx size =    0.05 MiB
    llm_load_tensors: offloading 0 repeating layers to GPU
    llm_load_tensors: offloaded 0/7 layers to GPU
    llm_load_tensors:        CPU buffer size =    19.99 MiB
    ............................
    llama_new_context_with_model: n_ctx      = 512
    llama_new_context_with_model: n_batch    = 2048
    llama_new_context_with_model: n_ubatch   = 2048
    llama_new_context_with_model: flash_attn = 0
    llama_new_context_with_model: freq_base  = 10000.0
    llama_new_context_with_model: freq_scale = 1
    llama_kv_cache_init:        CPU KV buffer size =     4.50 MiB
    llama_new_context_with_model: KV self size  =    4.50 MiB, K (f16):    2.25 MiB, V (f16):    2.25 MiB
    WARNING: failed to allocate 0.00 MB of pinned memory
    GGML_ASSERT: C:\Users\adriankhl\git\learn\llama.cpp\ggml-backend.c:100: base != NULL && "backend buffer base cannot be NULL"

Adriankhl commented 4 months ago

https://github.com/ggerganov/llama.cpp/blob/b6aa6702030320a3d5fbc2508307af0d7c947e40/llama.cpp#L11229

It happens right here

Adriankhl commented 4 months ago

Also reproducible using the exe from the release page

teleprint-me commented 4 months ago

Does the same issue happen with the server? Or is it just isolated to main?

Adriankhl commented 4 months ago

Does the same issue happen with the server? Or is it just isolated to main?

Same error when I run .\bin\server.exe -m ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf --embedding

Adriankhl commented 4 months ago

Let me summarize the investigation so far

  1. malloc 0 size issue:

With my OS and PC setup, the embedding computation always tries to allocate a buffer of size 0 first, here:

https://github.com/ggerganov/llama.cpp/blob/b6aa6702030320a3d5fbc2508307af0d7c947e40/llama.cpp#L11222

Because of size += TENSOR_ALIGNMENT, the size is always greater than 0 for the CPU backend (not sure if this is the correct behaviour though). So the CPU backend can always allocate a buffer successfully.

https://github.com/ggerganov/llama.cpp/blob/b228aba91ac2cd9eb90e9d423ba1d0d20e0117e2/ggml-backend.c#L625-L631

For the Vulkan backend, ptr is still nullptr here after ggml_vk_host_malloc if size is 0.

https://github.com/ggerganov/llama.cpp/blob/b228aba91ac2cd9eb90e9d423ba1d0d20e0117e2/ggml-vulkan.cpp#L6031-L6043

And because ggml_vk_host_malloc returns without an error, no exception is thrown, so the null pointer is passed on and causes problems later.
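
To make the difference between the two paths concrete, here is a minimal sketch of the behaviour described above. It is not the llama.cpp code: `cpu_buffer_alloc`, `host_alloc`, `buffer_base_is_valid` and the alignment value of 32 are hypothetical stand-ins chosen for illustration.

```cpp
#include <cstddef>
#include <cstdlib>

// Assumed alignment value, for illustration only.
constexpr std::size_t kAlignment = 32;

// Hypothetical stand-in for the CPU path: the request is padded by the
// alignment first (like `size += TENSOR_ALIGNMENT`), so malloc never sees 0.
void * cpu_buffer_alloc(std::size_t size) {
    size += kAlignment;
    return std::malloc(size);
}

// Hypothetical stand-in for a host allocator that treats a zero-size request
// as a successful no-op: it returns nullptr and signals no error.
void * host_alloc(std::size_t size) {
    if (size == 0) {
        return nullptr;
    }
    return std::malloc(size);
}

// Mirrors the failing assertion: a buffer whose base is NULL is rejected.
bool buffer_base_is_valid(const void * base) {
    return base != nullptr;
}
```

With a request of size 0, `cpu_buffer_alloc` yields a valid base while `host_alloc` yields a null one, which is the condition the `base != NULL` assertion in the first log rejects.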

  2. I can "fix" the issue above by throwing an exception so that it falls back to the CPU buffer:
        ptr = ggml_vk_host_malloc(&vk_instance.contexts[0], size);
        if (ptr == nullptr) {
            throw vk::InitializationFailedError("Null Pointer");
        }
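
    The throw-and-fall-back pattern used by this workaround can be sketched in isolation as follows. This is only an illustration under assumptions: `pinned_alloc`, `pinned_alloc_or_throw`, `alloc_host_buffer`, `HostBuffer` and the padding value are hypothetical names, and `std::runtime_error` stands in for the Vulkan exception type used in the snippet above.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Assumed padding for the CPU fallback, for illustration only.
constexpr std::size_t kCpuPadding = 32;

// Records where the memory came from so the caller can tell which path ran.
struct HostBuffer {
    void *      base;
    std::string kind; // "pinned" or "cpu"
};

// Hypothetical pinned allocator: returns nullptr for a zero-size request.
void * pinned_alloc(std::size_t size) {
    return size == 0 ? nullptr : std::malloc(size);
}

// Turns the silent null result into an exception, as in the workaround.
void * pinned_alloc_or_throw(std::size_t size) {
    void * ptr = pinned_alloc(size);
    if (ptr == nullptr) {
        throw std::runtime_error("pinned allocation returned a null pointer");
    }
    return ptr;
}

// Tries the pinned path first and falls back to a padded CPU allocation
// when the pinned path reports failure.
HostBuffer alloc_host_buffer(std::size_t size) {
    try {
        return { pinned_alloc_or_throw(size), "pinned" };
    } catch (const std::runtime_error &) {
        return { std::malloc(size + kCpuPadding), "cpu" };
    }
}
```

    In this sketch a zero-size request ends up on the "cpu" path with a non-null base, while any non-zero request stays on the "pinned" path.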

    Embedding works for a short prompt

    .\bin\embedding.exe -m ..\..\..\models\mxbai-embed-large-v1.Q5_K_M.gguf --log-disable -p "Good weather`nI love cat"
Log 1

main: build = 2864 (cbf75894)
main: built with Clang 18.1.4 for
main: seed = 1715575791
llama_model_loader: loaded meta data with 24 key-value pairs and 101 tensors from ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bert
llama_model_loader: - kv 1: general.name str = all-MiniLM-L6-v2
llama_model_loader: - kv 2: bert.block_count u32 = 6
llama_model_loader: - kv 3: bert.context_length u32 = 512
llama_model_loader: - kv 4: bert.embedding_length u32 = 384
llama_model_loader: - kv 5: bert.feed_forward_length u32 = 1536
llama_model_loader: - kv 6: bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 17
llama_model_loader: - kv 9: bert.attention.causal bool = false
llama_model_loader: - kv 10: bert.pooling_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 14: tokenizer.ggml.model str = bert
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 19: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.cls_token_id u32 = 101
llama_model_loader: - kv 22: tokenizer.ggml.mask_token_id u32 = 103
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 63 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q5_1: 28 tensors
llama_model_loader: - type q8_0: 3 tensors
llama_model_loader: - type q5_K: 4 tensors
llama_model_loader: - type q6_K: 2 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = bert
llm_load_print_meta: vocab type = WPM
llm_load_print_meta: n_vocab = 30522
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 512
llm_load_print_meta: n_embd = 384
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_layer = 6
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_embd_head_k = 32
llm_load_print_meta: n_embd_head_v = 32
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 384
llm_load_print_meta: n_embd_v_gqa = 384
llm_load_print_meta: f_norm_eps = 1.0e-12
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 1536
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 0
llm_load_print_meta: pooling type = 1
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 512
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 22M
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 22.57 M
llm_load_print_meta: model size = 19.99 MiB (7.43 BPW)
llm_load_print_meta: general.name = all-MiniLM-L6-v2
llm_load_print_meta: BOS token = 101 '[CLS]'
llm_load_print_meta: EOS token = 102 '[SEP]'
llm_load_print_meta: UNK token = 100 '[UNK]'
llm_load_print_meta: SEP token = 102 '[SEP]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: CLS token = 101 '[CLS]'
llm_load_print_meta: MASK token = 103 '[MASK]'
llm_load_print_meta: LF token = 0 '[PAD]'
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.05 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/7 layers to GPU
llm_load_tensors: CPU buffer size = 19.99 MiB
............................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 4.50 MiB
llama_new_context_with_model: KV self size = 4.50 MiB, K (f16): 2.25 MiB, V (f16): 2.25 MiB
WARNING: failed to allocate 0.00 MB of pinned memory
ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: Null Pointer: ErrorInitializationFailed
llama_new_context_with_model: CPU output buffer size = 0.00 MiB
ggml_gallocr_reserve_n: reallocating Vulkan0 buffer from size 0.00 MiB to 16.86 MiB
ggml_gallocr_reserve_n: reallocating Vulkan_Host buffer from size 0.00 MiB to 3.50 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 16.86 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 3.50 MiB
llama_new_context_with_model: graph nodes = 221
llama_new_context_with_model: graph splits = 100
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
batch_decode: n_tokens = 9, n_seq = 2
ggml_gallocr_needs_realloc: node node_0 is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
embedding 0: -0.078424 0.061774 0.122099 0.071252 -0.013703 -0.013969 0.057376 -0.043510 -0.059822 0.018061 0.005385 -0.043010 0.038214 -0.014732 0.027173 -0.001804
embedding 1: 0.005265 -0.016769 0.052540 -0.024372 -0.062103 -0.001837 0.098836 0.026607 0.044697 0.020890 -0.045096 -0.030395 -0.035944 0.049458 0.016966 -0.003935
cosine similarity matrix:
1.00 0.22
0.22 1.00
llama_print_timings: load time = 104.76 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 42.91 ms / 9 tokens ( 4.77 ms per token, 209.73 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 817.37 ms / 10 tokens

But it doesn't work for a longer prompt

.\bin\embedding.exe -m ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf -p "Antibiotics are a type of medication used to treat bacterial infections. They work by either killing the bacteria or preventing them from reproducing, allowing the body's immune system to fight off the infection. Antibiotics are usually taken orally in the form of pills, capsules, or liquid solutions, or sometimes administered intravenously. They are not effective against viral infections, and using them inappropriately can lead to antibiotic resistance.`nI love cat"

For the debug build, an MSVC runtime error shows up: "Expression: can't dereference invalidated vector iterator". This error may not be specific to this case, though; I think I have also seen it when running the llama.cpp main debug build.

Log 2

main: build = 2864 (cbf75894)
main: built with Clang 18.1.4 for
main: seed = 1715576013
llama_model_loader: loaded meta data with 24 key-value pairs and 101 tensors from ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bert
llama_model_loader: - kv 1: general.name str = all-MiniLM-L6-v2
llama_model_loader: - kv 2: bert.block_count u32 = 6
llama_model_loader: - kv 3: bert.context_length u32 = 512
llama_model_loader: - kv 4: bert.embedding_length u32 = 384
llama_model_loader: - kv 5: bert.feed_forward_length u32 = 1536
llama_model_loader: - kv 6: bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 17
llama_model_loader: - kv 9: bert.attention.causal bool = false
llama_model_loader: - kv 10: bert.pooling_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 14: tokenizer.ggml.model str = bert
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 19: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.cls_token_id u32 = 101
llama_model_loader: - kv 22: tokenizer.ggml.mask_token_id u32 = 103
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 63 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q5_1: 28 tensors
llama_model_loader: - type q8_0: 3 tensors
llama_model_loader: - type q5_K: 4 tensors
llama_model_loader: - type q6_K: 2 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = bert
llm_load_print_meta: vocab type = WPM
llm_load_print_meta: n_vocab = 30522
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 512
llm_load_print_meta: n_embd = 384
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_layer = 6
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_embd_head_k = 32
llm_load_print_meta: n_embd_head_v = 32
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 384
llm_load_print_meta: n_embd_v_gqa = 384
llm_load_print_meta: f_norm_eps = 1.0e-12
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 1536
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 0
llm_load_print_meta: pooling type = 1
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 512
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 22M
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 22.57 M
llm_load_print_meta: model size = 19.99 MiB (7.43 BPW)
llm_load_print_meta: general.name = all-MiniLM-L6-v2
llm_load_print_meta: BOS token = 101 '[CLS]'
llm_load_print_meta: EOS token = 102 '[SEP]'
llm_load_print_meta: UNK token = 100 '[UNK]'
llm_load_print_meta: SEP token = 102 '[SEP]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: CLS token = 101 '[CLS]'
llm_load_print_meta: MASK token = 103 '[MASK]'
llm_load_print_meta: LF token = 0 '[PAD]'
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.05 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/7 layers to GPU
llm_load_tensors: CPU buffer size = 19.99 MiB
............................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 4.50 MiB
llama_new_context_with_model: KV self size = 4.50 MiB, K (f16): 2.25 MiB, V (f16): 2.25 MiB
WARNING: failed to allocate 0.00 MB of pinned memory
ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: Null Pointer: ErrorInitializationFailed
llama_new_context_with_model: CPU output buffer size = 0.00 MiB
ggml_gallocr_reserve_n: reallocating Vulkan0 buffer from size 0.00 MiB to 16.86 MiB
ggml_gallocr_reserve_n: reallocating Vulkan_Host buffer from size 0.00 MiB to 3.50 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 16.86 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 3.50 MiB
llama_new_context_with_model: graph nodes = 221
llama_new_context_with_model: graph splits = 100
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
batch_decode: n_tokens = 94, n_seq = 2
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving

For release build, here is the error on the terminal: GGML_ASSERT: C:\Users\adriankhl\git\develop\llama.cpp\ggml-vulkan.cpp:1913: src1_type == GGML_TYPE_F32

Log 3

main: build = 2864 (cbf75894)
main: built with Clang 18.1.4 for
main: seed = 1715576579
llama_model_loader: loaded meta data with 24 key-value pairs and 101 tensors from ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bert
llama_model_loader: - kv 1: general.name str = all-MiniLM-L6-v2
llama_model_loader: - kv 2: bert.block_count u32 = 6
llama_model_loader: - kv 3: bert.context_length u32 = 512
llama_model_loader: - kv 4: bert.embedding_length u32 = 384
llama_model_loader: - kv 5: bert.feed_forward_length u32 = 1536
llama_model_loader: - kv 6: bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 17
llama_model_loader: - kv 9: bert.attention.causal bool = false
llama_model_loader: - kv 10: bert.pooling_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 14: tokenizer.ggml.model str = bert
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 19: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.cls_token_id u32 = 101
llama_model_loader: - kv 22: tokenizer.ggml.mask_token_id u32 = 103
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 63 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q5_1: 28 tensors
llama_model_loader: - type q8_0: 3 tensors
llama_model_loader: - type q5_K: 4 tensors
llama_model_loader: - type q6_K: 2 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = bert
llm_load_print_meta: vocab type = WPM
llm_load_print_meta: n_vocab = 30522
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 512
llm_load_print_meta: n_embd = 384
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_layer = 6
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_embd_head_k = 32
llm_load_print_meta: n_embd_head_v = 32
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 384
llm_load_print_meta: n_embd_v_gqa = 384
llm_load_print_meta: f_norm_eps = 1.0e-12
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 1536
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 0
llm_load_print_meta: pooling type = 1
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 512
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 22M
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 22.57 M
llm_load_print_meta: model size = 19.99 MiB (7.43 BPW)
llm_load_print_meta: general.name = all-MiniLM-L6-v2
llm_load_print_meta: BOS token = 101 '[CLS]'
llm_load_print_meta: EOS token = 102 '[SEP]'
llm_load_print_meta: UNK token = 100 '[UNK]'
llm_load_print_meta: SEP token = 102 '[SEP]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: CLS token = 101 '[CLS]'
llm_load_print_meta: MASK token = 103 '[MASK]'
llm_load_print_meta: LF token = 0 '[PAD]'
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.05 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/7 layers to GPU
llm_load_tensors: CPU buffer size = 19.99 MiB
............................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 4.50 MiB
llama_new_context_with_model: KV self size = 4.50 MiB, K (f16): 2.25 MiB, V (f16): 2.25 MiB
WARNING: failed to allocate 0.00 MB of pinned memory
ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: Null Pointer: ErrorInitializationFailed
llama_new_context_with_model: CPU output buffer size = 0.00 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 16.86 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 3.50 MiB
llama_new_context_with_model: graph nodes = 221
llama_new_context_with_model: graph splits = 100
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
batch_decode: n_tokens = 94, n_seq = 2
GGML_ASSERT: C:\Users\adriankhl\git\develop\llama.cpp\ggml-vulkan.cpp:1913: src1_type == GGML_TYPE_F32

0cc4m commented 4 months ago

Thank you for the detailed report and the investigation and apologies for not getting back to you sooner. I'll look into it and let you know what I find.

0cc4m commented 4 months ago

@Adriankhl Can you check whether #7360 fixes your issues?

Adriankhl commented 4 months ago

@0cc4m Hi, if the prompt is long, I still get a similar VC++ error in the debug build. In the release build the run finishes, but it gives a NaN vector:

main: build = 2923 (8dbde1f0)
main: built with Clang 18.1.4 for
main: seed  = 1716037243
llama_model_loader: loaded meta data with 24 key-value pairs and 197 tensors from ..\..\..\models\all-MiniLM-L12-v2.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = all-MiniLM-L12-v2
llama_model_loader: - kv   2:                           bert.block_count u32              = 12
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 384
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 1536
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 17
llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
llama_model_loader: - kv  10:                          bert.pooling_type u32              = 1
llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  123 tensors
llama_model_loader: - type  f16:    1 tensors
llama_model_loader: - type q5_1:   54 tensors
llama_model_loader: - type q8_0:    7 tensors
llama_model_loader: - type q5_K:    6 tensors
llama_model_loader: - type q6_K:    6 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 512
llm_load_print_meta: n_embd           = 384
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_embd_head_k    = 32
llm_load_print_meta: n_embd_head_v    = 32
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 384
llm_load_print_meta: n_embd_v_gqa     = 384
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 1536
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 512
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 33M
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 33.21 M
llm_load_print_meta: model size       = 27.96 MiB (7.06 BPW)
llm_load_print_meta: general.name     = all-MiniLM-L12-v2
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size =    0.09 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/13 layers to GPU
llm_load_tensors:        CPU buffer size =    27.96 MiB
..............................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     9.00 MiB
llama_new_context_with_model: KV self size  =    9.00 MiB, K (f16):    4.50 MiB, V (f16):    4.50 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.00 MiB
llama_new_context_with_model:    Vulkan0 compute buffer size =    16.90 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =     3.50 MiB
llama_new_context_with_model: graph nodes  = 431
llama_new_context_with_model: graph splits = 196

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
batch_decode: n_tokens = 94, n_seq = 2

embedding 0: -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind)
embedding 1: -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind)

cosine similarity matrix:

-nan(ind) -nan(ind)
-nan(ind) -nan(ind)

llama_print_timings:        load time =     109.19 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =     175.69 ms /    94 tokens (    1.87 ms per token,   535.03 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =     178.55 ms /    95 tokens
Adriankhl commented 4 months ago

Another interesting observation: if I set -ngl to a large value, like 30, I get a non-NaN vector, but the values look wrong:

.\bin\embedding.exe -m ..\..\..\models\all-MiniLM-L12-v2.Q5_K_M.gguf -p "Antibiotics are a type of medication used to treat bacterial infections. They work by either killing the bacteria or preventing them from reproducing, allowing the body's immune system to fight off the infection. Antibiotics are usually taken orally in the form of pills, capsules, or liquid solutions, or sometimes administered intravenously. They are not effective against viral infections, and using them inappropriately can lead to antibiotic resistance.`nI love cat" -ngl 15
main: build = 2923 (8dbde1f0)
main: built with Clang 18.1.4 for
main: seed  = 1716037401
llama_model_loader: loaded meta data with 24 key-value pairs and 197 tensors from ..\..\..\models\all-MiniLM-L12-v2.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = all-MiniLM-L12-v2
llama_model_loader: - kv   2:                           bert.block_count u32              = 12
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 384
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 1536
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 17
llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
llama_model_loader: - kv  10:                          bert.pooling_type u32              = 1
llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  123 tensors
llama_model_loader: - type  f16:    1 tensors
llama_model_loader: - type q5_1:   54 tensors
llama_model_loader: - type q8_0:    7 tensors
llama_model_loader: - type q5_K:    6 tensors
llama_model_loader: - type q6_K:    6 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 512
llm_load_print_meta: n_embd           = 384
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_embd_head_k    = 32
llm_load_print_meta: n_embd_head_v    = 32
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 384
llm_load_print_meta: n_embd_v_gqa     = 384
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 1536
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 512
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 33M
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 33.21 M
llm_load_print_meta: model size       = 27.96 MiB (7.06 BPW)
llm_load_print_meta: general.name     = all-MiniLM-L12-v2
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size =    0.18 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 13/13 layers to GPU
llm_load_tensors:        CPU buffer size =    12.25 MiB
llm_load_tensors:    Vulkan0 buffer size =    15.71 MiB
..............................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:    Vulkan0 KV buffer size =     9.00 MiB
llama_new_context_with_model: KV self size  =    9.00 MiB, K (f16):    4.50 MiB, V (f16):    4.50 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.00 MiB
llama_new_context_with_model:    Vulkan0 compute buffer size =    17.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =     3.50 MiB
llama_new_context_with_model: graph nodes  = 431
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
batch_decode: n_tokens = 94, n_seq = 2

embedding 0: -0.012196 -0.004382 -0.068307 -0.037080 -0.011837 -0.000040  0.017563  0.056701  0.020313  0.024539  0.021325  0.052445 -0.015451  0.103782 -0.079035 -0.015415
embedding 1:  0.007400 -0.090975  0.050916 -0.027982 -0.098207 -0.004653  0.129955  0.098967  0.052596  0.070817 -0.015492 -0.080207  0.057286 -0.007871 -0.026050  0.015976

cosine similarity matrix:

  1.00  -0.09
 -0.09   1.00

llama_print_timings:        load time =     177.59 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =      77.88 ms /    94 tokens (    0.83 ms per token,  1207.05 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =      81.94 ms /    95 tokens
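
For readers unfamiliar with the "cosine similarity matrix" in the output, here is a minimal sketch of the computation for one pair of vectors. The function name `cosine_similarity` and the zero-norm convention are assumptions made for illustration; this is not necessarily how the embedding example implements it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative helper (hypothetical name, not taken from llama.cpp).
// Cosine similarity between two equal-length embedding vectors.
// Returns 0.0f when either vector has zero norm (a convention chosen here).
// NaN components propagate, so NaN embeddings give a NaN similarity.
float cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    assert(a.size() == b.size());
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot    += static_cast<double>(a[i]) * b[i];
        norm_a += static_cast<double>(a[i]) * a[i];
        norm_b += static_cast<double>(b[i]) * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) {
        return 0.0f;
    }
    return static_cast<float>(dot / (std::sqrt(norm_a) * std::sqrt(norm_b)));
}
```

Because NaN propagates through the sums, the all-NaN embeddings from the earlier run produce an all-NaN matrix, which matches the `-nan(ind)` entries printed there.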
0cc4m commented 4 months ago

I can see that NaN error. It only happens when no layers are offloaded; otherwise it seems to work fine.

The NaNs only happen on certain hardware and are caused by some clean-up issue that shows up in the Vulkan validation layer. I'll try to fix that soon.
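
Until a fix lands, calling code could guard against this symptom by checking the returned embedding before using it. A minimal sketch, assuming the embedding has already been copied into a `std::vector<float>`; the helper name `first_non_finite` is hypothetical and not part of llama.cpp:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of llama.cpp): returns the index of the first
// NaN or infinite component of an embedding, or -1 if every component is finite.
long first_non_finite(const std::vector<float> & embd) {
    for (std::size_t i = 0; i < embd.size(); ++i) {
        if (!std::isfinite(embd[i])) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

A result other than -1 means the vector should not be stored or compared, since any similarity computed from it will also be NaN.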

0cc4m commented 4 months ago

@Adriankhl I fixed the NaN issue on my end, can you try running #7360 again?

Adriankhl commented 4 months ago

@0cc4m Seems to be working fine 🎊 I will do a bit more testing later on.

One additional problem: I have figured out the cause of the debug build error. It happens here: https://github.com/ggerganov/llama.cpp/blob/e23b974f4cf9270d05062d446f406e3ff55d9451/ggml-vulkan.cpp#L625-L646

Because of an MSVC bug, the vector size is detected incorrectly in a debug build. Even when ctx->seqs has size 1, MSVC's iterator debugging feature thinks it has size 0 and throws an exception. Can you add add_definitions(-D_ITERATOR_DEBUG_LEVEL=0) for MSVC builds in the CMake file to fix this issue?
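
To confirm whether such a definition actually reaches a given translation unit, a small check like the following could be compiled into the build. This is only a sketch: the function name `iterator_debug_level` is made up for illustration. `_ITERATOR_DEBUG_LEVEL` is set by the MSVC standard library headers, so with other standard libraries the macro is normally absent.

```cpp
#include <string>

// Hypothetical diagnostic (not part of llama.cpp): reports the value of
// _ITERATOR_DEBUG_LEVEL seen by this translation unit, or "undefined" when
// the standard library in use does not define the macro.
std::string iterator_debug_level() {
#ifdef _ITERATOR_DEBUG_LEVEL
    return std::to_string(_ITERATOR_DEBUG_LEVEL);
#else
    return "undefined";
#endif
}
```

Note that with MSVC every object file and static library linked into one binary has to agree on this value; a mismatch is reported as a link error, so the definition needs to be applied to the whole build rather than to a single file.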

0cc4m commented 4 months ago

> @0cc4m seems working fine🎊I will do a bit more testing later on.

Thank you for checking!

> One additional problem, I have figured out the cause of the debug build error, it happens here:
>
> https://github.com/ggerganov/llama.cpp/blob/e23b974f4cf9270d05062d446f406e3ff55d9451/ggml-vulkan.cpp#L625-L646
>
> Because of the MSVC bug, the vector size is detected wrongly in a debug build, even when ctx->seqs is of size 1, the iterator debugging feature of MSVC gets the size wrong and thought it is of size 0, which throw an exception. Can you add add_definitions(-D_ITERATOR_DEBUG_LEVEL=0) for MSVC build in the cmake file to fix this issue?

I can't, sorry. I don't use Windows, so I wouldn't be able to verify that, and it's outside the scope of my PR. If you think it's a useful addition you can open a separate PR for it.

Adriankhl commented 4 months ago

Thanks for this, and it also fixes the gibberish problem I encountered when the generated text exceeds the context size.