SciSharp / LLamaSharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
https://scisharp.github.io/LLamaSharp
MIT License
2.69k stars 347 forks source link

Have an error then try run example "KernelMemorySaveAndLoad" #768

Closed cm4ker closed 5 months ago

cm4ker commented 5 months ago

Description

Hi! I'm trying to run the KernelMemorySaveAndLoad example and faced with issue. This is a console log output:

C:/projects/blazor-server-side/Test/SemanticKernelTest/bin/Debug/net8.0/SemanticKernelTest.exe

This program uses the Microsoft.KernelMemory package to ingest documents
and store the embeddings as local files so they can be quickly recalled
when this application is launched again.

Please input your model path (or ENTER for default):
(C:\test\llm\all-MiniLM-L12-v2.Q8_0.gguf): C:\test\llm\all-MiniLM-L12-v2.Q8_0.gguf
Kernel memory folder: C:\projects\blazor-server-side\Test\SemanticKernelTest\bin\Debug\net8.0\storage-KernelMemorySaveAndLoad
llama_model_loader: loaded meta data with 24 key-value pairs and 197 tensors from C:\test\llm\all-MiniLM-L12-v2.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = all-MiniLM-L12-v2
llama_model_loader: - kv   2:                           bert.block_count u32              = 12
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 384
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 1536
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 7
llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
llama_model_loader: - kv  10:                          bert.pooling_type u32              = 1
llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  123 tensors
llama_model_loader: - type  f16:    1 tensors
llama_model_loader: - type q8_0:   73 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 512
llm_load_print_meta: n_embd           = 384
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_embd_head_k    = 32
llm_load_print_meta: n_embd_head_v    = 32
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 384
llm_load_print_meta: n_embd_v_gqa     = 384
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 1536
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 512
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 33M
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 33.21 M
llm_load_print_meta: model size       = 34.00 MiB (8.59 BPW)
llm_load_print_meta: general.name     = all-MiniLM-L12-v2
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
llm_load_tensors: ggml ctx size =    0.09 MiB
llm_load_tensors:        CPU buffer size =    34.00 MiB
.................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    36.00 MiB
llama_new_context_with_model: KV self size  =   36.00 MiB, K (f16):   18.00 MiB, V (f16):   18.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    17.00 MiB
llama_new_context_with_model: graph nodes  = 431
llama_new_context_with_model: graph splits = 1
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    36.00 MiB
llama_new_context_with_model: KV self size  =   36.00 MiB, K (f16):   18.00 MiB, V (f16):   18.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    17.00 MiB
llama_new_context_with_model: graph nodes  = 431
llama_new_context_with_model: graph splits = 1
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    36.00 MiB
llama_new_context_with_model: KV self size  =   36.00 MiB, K (f16):   18.00 MiB, V (f16):   18.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    17.00 MiB
llama_new_context_with_model: graph nodes  = 431
llama_new_context_with_model: graph splits = 1

Existing kernel memory was not found.
Documents will be analyzed (slow) and information saved to disk.
Analysis will not be required the next time this program is run.
Press ENTER to proceed...

Importing 1 of 2: C:\projects\blazor-server-side\Test\SemanticKernelTest\bin\Debug\net8.0\Assets\sample-SK-Readme.pdf
Completed in 00:00:01.7413797

Importing 2 of 2: C:\projects\blazor-server-side\Test\SemanticKernelTest\bin\Debug\net8.0\Assets\sample-KM-Readme.pdf
Completed in 00:00:00.4631057

Question: What formats does KM support?
Generating answer...
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    36.00 MiB
llama_new_context_with_model: KV self size  =   36.00 MiB, K (f16):   18.00 MiB, V (f16):   18.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    17.00 MiB
llama_new_context_with_model: graph nodes  = 431
llama_new_context_with_model: graph splits = 1
llama_get_logits_ith: invalid logits id 324, reason: no logits
Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at LLama.LLamaContext.ApplyPenalty(Int32 logits_i, IEnumerable`1 lastTokens, Dictionary`2 logitBias, Int32 repeatLastTokensCount, Single repeatPenalty, Single alphaFrequency, Single alphaPresence, Boolean penalizeNL)
   at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+MoveNext()
   at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()   
   at Microsoft.KernelMemory.Search.SearchClient.AskAsync(String index, String question, ICollection`1 filters, Double minRelevance, CancellationToken cancellationToken)
   at Microsoft.KernelMemory.Search.SearchClient.AskAsync(String index, String question, ICollection`1 filters, Double minRelevance, CancellationToken cancellationToken)
   at SemanticKernelTest.KernelMemorySaveAndLoad.ShowAnswer(IKernelMemory memory, String question) in C:\projects\blazor-server-side\Test\SemanticKernelTest\KernelMemorySaveAndLoad.cs:line 150
   at SemanticKernelTest.KernelMemorySaveAndLoad.AskSingleQuestion(IKernelMemory memory, String question) in C:\projects\blazor-server-side\Test\SemanticKernelTest\KernelMemorySaveAndLoad.cs:line 109      
   at SemanticKernelTest.KernelMemorySaveAndLoad.Run() in C:\projects\blazor-server-side\Test\SemanticKernelTest\KernelMemorySaveAndLoad.cs:line 57
   at SemanticKernelTest.Program.Main() in C:\projects\blazor-server-side\Test\SemanticKernelTest\Program.cs:line 8
   at SemanticKernelTest.Program.<Main>()

Process finished with exit code -532,462,766.

Installed packages:

        <PackageReference Include="LLamaSharp" Version="0.12.0"/>
        <PackageReference Include="LLamaSharp.Backend.Cpu" Version="0.12.0"/>
        <PackageReference Include="LLamaSharp.kernel-memory" Version="0.12.0"/>
        <PackageReference Include="LLamaSharp.semantic-kernel" Version="0.12.0"/>
        <PackageReference Include="Microsoft.KernelMemory.Core" Version="0.61.240524.1"/>
        <PackageReference Include="Microsoft.SemanticKernel" Version="1.13.0"/>
        <PackageReference Include="Microsoft.SemanticKernel.Plugins.Memory" Version="1.6.2-alpha"/>
        <PackageReference Include="Spectre.Console" Version="0.49.1"/>
cm4ker commented 5 months ago

Solved by change model. I think in model I have used was a embedding layer but not layer for generate text