SciSharp / LLamaSharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
https://scisharp.github.io/LLamaSharp

Method not found: 'Double Microsoft.KernelMemory.AI.TextGenerationOptions.get_TopP()'. #832

Open KanonRim opened 2 months ago

KanonRim commented 2 months ago

Description

I get the error: Method not found: 'Double Microsoft.KernelMemory.AI.TextGenerationOptions.get_TopP()'.

Reproduction Steps

Repeating the example at https://github.com/SciSharp/LLamaSharp/blob/master/LLama.Examples/Examples/KernelMemory.cs, except replacing the document import with plain text.
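
A rough sketch of what I'm running (the model path and ingested text are shortened placeholders; the rest follows the linked example):

```csharp
// Build Kernel Memory with LLamaSharp defaults, import plain text instead
// of a document, then ask a question. Paths and text are placeholders.
using LLamaSharp.KernelMemory;
using Microsoft.KernelMemory;

var config = new LLamaSharpConfig(@"Z:\download\Llama-3-Instruct-8B-SPPO-Iter3-Q3_K_L.gguf");

var memory = new KernelMemoryBuilder()
    .WithLLamaSharpDefaults(config)
    .Build<MemoryServerless>();

await memory.ImportTextAsync("Some text to ingest...");

// This is the call that dies with the MissingMethodException.
MemoryAnswer answer = await memory.AskAsync("What happened to the tomato that disappeared on the International Space Station?");
Console.WriteLine(answer.Result);
```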

Console output:

This program uses the Microsoft.KernelMemory package to ingest documents and answer questions about them in an interactive chat prompt.

llama_model_loader: loaded meta data with 27 key-value pairs and 291 tensors from Z:\download\Llama-3-Instruct-8B-SPPO-Iter3-Q3_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = llama
llama_model_loader: - kv   1: general.name str = Llama-3-Instruct-8B-SPPO-Iter3
llama_model_loader: - kv   2: llama.block_count u32 = 32
llama_model_loader: - kv   3: llama.context_length u32 = 8192
llama_model_loader: - kv   4: llama.embedding_length u32 = 4096
llama_model_loader: - kv   5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv   6: llama.attention.head_count u32 = 32
llama_model_loader: - kv   7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv   8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv   9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv  10: general.file_type u32 = 13
llama_model_loader: - kv  11: llama.vocab_size u32 = 128256
llama_model_loader: - kv  12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv  13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv  14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv  15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv  19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv  20: tokenizer.ggml.padding_token_id u32 = 128009
llama_model_loader: - kv  21: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  22: general.quantization_version u32 = 2
llama_model_loader: - kv  23: quantize.imatrix.file str = /models/Llama-3-Instruct-8B-SPPO-Iter...
llama_model_loader: - kv  24: quantize.imatrix.dataset str = /training_data/calibration_datav3.txt
llama_model_loader: - kv  25: quantize.imatrix.entries_count i32 = 224
llama_model_loader: - kv  26: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type  f32: 65 tensors
llama_model_loader: - type q3_K: 129 tensors
llama_model_loader: - type q5_K: 96 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q3_K - Large
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.02 GiB (4.30 BPW)
llm_load_print_meta: general.name = Llama-3-Instruct-8B-SPPO-Iter3
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: PAD token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 20 repeating layers to GPU
llm_load_tensors: offloaded 20/33 layers to GPU
llm_load_tensors: CPU buffer size = 4114.27 MiB
llm_load_tensors: CUDA0 buffer size = 2180.00 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 96.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 136
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 96.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 136
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 96.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 669.48 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 136

Question: What happened to the tomato disappeared on the International Space Station?

Generating answer...

Unhandled exception. System.MissingMethodException: Method not found: 'Double Microsoft.KernelMemory.AI.TextGenerationOptions.get_TopP()'.
   at LLamaSharp.KernelMemory.LlamaSharpTextGenerator.OptionsToParams(TextGenerationOptions options, InferenceParams defaultParams)
   at LLamaSharp.KernelMemory.LlamaSharpTextGenerator.GenerateTextAsync(String prompt, TextGenerationOptions options, CancellationToken cancellationToken)
   at Microsoft.KernelMemory.Search.SearchClient.GenerateAnswer(String question, String facts, IContext context, CancellationToken token)
   at Microsoft.KernelMemory.Search.SearchClient.AskAsync(String index, String question, ICollection`1 filters, Double minRelevance, IContext context, CancellationToken cancellationToken)
   at LLama.Examples.Examples.KernelMemory.AnswerQuestion(IKernelMemory memory, String question) in Z:\rep\MyAIHelper\MyMemory\KernelMemory.cs:line 105
   at LLama.Examples.Examples.KernelMemory.Run() in Z:\rep\MyAIHelper\MyMemory\KernelMemory.cs:line 58
   at Program.<Main>$(String[] args) in Z:\rep\MyAIHelper\MyMemory\Program.cs:line 3
   at Program.<Main>(String[] args)

Environment & Configuration

  • Operating system: Windows 10
  • .NET runtime version: .NET 8
  • LLamaSharp version: 0.13.0
  • CUDA version (if you are using cuda backend): 12
  • CPU & GPU device: i5-4670, GTX 1660 6 GB

Known Workarounds

No response

SignalRT commented 1 month ago

In this commit:

service/Abstractions/AI/TextGenerationOptions.cs

It seems that the property was renamed in Kernel Memory.

We need to update the Kernel Memory package version and fix the issue.
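
If the new name is what the commit suggests (NucleusSampling — treat that as my reading, not confirmed), the mapping in LlamaSharpTextGenerator.OptionsToParams would change roughly like this:

```csharp
// Hedged sketch, not the shipped LLamaSharp source: map Kernel Memory's
// TextGenerationOptions onto LLamaSharp's InferenceParams after the rename.
private static InferenceParams OptionsToParams(TextGenerationOptions options, InferenceParams? defaultParams)
{
    var p = defaultParams ?? new InferenceParams();
    p.Temperature = (float)options.Temperature;
    // Previously options.TopP; its getter (get_TopP) is what goes missing at runtime.
    p.TopP = (float)options.NucleusSampling;
    if (options.MaxTokens.HasValue)
        p.MaxTokens = options.MaxTokens.Value;
    return p;
}
```

Note the MissingMethodException only shows up at runtime because LLamaSharp.KernelMemory was compiled against the older Abstractions assembly that still had get_TopP.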

ksnyder2024 commented 1 month ago

As a temporary workaround, can I revert to an older Kernel Memory package version? If so, what is the package name and version to revert to? I'm blocked by this issue when I call 'MemoryAnswer answer = await memory.AskAsync(question);'.

martindevans commented 1 month ago

I updated the version of KernelMemory last night in https://github.com/SciSharp/LLamaSharp/pull/841 from 0.34.240313.1 to 0.66.240709.1. So you could probably go all the way back to 0.34.240313.1 to get the old naming. Alternatively you can pull the current master branch to use the updated source.
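
If you pin via NuGet, that downgrade is just a package reference change (assuming Microsoft.KernelMemory.Core is the package your project consumes):

```xml
<!-- Last Kernel Memory version with the old TopP property, per the note above -->
<PackageReference Include="Microsoft.KernelMemory.Core" Version="0.34.240313.1" />
```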

ksnyder2024 commented 1 month ago

0.34.240313.1 worked like a charm, thanks! I will update with the next build.

ksnyder2024 commented 1 month ago

Updated to LLamaSharp 0.14.0 and Microsoft.KernelMemory.Core 0.68.240716.1, and it failed with a different error:

Unhandled exception. System.TypeLoadException: Method 'GetTokens' in type 'LLamaSharp.KernelMemory.LLamaSharpTextEmbeddingGenerator' from assembly 'LLamaSharp.KernelMemory, Version=0.14.0.0, Culture=neutral, PublicKeyToken=null' does not have an implementation.
   at LLamaSharp.KernelMemory.BuilderExtensions.WithLLamaSharpDefaults(IKernelMemoryBuilder builder, LLamaSharpConfig config, LLamaWeights weights, LLamaContext context)

martindevans commented 1 month ago

That's not the correct version, see here: https://github.com/SciSharp/LLamaSharp/blob/master/LLama.KernelMemory/LLamaSharp.KernelMemory.csproj#L30
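
For context, that TypeLoadException is the same version skew in the other direction: the newer Kernel Memory release adds a GetTokens member to its tokenizer interface, which the LLamaSharp.KernelMemory 0.14.0 binary (built against 0.66.240709.1) does not implement. A sketch of the shape involved (the exact signature is an assumption inferred from the exception message):

```csharp
// Assumed shape of Kernel Memory's tokenizer contract. When the runtime
// assembly declares GetTokens but LLamaSharpTextEmbeddingGenerator was
// compiled without it, loading the type throws TypeLoadException.
public interface ITextTokenizer
{
    int CountTokens(string text);
    IReadOnlyList<string> GetTokens(string text); // newer addition, missing from the 0.14.0 build
}
```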

ksnyder2024 commented 1 month ago

I used the NuGet packages LLamaSharp v0.14.0 and Microsoft.KernelMemory.Core v0.66.240709.1 (which includes Abstractions) and received an unhandled exception. Apologies if I'm missing something.

Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at LLama.LLamaContext.ApplyPenalty(Int32 logits_i, IEnumerable`1 lastTokens, Dictionary`2 logitBias, Int32 repeatLastTokensCount, Single repeatPenalty, Single alphaFrequency, Single alphaPresence, Boolean penalizeNL)
   at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+MoveNext()
   at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
   at Microsoft.KernelMemory.Search.SearchClient.AskAsync(String index, String question, ICollection`1 filters, Double minRelevance, IContext context, CancellationToken cancellationToken)
   at Microsoft.KernelMemory.Search.SearchClient.AskAsync(String index, String question, ICollection`1 filters, Double minRelevance, IContext context, CancellationToken cancellationToken)

martindevans commented 1 month ago

That looks like it might be a bug in LLamaSharp. I'm not very familiar with the KernelMemory stuff, so I'm not sure if it's a bug in the way the KernelMemory integration is using LLamaSharp or if it's a bug in the core. Can you step into it and debug it any further to see what's null?
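
If it helps to narrow it down, a standalone StatelessExecutor call (a sketch; the model path and prompt are placeholders) would show whether the NullReferenceException reproduces in core LLamaSharp with Kernel Memory out of the loop:

```csharp
// Run StatelessExecutor.InferAsync directly, bypassing the Kernel Memory
// integration, to see whether ApplyPenalty still throws.
using LLama;
using LLama.Common;

var modelParams = new ModelParams(@"Z:\download\Llama-3-Instruct-8B-SPPO-Iter3-Q3_K_L.gguf")
{
    ContextSize = 2048,
    GpuLayerCount = 20
};

using var weights = LLamaWeights.LoadFromFile(modelParams);
var executor = new StatelessExecutor(weights, modelParams);

// If this also dies in ApplyPenalty, the bug is in the core; if it works,
// suspect the InferenceParams that the integration builds.
await foreach (var token in executor.InferAsync("Hello", new InferenceParams { MaxTokens = 32 }))
{
    Console.Write(token);
}
```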

ksnyder2024 commented 1 month ago

Hi Martin,

Apologies, it will be a little while before I get the chance to step into it (it wasn't straightforward to do, and I have some personal issues to resolve first).