SciSharp / LLamaSharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
https://scisharp.github.io/LLamaSharp
MIT License

[BUG]: When GpuLayerCount is more than 5, no data is returned or the speed is very slow #835

Open nazihaghighi opened 1 month ago

nazihaghighi commented 1 month ago

Description

Hi, I am using the latest version of LLamaSharp and my model is the Llama-3 70B GGUF version. When GpuLayerCount is between 0 and 5 I do get an answer, although it is not very fast, but when I increase GpuLayerCount no answer comes back at all, as if no processing is done. The GPU is used for 2-3 seconds and then the CPU works for 3-4 minutes to produce the answer. Is this process normal? Is there a way to increase the speed, or to make it use more system resources so that it answers faster?

My hardware:

- CPU: AMD 1920X, 12 cores
- GPU: RTX 3060, 12 GB
- RAM: 64 GB
- CUDA: 12.5
- .NET Core 7
- The graphics card driver is installed and up to date.

GpuLayerCount = 20 (no data is returned or the speed is very slow): [screenshot]

GpuLayerCount = 0: [screenshot]

Console:

llama_model_loader: loaded meta data with 32 key-value pairs and 723 tensors from C:\model\Llama-3-Swallow-70B-v0.1.i1-Q4_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Llama-3-Swallow-70B-v0.1 llama_model_loader: - kv 2: llama.block_count u32 = 80 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 8192 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 28672 llama_model_loader: - kv 6: llama.attention.head_count u32 = 64 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 15 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001 llama_model_loader: - kv 20: general.quantization_version u32 = 2 llama_model_loader: - kv 21: general.url str = https://huggingface.co/mradermacher/L... llama_model_loader: - kv 22: mradermacher.quantize_version str = 2 llama_model_loader: - kv 23: mradermacher.quantized_by str = mradermacher llama_model_loader: - kv 24: mradermacher.quantized_at str = 2024-07-02T15:25:42+02:00 llama_model_loader: - kv 25: mradermacher.quantized_on str = db1 llama_model_loader: - kv 26: general.source.url str = https://huggingface.co/tokyotech-llm/... llama_model_loader: - kv 27: mradermacher.convert_type str = hf llama_model_loader: - kv 28: quantize.imatrix.file str = Llama-3-Swallow-70B-v0.1-i1-GGUF/imat... llama_model_loader: - kv 29: quantize.imatrix.dataset str = imatrix-training-full-2.txt llama_model_loader: - kv 30: quantize.imatrix.entries_count i32 = 560 llama_model_loader: - kv 31: quantize.imatrix.chunks_count i32 = 277 llama_model_loader: - type f32: 161 tensors llama_model_loader: - type q4_K: 441 tensors llama_model_loader: - type q5_K: 40 tensors llama_model_loader: - type q6_K: 81 tensors llm_load_vocab: special tokens definition check successful ( 256/128256 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 8192 llm_load_print_meta: n_head = 64 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 80 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 28672 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 70B llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 70.55 B llm_load_print_meta: model size = 39.59 GiB (4.82 BPW) llm_load_print_meta: general.name = Llama-3-Swallow-70B-v0.1 llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128001 '<|end_of_text|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes llm_load_tensors: ggml ctx size = 0.37 MiB llm_load_tensors: offloading 0 repeating layers to GPU llm_load_tensors: offloaded 0/81 layers to GPU llm_load_tensors: CPU buffer size = 40543.11 MiB ................................................................................................... llama_new_context_with_model: n_ctx = 1024 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA_Host KV buffer size = 320.00 MiB llama_new_context_with_model: KV self size = 320.00 MiB, K (f16): 160.00 MiB, V (f16): 160.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB llama_new_context_with_model: CUDA0 compute buffer size = 1088.45 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 18.01 MiB llama_new_context_with_model: graph nodes = 2566 llama_new_context_with_model: graph splits = 884

Reproduction Steps

```csharp
using LLama.Common;
using LLama;

string modelPath = @"C:\model\Llama-3-Swallow-70B-v0.1.i1-Q4_K_M.gguf"; // change it to your own model path.

var parameters = new ModelParams(modelPath)
{
    //FlashAttention = true,
    ContextSize = 1024, // The longest length of chat as memory.
    GpuLayerCount = 0   // How many layers to offload to GPU. Please adjust it according to your GPU memory.
};
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// Add chat histories as prompt to tell AI how to act.
var chatHistory = new ChatHistory();
chatHistory.AddMessage(AuthorRole.System, "Transcript of a dialog, where the User interacts with an Assistant named Mehdi. Mehdi is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.");
chatHistory.AddMessage(AuthorRole.User, "Hello, Mehdi.");
chatHistory.AddMessage(AuthorRole.Assistant, "Hello. How may I help you today?");

ChatSession session = new(executor, chatHistory);

InferenceParams inferenceParams = new InferenceParams()
{
    MaxTokens = 256, // No more than 256 tokens should appear in answer. Remove it if antiprompt is enough for control.
    AntiPrompts = new List<string> { "User:" } // Stop generation once antiprompts appear.
};

Console.ForegroundColor = ConsoleColor.Yellow;
Console.Write("The chat session has started.\nUser: ");
Console.ForegroundColor = ConsoleColor.Green;
string userInput = Console.ReadLine() ?? "";

while (userInput != "exit")
{
    await foreach ( // Generate the response streamingly.
        var text
        in session.ChatAsync(
            new ChatHistory.Message(AuthorRole.User, userInput),
            inferenceParams))
    {
        Console.ForegroundColor = ConsoleColor.White;
        Console.Write(text);
    }
    Console.ForegroundColor = ConsoleColor.Green;
    userInput = Console.ReadLine() ?? "";
}

Console.WriteLine("Hello, World!");
```

Environment & Configuration

Known Workarounds

No response

martindevans commented 1 month ago

Do you see the same behaviour if you use the same model and parameters with the precompiled llama.cpp example app?

Just a note about monitoring GPU usage in task manager. The default graph shows 3D rendering load, which is not applicable to llama.cpp. You want to look at the CUDA load:

[screenshot: Taskmgr_2024-07-09_13-40-08]
nazihaghighi commented 1 month ago

I tried another model with the same size and parameters and had the same problem. I also ran llama.cpp directly for this, which had the same problem. I have also attached screenshots of the GPU CUDA load.

GpuLayerCount = 0: [screenshot]

GpuLayerCount = 20 (no data is returned): [screenshot]

mithril52 commented 4 weeks ago

I seem to be getting the same thing on my M3 Max MacBook Pro. I've tried multiple models in different sizes.

Environment & Configuration

- Operating system: macOS Sonoma 14.5
- .NET runtime version: 8.0.204
- LLamaSharp version: 0.15
- CUDA version (if you are using CUDA backend): N/A
- CPU & GPU device: Apple M3 Max, 48 GB RAM

mithril52 commented 4 weeks ago

Some additional notes on my situation: using the current version of llama.cpp directly, it loads all of the layers to the GPU and has much better performance, whereas LLamaSharp only loads 20 of the 33 layers in the model to the GPU.
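For comparison, a minimal LLamaSharp sketch that explicitly requests full offload, assuming the 20/33 split in the log below simply reflects the GpuLayerCount value in effect (the model path is a placeholder; GpuLayerCount is the same ModelParams property used in the reproduction code above):

```csharp
using LLama;
using LLama.Common;

// Placeholder path; substitute your own GGUF file.
string modelPath = "/path/to/dolphin-2.9-llama3-8b.gguf";

var parameters = new ModelParams(modelPath)
{
    ContextSize = 8192,
    // Request every layer on the GPU: 33 covers the 32 repeating layers plus the
    // output layer reported by llama.cpp for this model; any value at or above the
    // layer count should have the same effect.
    GpuLayerCount = 33
};

using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
// llm_load_tensors should then report "offloaded 33/33 layers to GPU".
```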

Output when using LLamaSharp:

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /Users/mithril/.ollama/models/blobs/sha256-747396b74887ed830d46c96443b48bde9a4daab5463f353330feb707ac387300 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = dolphin-2.9-llama3-8b llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 1 llama_model_loader: - kv 11: llama.vocab_size u32 = 128258 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128258] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,128258] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 128256 llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 128001 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... llama_model_loader: - type f32: 65 tensors llama_model_loader: - type f16: 226 tensors llm_load_vocab: missing pre-tokenizer type, using: 'default' llm_load_vocab:
llm_load_vocab: ****
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ****
llm_load_vocab:
llm_load_vocab: special tokens cache size = 258 llm_load_vocab: token to piece cache size = 0.8000 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128258 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = F16 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 14.96 GiB (16.00 BPW) llm_load_print_meta: general.name = dolphin-2.9-llama3-8b llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128256 '<|im_end|>' llm_load_print_meta: PAD token = 128001 '<|end_of_text|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128256 '<|im_end|>' llm_load_print_meta: max token length = 256 llm_load_tensors: ggml ctx size = 0.27 MiB ggml_backend_metal_log_allocated_size: allocated buffer, size = 9738.69 MiB, ( 9738.77 / 36864.00) llm_load_tensors: offloading 20 repeating layers to GPU llm_load_tensors: offloaded 20/33 layers to GPU llm_load_tensors: CPU buffer size = 15317.05 MiB llm_load_tensors: Metal buffer size = 9738.68 MiB ......................................................................................... llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 ggml_metal_init: allocating ggml_metal_init: found device: Apple M3 Max ggml_metal_init: picking default device: Apple M3 Max ggml_metal_init: using embedded metal library ggml_metal_init: GPU name: Apple M3 Max ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009) ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_init: simdgroup reduction support = true ggml_metal_init: simdgroup matrix mul. 
support = true ggml_metal_init: hasUnifiedMemory = true ggml_metal_init: recommendedMaxWorkingSetSize = 38654.71 MB llama_kv_cache_init: CPU KV buffer size = 384.00 MiB llama_kv_cache_init: Metal KV buffer size = 640.00 MiB llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB llama_new_context_with_model: CPU output buffer size = 0.49 MiB llama_new_context_with_model: Metal compute buffer size = 560.00 MiB llama_new_context_with_model: CPU compute buffer size = 560.01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 195 ggml_metal_free: deallocating llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 ggml_metal_init: allocating ggml_metal_init: found device: Apple M3 Max ggml_metal_init: picking default device: Apple M3 Max ggml_metal_init: using embedded metal library ggml_metal_init: GPU name: Apple M3 Max ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009) ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_init: simdgroup reduction support = true ggml_metal_init: simdgroup matrix mul. support = true ggml_metal_init: hasUnifiedMemory = true ggml_metal_init: recommendedMaxWorkingSetSize = 38654.71 MB llama_kv_cache_init: CPU KV buffer size = 384.00 MiB llama_kv_cache_init: Metal KV buffer size = 640.00 MiB llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB llama_new_context_with_model: CPU output buffer size = 0.49 MiB llama_new_context_with_model: Metal compute buffer size = 560.00 MiB llama_new_context_with_model: CPU compute buffer size = 560.01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 195

Output with llama.cpp directly:

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /Users/mithril/.ollama/models/blobs/sha256-747396b74887ed830d46c96443b48bde9a4daab5463f353330feb707ac387300 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = dolphin-2.9-llama3-8b llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 1 llama_model_loader: - kv 11: llama.vocab_size u32 = 128258 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128258] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,128258] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 128256 llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 128001 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... llama_model_loader: - type f32: 65 tensors llama_model_loader: - type f16: 226 tensors llm_load_vocab: missing pre-tokenizer type, using: 'default' llm_load_vocab: llm_load_vocab: **** llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED! 
llm_load_vocab: CONSIDER REGENERATING THE MODEL llm_load_vocab: **** llm_load_vocab: llm_load_vocab: special tokens cache size = 258 llm_load_vocab: token to piece cache size = 0.8000 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128258 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = F16 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 14.96 GiB (16.00 BPW) llm_load_print_meta: general.name = dolphin-2.9-llama3-8b llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128256 '<|im_end|>' llm_load_print_meta: PAD token = 128001 '<|end_of_text|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128256 '<|im_end|>' llm_load_print_meta: max token length = 256 llm_load_tensors: ggml ctx size = 0.27 MiB ggml_backend_metal_log_allocated_size: allocated buffer, size = 14315.05 MiB, (14315.12 / 36864.00) llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 1002.02 MiB llm_load_tensors: Metal buffer size = 14315.04 MiB ......................................................................................... llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 ggml_metal_init: allocating ggml_metal_init: found device: Apple M3 Max ggml_metal_init: picking default device: Apple M3 Max ggml_metal_init: using embedded metal library ggml_metal_init: GPU name: Apple M3 Max ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009) ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_init: simdgroup reduction support = true ggml_metal_init: simdgroup matrix mul. 
support = true ggml_metal_init: hasUnifiedMemory = true ggml_metal_init: recommendedMaxWorkingSetSize = 38654.71 MB llama_kv_cache_init: Metal KV buffer size = 1024.00 MiB llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB llama_new_context_with_model: CPU output buffer size = 0.49 MiB llama_new_context_with_model: Metal compute buffer size = 560.00 MiB llama_new_context_with_model: CPU compute buffer size = 24.01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 2

martindevans commented 4 weeks ago

What ModelParams are you using with LLamaSharp?

aropb commented 3 days ago

After switching from 0.13 to 0.16, I also got a two-fold decrease in performance. Previously, the CUDA GPU was loaded up to 80%; now it is about 30% at most.

martindevans commented 2 days ago

@aropb can you try out the relevant versions of llama.cpp directly (see the bottom of the README for the exact llama.cpp versions)? LLamaSharp just wraps llama.cpp, so unless there's an error in passing parameters across, performance issues are usually upstream issues.

aropb commented 2 days ago

Unfortunately, I can't do it without help.

martindevans commented 2 days ago

The two relevant releases of llama.cpp are:

- 0.13: https://github.com/ggerganov/llama.cpp/releases/tag/b2985
- 0.16: https://github.com/ggerganov/llama.cpp/releases/tag/b3616

Those links include precompiled versions of llama.cpp for most platforms :)

martindevans commented 2 days ago

Let's move performance discussion over to the new issue you opened (#921).