ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: support for nvidia/Llama-3.1-Minitron-4B-Width-Base #9060

Closed TyraVex closed 2 months ago

TyraVex commented 2 months ago

Feature Description

Please support https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base

When I try to run the F16 model with llama-cli or produce an imatrix using llama-imatrix, I get the following crash:

llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
ggml/src/ggml.c:6399: GGML_ASSERT(c->ne[0] >= n_dims / 2) failed
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)

Motivation

This 4B model is pruned and distilled from Llama 3.1 8B. It would be a great alternative to Gemma 2B.

https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/

Possible Implementation

No response

0wwafa commented 2 months ago
llm_load_print_meta: general.name     = Llama 3.1 Minitron 4B Width Base
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =  2920.98 MiB
...........................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   512.00 MiB
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
/home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml.c:6399: GGML_ASSERT(c->ne[0] >= n_dims / 2) failed
./build/bin/llama-cli(+0x1ce98b)[0x5c893458298b]
./build/bin/llama-cli(+0x1d0951)[0x5c8934584951]
./build/bin/llama-cli(+0x200767)[0x5c89345b4767]
./build/bin/llama-cli(+0x164e21)[0x5c8934518e21]
./build/bin/llama-cli(+0xfffa6)[0x5c89344b3fa6]
./build/bin/llama-cli(+0x11c670)[0x5c89344d0670]
./build/bin/llama-cli(+0x7afa6)[0x5c893442efa6]
./build/bin/llama-cli(+0x3ccc6)[0x5c89343f0cc6]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7912ef175d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7912ef175e40]
./build/bin/llama-cli(+0x5fb75)[0x5c8934413b75]
TyraVex commented 2 months ago

Here is another point that could fit under the "Motivation" category for this feature request:

https://www.reddit.com/r/LocalLLaMA/comments/1eu40jg/nvidia_releases_llama31minitron4bwidthbase_the_4b/

The community seems to be quite interested in this new model 🤗

greynewell commented 2 months ago

Also experiencing this. I will post here if I get a working solution.

greynewell commented 2 months ago

Claude has this to say:

This error message suggests an issue with the GGML library, which is commonly used in machine learning projects, particularly with language models like LLaMA. Let's break down the error:

The error occurs in the file ggml.c at line 6399. There's an assertion failure: GGML_ASSERT(c->ne[0] >= n_dims / 2). It happens when trying to run llama-cli with a specific model.

This assertion failure typically indicates that there's a mismatch between the expected dimensions of a tensor and the actual dimensions provided. Specifically, it's checking if the first dimension of a tensor c is at least half the total number of dimensions.

This is curious to me, because when I run llama-cli with --check-tensors like so, it passes:

llama-cli --hf-repo NikolayKozloff/Llama-3.1-Minitron-4B-Width-Base-Q8_0-GGUF --hf-file llama-3.1-minitron-4b-width-base-q8_0.gguf -p "The meaning to life and the universe is" --check-tensors

Full output of the run:

greynewell@grey llama_models % llama-cli --hf-repo NikolayKozloff/Llama-3.1-Minitron-4B-Width-Base-Q8_0-GGUF --hf-file llama-3.1-minitron-4b-width-base-q8_0.gguf -p "The meaning to life and the universe is" --check-tensors
Log start
main: build = 3600 (2fb92678)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0
main: seed  = 1724104165
llama_download_file: previous metadata file found /Users/greynewell/Library/Caches/llama.cpp/llama-3.1-minitron-4b-width-base-q8_0.gguf.json: {"etag":"\"9759adb0f3e2a2e2ce129d9f8b39da0b-301\"","lastModified":"Fri, 16 Aug 2024 21:47:07 GMT","url":"https://huggingface.co/NikolayKozloff/Llama-3.1-Minitron-4B-Width-Base-Q8_0-GGUF/resolve/main/llama-3.1-minitron-4b-width-base-q8_0.gguf"}
llama_model_loader: loaded meta data with 31 key-value pairs and 292 tensors from /Users/greynewell/Library/Caches/llama.cpp/llama-3.1-minitron-4b-width-base-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.1 Minitron 4B Width Base
llama_model_loader: - kv   3:                       general.organization str              = Nvidia
llama_model_loader: - kv   4:                           general.finetune str              = Width-Base
llama_model_loader: - kv   5:                           general.basename str              = Llama-3.1-Minitron
llama_model_loader: - kv   6:                         general.size_label str              = 4B
llama_model_loader: - kv   7:                            general.license str              = other
llama_model_loader: - kv   8:                       general.license.name str              = nvidia-open-model-license
llama_model_loader: - kv   9:                       general.license.link str              = https://developer.download.nvidia.com...
llama_model_loader: - kv  10:                          llama.block_count u32              = 32
llama_model_loader: - kv  11:                       llama.context_length u32              = 131072
llama_model_loader: - kv  12:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  13:                  llama.feed_forward_length u32              = 9216
llama_model_loader: - kv  14:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  15:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  16:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  17:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  18:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  19:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  20:                          general.file_type u32              = 7
llama_model_loader: - kv  21:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  22:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 9216
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 4.51 B
llm_load_print_meta: model size       = 4.47 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = Llama 3.1 Minitron 4B Width Base
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.27 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size =  4573.25 MiB, ( 4573.33 / 27648.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   399.23 MiB
llm_load_tensors:      Metal buffer size =  4573.23 MiB
.....................................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Pro
ggml_metal_init: picking default device: Apple M3 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M3 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 28991.03 MB
llama_kv_cache_init:      Metal KV buffer size = 16384.00 MiB
llama_new_context_with_model: KV self size  = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
/tmp/llama.cpp-20240817-5170-91jvdr/ggml/src/ggml.c:6399: GGML_ASSERT(c->ne[0] >= n_dims / 2) failed
zsh: abort      llama-cli --hf-repo NikolayKozloff/Llama-3.1-Minitron-4B-Width-Base-Q8_0-GGUF

Overall, I find this confusing, as the HF repo seems to indicate this should work:

[Screenshot of the Hugging Face repo page, 2024-08-19]
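Reading that assertion against the metadata in the full log above hints at where a mismatch could come from: the rope dimension count is 128, while n_embd / n_head is 3072 / 32 = 96. Below is a minimal arithmetic sketch of that reading (an assumption about the cause, not a confirmed trace through the code):

```python
# Values taken from the llama_model_loader metadata in the log above.
n_embd = 3072        # llama.embedding_length
n_head = 32          # llama.attention.head_count
head_dim = 128       # llama.attention.key_length / value_length
n_dims = 128         # llama.rope.dimension_count (n_rot in the print_meta output)

print(n_embd // n_head, "vs explicit head_dim", head_dim)   # 96 vs 128

# Hypothetical sizing: if the rope frequency-factors tensor were built from
# n_embd // n_head (96-wide heads) instead of the explicit 128-wide head_dim,
# it would only hold 48 entries.
freq_factors_len = (n_embd // n_head) // 2                   # 48

# The failing check is GGML_ASSERT(c->ne[0] >= n_dims / 2):
print(freq_factors_len >= n_dims // 2)                       # 48 >= 64 -> False
```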
nyxkrage commented 2 months ago

This seems to be an issue with the Llama 3.1 rope scaling and the custom head_dim being specified together. You can make a working quant by removing "rope_scaling" and changing "max_position_embeddings" to 8192 in config.json, then quantizing it to GGUF.
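For anyone wanting to try that workaround, here is a minimal sketch of the config.json edit, assuming a local copy of the HF checkpoint in ./Llama-3.1-Minitron-4B-Width-Base (the path and the conversion/quantization commands in the trailing comments are illustrative, adjust to your setup):

```python
import json
from pathlib import Path

# Hypothetical local path to the downloaded HF checkpoint.
cfg_path = Path("Llama-3.1-Minitron-4B-Width-Base/config.json")
cfg = json.loads(cfg_path.read_text())

# Apply the workaround described above: drop the llama3.1 rope scaling
# block and cap the context length at 8192.
cfg.pop("rope_scaling", None)
cfg["max_position_embeddings"] = 8192

cfg_path.write_text(json.dumps(cfg, indent=2))

# Then convert and quantize as usual, for example:
#   python convert_hf_to_gguf.py Llama-3.1-Minitron-4B-Width-Base --outtype f16
#   ./llama-quantize <f16 gguf> <output gguf> Q8_0
```

As discussed further down the thread, this caps the usable context at 8192 until the rope-scaling handling is fixed upstream.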

ghost commented 2 months ago

Sucks... 4B looks like a good candidate for a full fine-tune (FFT).

0wwafa commented 2 months ago

Llama-3.1-Minitron-4B-Width-Base looks amazing.

It's really a pity that llama.cpp does not support it.

raininja commented 2 months ago

This seems to be an issue with the Llama 3.1 rope scaling and the custom head_dim being specified together. You can make a working quant by removing "rope_scaling" and changing "max_position_embeddings" to 8192 in config.json, then quantizing it to GGUF.

This is the fix; it works.

devlux76 commented 2 months ago

Ok but doesn't this fix limit the context from 32k to 8k?

ThomasBaruzier commented 2 months ago

Not if https://github.com/ggerganov/llama.cpp/pull/9141 is merged @devlux76