ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: Proper Llama 3.1 Support in llama.cpp #8650

Closed Vaibhavs10 closed 1 month ago

Vaibhavs10 commented 1 month ago

Feature Description

Llama 3.1 was just released and it is a significant leg up from the previous series of models: https://huggingface.co/blog/llama31

Whilst the overall architecture is the same, it requires some modelling updates, primarily around RoPE scaling: https://github.com/huggingface/transformers/blob/bc2adb0112b6677b0dfb4105c74570a0f92183eb/src/transformers/modeling_rope_utils.py#L298

It'd be great to add support for those so that the generations are more coherent and make sense.

Motivation

Note: Without the modelling changes, the generations might look coherent, but they are far from great and fall short of the true potential of the model!

Possible Implementation

Here's the corresponding transformers implementation: https://github.com/huggingface/transformers/blob/bc2adb0112b6677b0dfb4105c74570a0f92183eb/src/transformers/modeling_rope_utils.py#L298
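
For context, the change boils down to rescaling the RoPE inverse frequencies per wavelength band. Below is a minimal, self-contained Python sketch of what the linked transformers code computes, using the rope_theta and rope_scaling values shipped in the released Llama 3.1 configs (treat it as an illustration, not the llama.cpp implementation):

import math

# Values from the Llama 3.1 config.json (rope_theta and the rope_scaling block).
rope_theta = 500000.0
head_dim = 128
factor = 8.0                 # rope_scaling["factor"]
low_freq_factor = 1.0        # rope_scaling["low_freq_factor"]
high_freq_factor = 4.0       # rope_scaling["high_freq_factor"]
old_context_len = 8192       # rope_scaling["original_max_position_embeddings"]

# Default RoPE inverse frequencies: theta^(-2i/d) for i = 0 .. d/2 - 1.
inv_freq = [rope_theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

low_freq_wavelen = old_context_len / low_freq_factor    # 8192
high_freq_wavelen = old_context_len / high_freq_factor  # 2048

scaled = []
for freq in inv_freq:
    wavelen = 2 * math.pi / freq
    if wavelen < high_freq_wavelen:      # high-frequency band: keep as-is
        scaled.append(freq)
    elif wavelen > low_freq_wavelen:     # low-frequency band: divide by factor
        scaled.append(freq / factor)
    else:                                # in between: smooth interpolation
        smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
        scaled.append((1 - smooth) * freq / factor + smooth * freq)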

m18coppola commented 1 month ago

@bartowski1182 No need to update transformers. Instead, they updated each of the llama-3.1 models' configs.

they both behave identically

Except for the fact that smaug-bpe disables the automatic prefixing of <|begin_of_text|> to all contexts. Although whether the absence of this token from the context affects performance is certainly up for debate, I prefer to stick to Meta's official suggestions.
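
For what it's worth, the reference Hugging Face tokenizer for Llama 3 / 3.1 does prepend <|begin_of_text|> (token id 128000) by default; a quick way to check (a sketch, assuming you have access to the gated meta-llama repo):

from transformers import AutoTokenizer

# Requires having accepted the Llama 3.1 license on Hugging Face and being logged in.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
ids = tok("Hello world").input_ids
print(ids[0], tok.convert_ids_to_tokens(ids[:1]))  # expected: 128000 ['<|begin_of_text|>']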

maziyarpanahi commented 1 month ago

@bartowski1182 No need to update transformers. Instead, they updated each of the llama-3.1 models' configs.

Do we have to re-do the GGUFs and all the quants, or is this something we can edit in the metadata? (It's still not easy to edit GGUF metadata on Hugging Face like a text file, but that's way better than redoing all the quants.)

m18coppola commented 1 month ago

@maziyarpanahi

@bartowski1182 No need to update transformers. Instead, they updated each of the llama-3.1 models' configs.

Do we have to re-do the GGUFs and all the quants, or is this something we can edit in the metadata? (It's still not easy to edit GGUF metadata on Hugging Face like a text file, but that's way better than redoing all the quants.)

 ./gguf-py/scripts/gguf_new_metadata.py --pre-tokenizer "llama-bpe" ./models/Meta-Llama-3.1-8B-Instruct/ggml-model-Q8_0.gguf ./models/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf

This can be used to avoid the hassle of reconverting.
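
To sanity-check the result, the gguf Python package from gguf-py can read the field back; here's a rough sketch (the exact parts/data layout of GGUFReader string fields is an assumption on my part):

from gguf import GGUFReader

reader = GGUFReader("./models/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf")
field = reader.fields["tokenizer.ggml.pre"]
# For a string KV pair, field.data holds the index of the value payload inside
# field.parts (assumption based on how the gguf-py reader lays fields out).
print(field.parts[field.data[-1]].tobytes().decode("utf-8"))  # expected: llama-bpe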

vlbosch commented 1 month ago

@maziyarpanahi

@bartowski1182 No need to update transformers. Instead, they updated each of the llama-3.1 models' configs.

Do we have to re-do the GGUFs and all the quants, or is this something we can edit in the metadata? (It's still not easy to edit GGUF metadata on Hugging Face like a text file, but that's way better than redoing all the quants.)

 ./gguf-py/scripts/gguf_new_metadata.py --pre-tokenizer "llama-bpe" ./models/Meta-Llama-3.1-8B-Instruct/ggml-model-Q8_0.gguf ./models/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf

This can be used to avoid the hassle of reconverting.

Will this also take into account the updated model configs?

m18coppola commented 1 month ago

@vlbosch I just finished double-checking. It's nearly identical to the original LLaMA-3 tokenizer.json/tokenizer_config.json files. This should be good to go.

bartowski1182 commented 1 month ago

@m18coppola where does smaug-bpe disable that? The only place I see it do anything is in llama.cpp, where it sets the regex to be identical to the Llama 3 regex:

https://github.com/ggerganov/llama.cpp/blob/f19bf99c015d3d745143e8bb4f056e0ea015ad40/src/llama-vocab.cpp#L358

and yet you're right 🤔 when changing the vocab to llama-bpe, it does start prepending <|begin_of_text|>...

m18coppola commented 1 month ago

@bartowski1182

bartowski1182 commented 1 month ago

ah dammit, I was searching by the LLAMA_VOCAB_PRE_TYPE_LLAMA3 of course.. thanks for finding that

matteoserva commented 1 month ago

llama3.1 is still giving wrong answers to my prompts even after using the latest llama.cpp. I also tried using the tokenizer from the HF repository and sending the tokenized string to llama-server. The problem is probably the missing RoPE implementation.

Here is my prompt: llama3.1-prompt.txt

The launch command is: ./llama-cli -ngl 99 -m Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -f /tmp/llama3.1-prompt.txt --temp 0.01 -sp --override-kv tokenizer.ggml.add_bos_token=bool:false

The llama.cpp implementation is not even able to process the prompt correctly and gives this answer:

{
  "answers": {
    "alice": 7,
    "bob": 7,
    "charlie": null
  },
  "reasoning": "Majority voting was used to decide the answer, ignoring math and logic errors.",
  "who is right": "Both Alice and Bob are right.",
  "answer": 7
}

While the implementation by @foldl gives the correct answer:

{
  "answers": {
    "alice": 7,
    "bob": 8,
    "charlie": 7
  },
  "reasoning": "Majority voting is used to decide the answer, ignoring math and logic errors.",
  "who is right": "Charlie and Alice",
  "answer": 7
}

m18coppola commented 1 month ago

@matteoserva Try again but add -c 8192 to the launch command. Unfortunately this will limit the context window to 8K, but in theory it should mitigate the RoPE issues until the new scaling is implemented in llama.cpp.

matteoserva commented 1 month ago

@matteoserva Try again but add -c 8192 to the launch command. Unfortunately this will limit the context window to 8K, but in theory it should mitigate the RoPE issues until the new scaling is implemented in llama.cpp.

@m18coppola I tried with -c 8192 too but I'm still getting a wrong answer. Here is the updated command: ./llama-cli -ngl 99 -m Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -f /tmp/llama3.1-prompt.txt --temp 0.01 -sp --override-kv tokenizer.ggml.add_bos_token=bool:false -c 8192

MoonRide303 commented 1 month ago

@matteoserva llama-server with the new UI gives the proper answer, too:

image

Launched using the llama-server -v -ngl 99 -m Meta-Llama-3.1-8B-Instruct-Q6_K.gguf -c 8192 command, configured as below:

image

matteoserva commented 1 month ago

@MoonRide303

I updated my answer. See below

Could you give me more info on the version of llama.cpp and llama3.1 quants you are using? I used the latest llama.cpp from master and tested both a model that I quantized myself and the model downloaded from here: https://huggingface.co/lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf

The launch command is the following: ./llama-server -v -ngl 99 -m Meta-Llama-3.1-8B-Instruct-Q6_K.gguf -c 8192

I tested your parameters in the new UI using both chat and completion mode. In chat mode I copy-pasted the content of the user request. In completion mode I pasted the entire raw prompt.

The answer I'm getting doesn't match yours:

image

edit

With -ngl 0 the model gives the correct answer. Given that new information, I suspect the problem is in the CUDA implementation.

My setup:

  • Nvidia 4060 16GB
  • Nvidia 3060 12GB

I'm loading the model using both cards in parallel.

With -ngl 99 the model gives a wrong answer, with and without -fa. With -ngl 0 the model gives the correct answer.

edit 2

If I restrict the model to a single card by using CUDA_VISIBLE_DEVICES=0 or CUDA_VISIBLE_DEVICES=1 then the model answers correctly. With CUDA_VISIBLE_DEVICES=0,1 the model gives wrong answers.

MoonRide303 commented 1 month ago

@matteoserva I tested with llama.cpp b3452 (compiled locally using MSVC 2022 + CUDA 12.5, cmake -B build-gpu -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DBUILD_SHARED_LIBS=ON + cmake --build build-gpu --config Release -j 6), running on a single GPU (RTX 4080), and used a GGUF I made from the updated official Meta-Llama-3.1-8B-Instruct repo (with the fixed tokenizer), in two passes:

  1. convert_hf_to_gguf.py --outtype f16 ..\Meta-Llama-3.1-8B-Instruct\ --outfile Meta-Llama-3.1-8B-Instruct-F16.gguf
  2. llama-quantize Meta-Llama-3.1-8B-Instruct-F16.gguf Meta-Llama-3.1-8B-Instruct-Q6_K.gguf Q6_K

oldgithubman commented 1 month ago

@MoonRide303

I updated my answer. See below

Could you give me more info on the version of llama.cpp and llama3.1 quants you are using? I used the latest llama.cpp from master and tested both a model that I quantized myself and the model downloaded from here: https://huggingface.co/lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf

The launch command is the following: ./llama-server -v -ngl 99 -m Meta-Llama-3.1-8B-Instruct-Q6_K.gguf -c 8192

I tested your parameters in the new UI using both chat and completion mode. In chat mode I copy-pasted the content of the user request. In completion mode I pasted the entire raw prompt.

The answer I'm getting doesn't match yours:

image

edit

With -ngl 0 the model gives the correct answer. Given that new information, I suspect the problem is in the CUDA implementation.

My setup:

  • Nvidia 4060 16GB
  • Nvidia 3060 12GB

I'm loading the model using both cards in parallel.

With -ngl 99 the model gives a wrong answer, with and without -fa. With -ngl 0 the model gives the correct answer.

edit 2

If I restrict the model to a single card by using CUDA_VISIBLE_DEVICES=0 or CUDA_VISIBLE_DEVICES=1 then the model answers correctly. With CUDA_VISIBLE_DEVICES=0,1 the model gives wrong answers.

wtf

mirek190 commented 1 month ago

Yesterday I used Bartowski's Llama 3.1 Q8 GGUF.

Perplexity test: no difference whether I set ctx to 8192 or 100000, same result.

build\bin\llama-perplexity.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-imatrix.q8_0.gguf --threads 30 -ngl 99 --hellaswag --hellaswag-tasks 400 -f models\hellaswag_val_full.txt -c 8192

result - 79.25

Today I used the imatrix GGUF from here https://huggingface.co/AI-Engine/Meta-Llama-3.1-8B-Instruct-GGUF/tree/main which claims "Use the -imatrix versions (they use imatrix and the bpe-llama tokenizer which should theoretically improve the output)"

result - 80.50

The result is improved, but the answers are still mostly wrong.

QUESTIONS:

I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?

answer 36

If my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5% of my current weight? 

answer 63.68kg

Llama 3.1 8B on https://groq.com/ gives something between 63.4 and 63.5 here, which is an excellent result, BUT the local LLM (build b3452), WHEN it is "almost" correct, gives something like 62.3 or 64.4, but mostly 0.6 kg or 137.2 kg ...

The local LLM gives mostly wrong answers; it's correct maybe 3/10 times. But https://groq.com/ with Llama 3.1 8B always answers those questions correctly, 10/10 times.
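
For reference, a quick arithmetic check of the two expected answers above (plain Python, numbers taken straight from the questions as stated):

# Apples: start with 10, lose 4, three birds drop 6 each.
apples = 10 - 4 + 3 * 6              # 24
# Coins: 3 found + 1 gained + your equal share of the 6 won (split 3 ways).
coins = 3 + 1 + 6 / 3                # 6.0
apples += coins / 0.5                # buy 12 more at 0.5 coins each
print(apples)                        # 36.0

# BMI: weight = BMI * height(m)^2, then add 5%.
weight = 20.5 * 1.72 ** 2            # about 60.65 kg
print(round(weight * 1.05, 2))       # 63.68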

bartowski1182 commented 1 month ago

@mirek190 I just updated my GGUFs an hour ago if you want to try those again, but I assume it'll be the same

dranger003 commented 1 month ago

Below is the HF transformers code updated to support Llama-3.1 RoPE. With llama.cpp, long context currently only works using --rope-freq-base 8000000.0.

https://github.com/huggingface/transformers/commit/d5a99dfcee6e94065cb7c83cc8ab6fc5daa0cc4e#diff-29ed0a73809daedb87a8b026d23f31ef5e0caedd8865aaecdef8a3a22ed7ca24R298

def _compute_llama3_parameters(
    config: PretrainedConfig, device: "torch.device", seq_len: Optional[int] = None, **rope_kwargs
) -> Tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies for llama 3.1.
    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length. Unused for this type of RoPE.
        rope_kwargs (`Dict`, *optional*):
            BC compatibility with the previous RoPE class instantiation, will be removed in v4.45.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
        post-processing scaling factor applied to the computed cos/sin.
    """
    # Gets the default RoPE parameters
    inv_freq, attention_factor = _compute_default_rope_parameters(config, device, seq_len, **rope_kwargs)

    factor = config.rope_scaling["factor"]  # `8` in the original implementation
    low_freq_factor = config.rope_scaling["low_freq_factor"]  # `1` in the original implementation
    high_freq_factor = config.rope_scaling["high_freq_factor"]  # `4` in the original implementation
    old_context_len = config.rope_scaling["original_max_position_embeddings"]  # `8192` in the original implementation

    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    new_freqs = []
    for freq in inv_freq:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            new_freqs.append(freq)
        elif wavelen > low_freq_wavelen:
            new_freqs.append(freq / factor)
        else:
            assert low_freq_wavelen != high_freq_wavelen
            smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
            new_freqs.append((1 - smooth) * freq / factor + smooth * freq)
    inv_freq = torch.tensor(new_freqs, dtype=inv_freq.dtype, device=inv_freq.device)
    return inv_freq, attention_factor
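
A rough usage sketch of the helper above, building the config by hand instead of downloading it (assumes a transformers version that ships this function, plus torch; the rope_scaling values are the ones from the released Llama 3.1 config.json):

import torch
from transformers import LlamaConfig
from transformers.modeling_rope_utils import _compute_llama3_parameters

# Default LlamaConfig gives head_dim = 4096 / 32 = 128, i.e. 64 inverse frequencies.
config = LlamaConfig(
    rope_theta=500000.0,
    rope_scaling={
        "rope_type": "llama3",
        "factor": 8.0,
        "low_freq_factor": 1.0,
        "high_freq_factor": 4.0,
        "original_max_position_embeddings": 8192,
    },
)
inv_freq, attention_factor = _compute_llama3_parameters(config, device=torch.device("cpu"))
print(inv_freq.shape, attention_factor)  # 64 scaled inverse frequencies, post-scaling factor 1.0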

mirek190 commented 1 month ago

@mirek190 I just updated my GGUFs an hour ago if you want to try those again, but I assume it'll be the same

yes, getting 80.50 now with perplexity

mirek190 commented 1 month ago

Yesterday I used Bartowski's Llama 3.1 Q8 GGUF.

Perplexity test: no difference whether I set ctx to 8192 or 100000, same result.

build\bin\llama-perplexity.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-imatrix.q8_0.gguf --threads 30 -ngl 99 --hellaswag --hellaswag-tasks 400 -f models\hellaswag_val_full.txt -c 8192

result - 79.25

Today I used the imatrix GGUF from here https://huggingface.co/AI-Engine/Meta-Llama-3.1-8B-Instruct-GGUF/tree/main which claims "Use the -imatrix versions (they use imatrix and the bpe-llama tokenizer which should theoretically improve the output)"

result - 80.50

The result is improved, but the answers are still mostly wrong.

QUESTIONS:

I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?

answer 36

If my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5% of my current weight? 

answer 63.68kg

Llama 3.1 8B on https://groq.com/ gives something between 63.4 and 63.5 here, which is an excellent result, BUT the local LLM (build b3452), WHEN it is "almost" correct, gives something like 62.3 or 64.4, but mostly 0.6 kg or 137.2 kg ...

The local LLM gives mostly wrong answers; it's correct maybe 3/10 times. But https://groq.com/ with Llama 3.1 8B always answers those questions correctly, 10/10 times.

Custom template or not, even with temperature 0 the results are the same. Tested with:

custom

llama-cli.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 8196 --interactive -ngl 99 --simple-io --in-prefix "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant." -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0 

built-in template

llama-cli.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 8196 --interactive -ngl 0 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0 --chat-template llama3  

suggested

llama-cli.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -p "<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -c 500 -ngl 50 -c 8192 --conversation

llama.cpp gives correct answers in more or less 3/10 attempts; https://groq.com/ gives correct answers in 10/10 attempts.

foldl commented 1 month ago

@mirek190 I suggest you test again with #8676 .

mirek190 commented 1 month ago

@mirek190 I suggest you test again with #8676 .

I built that... Win 11. I ran 10 tests; for each test I restarted llama.cpp. Tested with temp 0 and 0.6 - no difference in performance.

temp 0.6

llama-cli.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 8196 --interactive -ngl 99 --simple-io --in-prefix "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant." -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.6
Log start
main: build = 3457 (0d3ce090)
main: built with MSVC 19.39.33523.0 for x64
main: seed  = 1721900620
llama_model_loader: loaded meta data with 33 key-value pairs and 291 tensors from models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 7
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = /models_out/Meta-Llama-3.1-8B-Instruc...
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  31:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   532.31 MiB
llm_load_tensors:      CUDA0 buffer size =  7605.33 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8224
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1028.00 MiB
llama_new_context_with_model: KV self size  = 1028.00 MiB, K (f16):  514.00 MiB, V (f16):  514.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   562.07 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.07 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
main: in-suffix/prefix is specified, chat template will be disabled

system_info: n_threads = 30 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
llm_tokenizer_bpe::check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
main: interactive mode on.
Input prefix: '<|eot_id|><|start_header_id|>user<|end_header_id|>

'
Input suffix: '<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.600
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8224, n_batch = 2048, n_predict = -1, n_keep = 20

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - To return control to the AI, end your input with '\'.
 - To return control without starting a new line, end your input with '/'.

> I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river? /
Let's break down the events step by step:

1. You start with 10 apples.
2. You find 3 gold coins in the river, but this doesn't affect your apple count directly.
3. You lose 4 apples, leaving you with 6 apples.
4. You gain a gold coin from an unknown source (perhaps another river?).
5. Three birds drop 6 apples each, so that's 18 new apples. Now you have 24 apples in total.
6. You win 6 gold coins online and share them equally with your 2 teammates, which means you get 2 gold coins (6 ÷ 3 = 2).
7. Your current total is: 1) apples: 24 - 0 (since the gold coins don't affect apple count), 2) gold coins: 5 (3 from the river + 1 gained + 2 shared).

Now, let's calculate how many apples you can buy with your 5 gold coins:

Price of an apple: 0.5 coins
Number of apples you can buy = Total gold coins / Price per apple
= 5 / 0.5
= 10

You have enough gold coins to buy 10 more apples.

So, after buying the apples with your gold coins, you'll have a total of:

24 (initial apples) + 10 (newly bought apples)
= 34 apples

temp 0

llama-cli.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 8196 --interactive -ngl 99 --simple-io --in-prefix "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant." -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0
Log start
main: build = 3457 (0d3ce090)
main: built with MSVC 19.39.33523.0 for x64
main: seed  = 1721901236
llama_model_loader: loaded meta data with 33 key-value pairs and 291 tensors from models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 7
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = /models_out/Meta-Llama-3.1-8B-Instruc...
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  31:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   532.31 MiB
llm_load_tensors:      CUDA0 buffer size =  7605.33 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8224
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1028.00 MiB
llama_new_context_with_model: KV self size  = 1028.00 MiB, K (f16):  514.00 MiB, V (f16):  514.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   562.07 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.07 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
main: in-suffix/prefix is specified, chat template will be disabled

system_info: n_threads = 30 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
llm_tokenizer_bpe::check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
main: interactive mode on.
Input prefix: '<|eot_id|><|start_header_id|>user<|end_header_id|>

'
Input suffix: '<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8224, n_batch = 2048, n_predict = -1, n_keep = 20

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - To return control to the AI, end your input with '\'.
 - To return control without starting a new line, end your input with '/'.

> I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river? /
What a delightful series of events!

Let's break it down step by step:

1. You start with 10 apples.
2. You find 3 gold coins in the river, but this doesn't affect your apple count directly.
3. You lose 4 apples, so you now have: 10 - 4 = 6 apples.
4. The birds drop 6 apples each, and there are three of them, so that's a total of 18 new apples! Now you have: 6 + 18 = 24 apples.
5. You win 6 gold coins online, but you share them equally with your 2 teammates, which means you get 2 gold coins (6 ÷ 3 = 2). You now have a total of 3 gold coins (3 from the river + 2 from the game).
6. With all your coins, you buy apples at a price of 0.5 coins per apple. Since you have 3 gold coins, you can buy: 3 ÷ 0.5 = 6 apples.

Now, let's add these new apples to your total count: 24 (from the birds) + 6 (bought with coins) = 30 apples.

As for where the river is, I'm afraid that information isn't provided in the problem statement. The city near the river could be anywhere!

The CORRECT answer is 36.

results:

temp 0.6

34 36 34 38 34 38 42 38 40 40

temp 0

30 30 30 30 30 30 30 30 30 30

Seems even worse now.

That is actually confirmed by the perplexity result:

llama-perplexity.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --threads 30 -ngl 99 --hellaswag --hellaswag-tasks 400 -f models\hellaswag_val_full.txt -c 8192

Before that RoPE modification it was 80.50, now it is 78.25.

tristandruyen commented 1 month ago

@mirek190 I suggest you test again with #8676 .

I built that... Win 11. I ran 10 tests; for each test I restarted llama.cpp. Tested with temp 0 and 0.6 - no difference in performance.

@mirek190 Did you use the same GGUF or create a new one? AFAIK GGUFs have to be recreated for #8676 to work correctly...

mirek190 commented 1 month ago

@mirek190 I suggest you test again with #8676 .

I built that... Win 11. I ran 10 tests; for each test I restarted llama.cpp. Tested with temp 0 and 0.6 - no difference in performance.

@mirek190 Did you use the same GGUF or create a new one? AFAIK GGUFs have to be recreated for #8676 to work correctly...

I used this

https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/tree/main

Is it a proper one?

because it seems even worse now.

llama-perplexity.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --threads 30 -ngl 99 --hellaswag --hellaswag-tasks 400 -f models\hellaswag_val_full.txt -c 8192

Before that RoPE modification it was 80.50, now it is 78.25.

If not... where can I get a proper GGUF?

tristandruyen commented 1 month ago

I used this

https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/tree/main

Is it a proper one?

Just from the timestamps it seems this GGUF is too old to have been created with the fixes from #8676

mirek190 commented 1 month ago

Just from the timestamps it seems this GGUF is too old to have been created with the fixes from https://github.com/ggerganov/llama.cpp/pull/8676

Where can I get a proper GGUF?

tristandruyen commented 1 month ago

Where can I get a proper GGUF?

I am currently creating fresh ones, will take some time as my server is pretty busy...

mirek190 commented 1 month ago

Where can I get a proper GGUF?

I am currently creating fresh ones, will take some time as my server is pretty busy...

ok... I'll wait for a new GGUF to test ;)

tristandruyen commented 1 month ago

ok... I'll wait for a new GGUF to test ;)

~It's uploading now, should appear here in a few minutes: https://huggingface.co/qwp4w3hyb/Meta-Llama-3.1-8B-Instruct-iMat-GGUF/tree/main~

It's here: https://huggingface.co/qwp4w3hyb/Meta-Llama-3.1-8B-Instruct-iMat-GGUF/resolve/main/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf?download=true

mirek190 commented 1 month ago

ok... I'll wait for a new GGUF to test ;)

~It's uploading now, should appear here in a few minutes: https://huggingface.co/qwp4w3hyb/Meta-Llama-3.1-8B-Instruct-iMat-GGUF/tree/main~

It's here: https://huggingface.co/qwp4w3hyb/Meta-Llama-3.1-8B-Instruct-iMat-GGUF/resolve/main/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf?download=true

llama-perplexity.exe --model models/new3/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf --threads 30 -ngl 99 --hellaswag --hellaswag-tasks 400 -f models\hellaswag_val_full.txt -c 8192

Perplexity is 78.00, even worse than before ...

mirek190 commented 1 month ago

NOW with temp 0 the answer is proper (36), BUT with anything higher than temp 0 it is almost always wrong. Earlier the probability of a correct answer was around 3/10; now it is correct maybe 1/10 times or less ....

llama-cli.exe --model models/new3/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 8196 --interactive -ngl 99 --simple-io --in-prefix "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant." -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0
Log start
main: build = 3457 (0d3ce090)
main: built with MSVC 19.39.33523.0 for x64
main: seed  = 1721904202
llama_model_loader: loaded meta data with 34 key-value pairs and 292 tensors from models/new3/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 7
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:             llama.rope.scaling.attn_factor f32              = 1.000000
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                      quantize.imatrix.file str              = /home/tristand/ai/models/Meta-Llama-3...
llama_model_loader: - kv  31:                   quantize.imatrix.dataset str              = /home/tristand/ai/tools/llama.cpp/cal...
llama_model_loader: - kv  32:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  33:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   532.31 MiB
llm_load_tensors:      CUDA0 buffer size =  7605.34 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8224
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1028.00 MiB
llama_new_context_with_model: KV self size  = 1028.00 MiB, K (f16):  514.00 MiB, V (f16):  514.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   562.07 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.07 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
main: in-suffix/prefix is specified, chat template will be disabled

system_info: n_threads = 30 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
llm_tokenizer_bpe::check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
main: interactive mode on.
Input prefix: '<|eot_id|><|start_header_id|>user<|end_header_id|>

'
Input suffix: '<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8224, n_batch = 2048, n_predict = -1, n_keep = 20

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - To return control to the AI, end your input with '\'.
 - To return control without starting a new line, end your input with '/'.

> I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river? /
What a delightful series of events!

Let's break down what happened:

1. You started with 10 apples.
2. You found 3 gold coins in the river, but that doesn't affect your apple count.
3. You lost 4 apples, so you now have: 10 - 4 = 6 apples.
4. The birds dropped 6 x 3 = 18 apples, so you gained 18 apples. Now you have: 6 + 18 = 24 apples.
5. You won 6 gold coins online and had to share them with your teammates, so you got 6 / 2 = 3 gold coins. You now have a total of: 3 (from the river) + 3 (won online) = 6 gold coins.
6. With all 6 gold coins, you bought apples at 0.5 coins per apple. So, you can buy: 6 / 0.5 = 12 more apples.

Now, let's add up your total number of apples:

24 (from the birds) + 12 (bought with gold coins) = 36 apples

As for where the river is, I'm afraid that information isn't provided in the scenario. The river could be anywhere!

mirek190 commented 1 month ago

Another example where on groq.com I am getting an almost perfect answer, very close to 63.68 kg. Here, even with temp 0, the answer is 7.2765 kg ... wtf

For instance, Gemma 2 9B with llama.cpp has no problems with those questions and shows the same level of reasoning as on groq.com.

There is really something wrong with the GGUFs or with llama.cpp ....

question

If my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5% of my current weight?

llama-cli.exe --model models/new3/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 8196 --interactive -ngl 99 --simple-io --in-prefix "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant." -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0
Log start
main: build = 3457 (0d3ce090)
main: built with MSVC 19.39.33523.0 for x64
main: seed  = 1721904952
llama_model_loader: loaded meta data with 34 key-value pairs and 292 tensors from models/new3/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 7
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:             llama.rope.scaling.attn_factor f32              = 1.000000
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                      quantize.imatrix.file str              = /home/tristand/ai/models/Meta-Llama-3...
llama_model_loader: - kv  31:                   quantize.imatrix.dataset str              = /home/tristand/ai/tools/llama.cpp/cal...
llama_model_loader: - kv  32:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  33:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   532.31 MiB
llm_load_tensors:      CUDA0 buffer size =  7605.34 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8224
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1028.00 MiB
llama_new_context_with_model: KV self size  = 1028.00 MiB, K (f16):  514.00 MiB, V (f16):  514.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   562.07 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.07 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
main: in-suffix/prefix is specified, chat template will be disabled

system_info: n_threads = 30 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
llm_tokenizer_bpe::check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
main: interactive mode on.
Input prefix: '<|eot_id|><|start_header_id|>user<|end_header_id|>

'
Input suffix: '<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8224, n_batch = 2048, n_predict = -1, n_keep = 20

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - To return control to the AI, end your input with '\'.
 - To return control without starting a new line, end your input with '/'.

> If my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5% of my current weight? /
To calculate your current weight, we need to use the BMI formula: BMI = weight (in kg) / height (in meters)².

First, let's convert your height from centimeters to meters:
172 cm = 1.72 m

Now, we can plug in your BMI value and solve for weight:

20.5 = weight / (1.72)²
20.5 = weight / 2.9584
weight ≈ 6.93 kg

So, your current weight is approximately 6.93 kg.

To calculate the weight gain of 5%, we'll multiply your current weight by 0.05:
6.93 kg × 0.05 = 0.3465 kg

Now, add this to your current weight:
6.93 kg + 0.3465 kg ≈ 7.2765 kg

So, if you gained 5% of your current weight, you would weigh approximately 7.28 kg.

Keep in mind that BMI is just an estimate and doesn't account for muscle mass or body composition. If you're concerned about your weight or health, it's always best to consult a healthcare professional!
ddh0 commented 1 month ago

All tests should be done with temperature 0.0 btw. Using a temperature higher than 0 is just adding noise to your tests. This also means you only have to run the test one time instead of many times.
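In other words, as the temperature goes to 0 the softmax collapses onto the single most likely token, so decoding becomes deterministic and a single run is representative. A tiny illustration in Python (toy logits and function name of my own, not llama.cpp code):

```
import math

def softmax_with_temp(logits, temp):
    if temp <= 0:                      # greedy limit: all mass on the argmax
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    exps = [math.exp(l / temp) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.5, 0.2]
for t in (1.0, 0.5, 0.0):
    print(t, [round(p, 3) for p in softmax_with_temp(logits, t)])
# as temp shrinks, the distribution collapses onto the single best token,
# so temp 0 decoding is deterministic and one run per test is enough
```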

I'm thankful for the work of the contributors in this thread 🙌

tristandruyen commented 1 month ago

@mirek190

Another example: on groq.com I get an almost perfect answer, very close to 63.68 kg. Here, even with temp 0, the answer is 7.2765 kg ...wtf

For instance, Gemma 2 9B with llama.cpp has no problem with these questions and reasons at the same level as groq.com.

Is something really wrong with the GGUF or with llama.cpp...?

[...]

I get this answer, with temp 0.0, which seems closer to correct, but still differs from groq at temp 0.0 (which could be expected due to different quants?):

To calculate your current weight, we need to use the BMI formula: BMI = weight (in kg) / height (in meters)².

First, let's convert your height from centimeters to meters: 172 cm = 1.72 m.

Now, we can plug in your BMI value (20.5) and height (1.72 m) into the formula to solve for your current weight:

20.5 = weight / (1.72)²
20.5 = weight / 2.9584
weight = 20.5 × 2.9584
weight ≈ 60.5 kg

Now, let's calculate 5% of your current weight:
5% of 60.5 kg = 0.05 × 60.5 kg ≈ 3.025 kg

To find your new weight, add 5% of your current weight to your current weight:
New weight = current weight + 5% of current weight
New weight = 60.5 kg + 3.025 kg
New weight ≈ 63.525 kg

So, if you gained 5% of your current weight, you would weigh approximately 63.5 kg.

But instead of llama-cli I used the server, to take advantage of the built-in chat template + EOT handling. This suggests to me that something differs between your chat template/prefix/suffix and the built-in ones, or maybe the repeat penalty is changing things?

Started server via:

```
./llama-server -ngl 9999 -c 8096 -m ../../models_download/direct/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf
```

Asked like this:

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "temperature": 0.0,
    "messages": [
      {"role": "system", "content": "You are a helpful, smart, kind, and efficient AI assistant."},
      {
        "role": "user",
        "content": "If my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5% of my current weight?"
      }
    ]
  }' | jq -r .choices[].message.content
```
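For reference, the arithmetic the model is expected to reproduce can be checked in a couple of lines of Python (just a sanity check of the target answer, not anything llama.cpp-specific; variable names are mine):

```
# BMI = weight_kg / height_m**2  =>  weight_kg = BMI * height_m**2
bmi = 20.5
height_m = 172 / 100                    # 1.72 m

current = bmi * height_m ** 2           # ~60.65 kg
after_gain = current * 1.05             # +5% of current weight

print(f"current weight: {current:.2f} kg")      # ~60.65
print(f"after 5% gain:  {after_gain:.2f} kg")   # ~63.68, matching the groq.com answer
```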
mirek190 commented 1 month ago

All tests should be done with temperature 0.0 btw. Using a temperature higher than 0 is just adding noise to your tests. This also means you only have to run the test one time instead of many times.

I'm thankful for the work of the contributors in this thread 🙌

As I tested on groq.com, I'm 100% sure there is no temp 0 there, because the output always looks different, but the answers are always CORRECT.

tristandruyen commented 1 month ago

As I tested on groq.com, I'm 100% sure there is no temp 0 there, because the output always looks different, but the answers are always CORRECT.

You can set temp to 0 when using groq via console.groq.com

theycallmeloki commented 1 month ago

Spotted a GGUF on X that I thought might be interesting for comparison

GGUF-FIXED

tristandruyen commented 1 month ago

Spotted a GGUF on X that I thought might be interesting for comparison

GGUF-FIXED

As far as I understand, this only changes the tokenizer type from smaug-bpe to llama-bpe, which is already fixed in newer GGUFs since it was caused by an upstream error in the Hugging Face repo. It also likely does not include the fixes from #8676.

Also, I can't run this as it's 405B :sob:

mirek190 commented 1 month ago

@mirek190

Another example: on groq.com I get an almost perfect answer, very close to 63.68 kg. Here, even with temp 0, the answer is 7.2765 kg ...wtf For instance, Gemma 2 9B with llama.cpp has no problem with these questions and reasons at the same level as groq.com. Is something really wrong with the GGUF or with llama.cpp...? [...]

I get this answer, with temp 0.0, which seems closer to correct, but still differs from groq at temp 0.0 (which could be expected due to different quants?):

To calculate your current weight, we need to use the BMI formula: BMI = weight (in kg) / height (in meters)².

First, let's convert your height from centimeters to meters: 172 cm = 1.72 m.

Now, we can plug in your BMI value (20.5) and height (1.72 m) into the formula to solve for your current weight:

20.5 = weight / (1.72)²
20.5 = weight / 2.9584
weight = 20.5 × 2.9584
weight ≈ 60.5 kg

Now, let's calculate 5% of your current weight:
5% of 60.5 kg = 0.05 × 60.5 kg ≈ 3.025 kg

To find your new weight, add 5% of your current weight to your current weight:
New weight = current weight + 5% of current weight
New weight = 60.5 kg + 3.025 kg
New weight ≈ 63.525 kg

So, if you gained 5% of your current weight, you would weigh approximately 63.5 kg.

But instead of llama-cli I used the server, to take advantage of the built-in chat template + EOT handling. This suggests to me that something differs between your chat template/prefix/suffix and the built-in ones, or maybe the repeat penalty is changing things?

Started server via:

./llama-server -ngl 9999 -c 8096 -m ../../models_download/direct/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf

Asked like this:

curl http://localhost:8080/v1/chat/completions \
              -H "Content-Type: application/json" \
              -d '{
        "temperature": 0.0,
        "messages": [
          {"role": "system", "content": "You are a helpful, smart, kind, and efficient AI assistant."},
          {
            "role": "user",
            "content": "If my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5% of my current weight?"
          }
        ]
     }' | jq -r .choices[].message.content

In that case, why am I getting a different response with temp 0?

llama-cli.exe --model models/new3/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 8196 --interactive -ngl 99 --simple-io --in-prefix "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant." -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0
Log start
main: build = 3457 (0d3ce090)
main: built with MSVC 19.39.33523.0 for x64
main: seed  = 1721907494
llama_model_loader: loaded meta data with 34 key-value pairs and 292 tensors from models/new3/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 7
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:             llama.rope.scaling.attn_factor f32              = 1.000000
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                      quantize.imatrix.file str              = /home/tristand/ai/models/Meta-Llama-3...
llama_model_loader: - kv  31:                   quantize.imatrix.dataset str              = /home/tristand/ai/tools/llama.cpp/cal...
llama_model_loader: - kv  32:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  33:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   532.31 MiB
llm_load_tensors:      CUDA0 buffer size =  7605.34 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8224
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1028.00 MiB
llama_new_context_with_model: KV self size  = 1028.00 MiB, K (f16):  514.00 MiB, V (f16):  514.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   562.07 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.07 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
main: in-suffix/prefix is specified, chat template will be disabled

system_info: n_threads = 30 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
llm_tokenizer_bpe::check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
main: interactive mode on.
Input prefix: '<|eot_id|><|start_header_id|>user<|end_header_id|>

'
Input suffix: '<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8224, n_batch = 2048, n_predict = -1, n_keep = 20

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - To return control to the AI, end your input with '\'.
 - To return control without starting a new line, end your input with '/'.

> If my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5% of my current weight? /
To calculate your current weight, we need to use the BMI formula: BMI = weight (in kg) / height (in meters)².

First, let's convert your height from centimeters to meters:
172 cm = 1.72 m

Now, we can plug in your BMI value and solve for weight:

20.5 = weight / (1.72)²
20.5 = weight / 2.9584
weight ≈ 6.93 kg

So, your current weight is approximately 6.93 kg.

To calculate the weight gain of 5%, we'll multiply your current weight by 0.05:
6.93 kg × 0.05 = 0.3465 kg

Now, add this to your current weight:
6.93 kg + 0.3465 kg ≈ 7.2765 kg

So, if you gained 5% of your current weight, you would weigh approximately 7.28 kg.

Keep in mind that BMI is just an estimate and doesn't account for muscle mass or body composition. If you're concerned about your weight or health, it's always best to consult a healthcare professional!
tristandruyen commented 1 month ago

In that case, why am I getting a different response with temp 0?

As I said, this likely indicates differences in your prefix/suffix compared to the built-in template, and/or the repeat penalty changing something.

tristandruyen commented 1 month ago

In that case, why am I getting a different response with temp 0?

As I said, this likely indicates differences in your prefix/suffix compared to the built-in template, and/or the repeat penalty changing something.

I reran it with trace logs, and this is the prompt the server builds with the built-in template:

 prompt tokenized | tid="129900350971200" timestamp=1721907732 id_slot=0 id_task=0 n_ctx=8096 n_keep=0 n_prompt_tokens=61 prompt_tokens="<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nIf my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5% of my current weight?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
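In other words, the built-in template expands the messages roughly as follows. This is a quick sketch that reproduces the string above, not the actual Jinja template stored in the GGUF, and the helper name is mine:

```
def llama3_prompt(messages):
    # mirrors the tokenized prompt above: BOS, one header block per message
    # terminated by <|eot_id|>, then an open assistant header for the reply
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        )
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

msgs = [
    {"role": "system", "content": "You are a helpful, smart, kind, and efficient AI assistant."},
    {"role": "user", "content": "If my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5% of my current weight?"},
]
print(llama3_prompt(msgs))
```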
matteoserva commented 1 month ago

@mirek190 With repeat penalty at 1.0 your prompt returns the correct answer
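For context: repetition penalty rescales the logits of tokens that already appeared in the recent context (repeat_last_n), which is why it can derail arithmetic, where digits and unit tokens necessarily repeat. A minimal sketch of the commonly used formulation follows; it is an illustration with made-up toy logits, not llama.cpp's exact code:

```
def apply_repeat_penalty(logits, recent_tokens, penalty=1.1):
    """Make every recently seen token less likely (CTRL-style penalty)."""
    out = dict(logits)
    for tok in set(recent_tokens):
        if tok in out:
            # positive logits are divided by the penalty, negative ones multiplied,
            # so the token always loses probability mass
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# toy example: the model wants to write "5" again while doing the BMI math
logits = {"5": 2.5, "6": 2.4, ".": 1.0}
print(apply_repeat_penalty(logits, recent_tokens=["5", "."], penalty=1.1))
# "5" drops from 2.5 to ~2.27 and now loses to "6" at temperature 0
```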

mirek190 commented 1 month ago

In that case, why am I getting a different response with temp 0?

As I said, this likely indicates differences in your prefix/suffix compared to the built-in template, and/or the repeat penalty changing something.

I reran it with trace logs, and this is the prompt the server builds with the built-in template:

 prompt tokenized | tid="129900350971200" timestamp=1721907732 id_slot=0 id_task=0 n_ctx=8096 n_keep=0 n_prompt_tokens=61 prompt_tokens="<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nIf my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5% of my current weight?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

via curl - win 11

C:\Users\mirek190>curl "http://127.0.0.1:8080/v1/chat/completions" ^
More? -H "Content-Type: application/json" ^
More? -d "{\"temperature\": 0.0, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful, smart, kind, and efficient AI assistant.\"}, {\"role\": \"user\", \"content\": \"If my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5%% of my current weight?\"}]}"
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"To calculate your current weight, we need to use the BMI formula: BMI = weight (in kg) / height (in meters)².\n\nFirst, let's convert your height from centimeters to meters: 172 cm = 1.72 m.\n\nNow, we can use the BMI formula to calculate your current weight:\n\nBMI = 20.5 = weight / (1.72)²\nweight = BMI × (1.72)²\nweight = 20.5 × 2.959\nweight ≈ 60.5 kg\n\nNow, let's calculate 5% of your current weight:\n5% of 60.5 kg = 0.05 × 60.5 kg ≈ 3.025 kg\n\nTo find your new weight, add 5% of your current weight to your current weight:\nNew weight = current weight + 5% of current weight\nNew weight = 60.5 kg + 3.025 kg\nNew weight ≈ 63.525 kg\n\nSo, if you gained 5% of your current weight, you would weigh approximately 63.5 kg.","role":"assistant"}}],"created":1721908245,"model":"unknown","object":"chat.completion","usage":{"completion_tokens":232,"prompt_tokens":61,"total_tokens":293},"id":"chatcmpl-R7ZxebNrTrRvLuXLGuCodkeHQLvK9taC"}

seems ok

mirek190 commented 1 month ago

@mirek190 With repeat penalty at 1.0 your prompt returns the correct answer

YES, with --repeat-penalty 1.0 and temp 0 it is correct.

The help shows:

 --repeat-penalty N       penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)

So --repeat-penalty 1.0 should be the default? But it is not?

llama-cli.exe --model models/new3/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 8196 --interactive -ngl 99 --simple-io --in-prefix "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant." -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0 --repeat-penalty 1.0
Log start
main: build = 3457 (0d3ce090)
main: built with MSVC 19.39.33523.0 for x64
main: seed  = 1721908579
llama_model_loader: loaded meta data with 34 key-value pairs and 292 tensors from models/new3/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 7
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:             llama.rope.scaling.attn_factor f32              = 1.000000
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                      quantize.imatrix.file str              = /home/tristand/ai/models/Meta-Llama-3...
llama_model_loader: - kv  31:                   quantize.imatrix.dataset str              = /home/tristand/ai/tools/llama.cpp/cal...
llama_model_loader: - kv  32:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  33:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   532.31 MiB
llm_load_tensors:      CUDA0 buffer size =  7605.34 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8224
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1028.00 MiB
llama_new_context_with_model: KV self size  = 1028.00 MiB, K (f16):  514.00 MiB, V (f16):  514.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   562.07 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.07 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
main: in-suffix/prefix is specified, chat template will be disabled

system_info: n_threads = 30 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
llm_tokenizer_bpe::check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
main: interactive mode on.
Input prefix: '<|eot_id|><|start_header_id|>user<|end_header_id|>

'
Input suffix: '<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8224, n_batch = 2048, n_predict = -1, n_keep = 20

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - To return control to the AI, end your input with '\'.
 - To return control without starting a new line, end your input with '/'.

> If my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5% of my current weight? /
To calculate your current weight, we need to use the BMI formula: BMI = weight (in kg) / height (in meters)².

First, let's convert your height from centimeters to meters: 172 cm = 1.72 m.

Now, we can use the BMI formula to calculate your current weight:

BMI = 20.5 = weight / (1.72)²
weight = BMI × (1.72)²
weight = 20.5 × 2.959
weight ≈ 60.5 kg

Now, let's calculate 5% of your current weight:
5% of 60.5 kg = 0.05 × 60.5 kg
= 3.025 kg

To find your new weight, add the 5% gain to your current weight:
New weight = current weight + 5% gain
= 60.5 kg + 3.025 kg
= 63.525 kg

So, if you gained 5% of your current weight, you would weigh approximately 63.53 kg.
mirek190 commented 1 month ago

YES

updated gguf q8 https://huggingface.co/qwp4w3hyb/Meta-Llama-3.1-8B-Instruct-iMat-GGUF/tree/main

--repeat-penalty 1.0

Fixed RoPE, updated GGUF, and --repeat-penalty 1.0 solve most of the problems (still not perfect)! Now with temperature 0.6 the answers are still mostly CORRECT: 9/10 on the first question and also 9/10 on the second.

So the most correct command I have is:

llama-cli.exe --model models/new3/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 8196 --interactive -ngl 99 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.6 --repeat-penalty 1.0 --chat-template llama3

or with my own template:

llama-cli.exe --model models/new3/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 8196 --interactive -ngl 99 --simple-io --in-prefix "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant." -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.6 --repeat-penalty 1.0

THANK YOU.

tristandruyen commented 1 month ago

YES

updated gguf q8 https://huggingface.co/qwp4w3hyb/Meta-Llama-3.1-8B-Instruct-iMat-GGUF/tree/main

--repeat-penalty

Fixed RoPE, updated GGUF, and --repeat-penalty solve most of the problems (still not perfect)! Now with temperature 0.6 the answers are still mostly CORRECT: 9/10 on the first question and also 9/10 on the second.

THANK YOU.

You're welcome :)

For whoever needs them, I'm also making & uploading 70B quants now:

https://huggingface.co/qwp4w3hyb/Meta-Llama-3.1-70B-Instruct-iMat-GGUF

First one should appear in ~30 minutes.

mirek190 commented 1 month ago

I want ;)

llama-perplexity.exe --model models/new3/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf --threads 30 -ngl 99 --hellaswag --hellaswag-tasks 400 -f models\hellaswag_val_full.txt -c 8192 --repeat-penalty 1.0  

Perplexity even with --repeat-penalty 1.0 is still 78.00; before the RoPE changes and GGUF update it was 80.50.

Green-Sky commented 1 month ago

Check out your sampling parameters:

sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature

CFG -> does nothing here
Penalties -> repeat penalty is 1.0 and the others are 0.0, so they do nothing
top_k -> only keeps the top 40 predictions
tfs_z -> disabled
typical_p -> disabled
top_p -> 0.95 discards most predictions (iirc)
min_p -> 0.05 keeps a lot, but top_p did most of the work already
temp -> changes the probabilities of the remaining tokens (if it's not 1.0); if < 1.0 it accentuates the already likely tokens, if > 1.0 it boosts the less likely ones
temp 0.0 should act like keeping only the most likely token, the same as top_k of 1
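To make that chain concrete, here is a rough sketch of the truncation samplers in plain Python. It follows the order printed above but is a simplification, not llama.cpp's actual implementation, and the toy logits are invented:

```
import math, random

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def sample(logits, top_k=40, top_p=0.95, min_p=0.05, temp=0.0):
    # top_k: keep only the k highest-logit tokens
    kept = dict(sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k])

    # top_p: keep the smallest set whose cumulative probability reaches top_p
    probs = softmax(kept)
    acc, nucleus = 0.0, {}
    for t, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        nucleus[t] = kept[t]
        acc += p
        if acc >= top_p:
            break

    # min_p: drop tokens much less likely than the best remaining one
    probs = softmax(nucleus)
    best = max(probs.values())
    nucleus = {t: v for t, v in nucleus.items() if probs[t] >= min_p * best}

    # temperature: temp <= 0 degenerates to greedy (argmax), same as top_k = 1
    if temp <= 0.0:
        return max(nucleus, key=nucleus.get)
    probs = softmax({t: v / temp for t, v in nucleus.items()})
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(sample({"63": 2.0, "60": 1.5, "7": 0.2}))  # -> "63" with the default temp=0.0
```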

@mirek190

perplexity even with --repeat-penalty 1.0 is still 78.00

sampling does not occur while calculating perplexity -> no effect

ghchris2021 commented 1 month ago

@tristandruyen

YES updated gguf q8 https://huggingface.co/qwp4w3hyb/Meta-Llama-3.1-8B-Instruct-iMat-GGUF/tree/main

--repeat-penalty

Fixed RoPE, updated GGUF, and --repeat-penalty solve most of the problems (still not perfect)! Now with temperature 0.6 the answers are still mostly CORRECT: 9/10 on the first question and also 9/10 on the second. THANK YOU.

You're welcome :)

For whoever needs them, I'm also making & uploading 70B quants now:

https://huggingface.co/qwp4w3hyb/Meta-Llama-3.1-70B-Instruct-iMat-GGUF

First one should appear in ~30 minutes.

Thank you very much for making those available!

I wonder if it might be a good idea to post some additional details in the README: "requires the latest master" is less clear than listing the commit hash that was used when the files were generated. Days or weeks from now other changes might become relevant, and people could then check whether GGUFs they previously pulled should be regenerated with an even newer llama.cpp version.

It would also help people learn, and clarify process uncertainties, if the commands used to create the GGUFs were listed in the README or an auxiliary file. That would make it clearer how to do it and whether any special or unusual steps are needed to regenerate such GGUFs in the future. For example, you already mentioned a key PR patch being needed, though I don't know to what extent conversion arguments or configuration file settings sometimes need to be customized beyond what the scripts can automatically derive from the model name and JSON files.

tristandruyen commented 1 month ago

Thank you very much for making those available!

I wonder if it might be a good idea to post some additional details in the README: "requires the latest master" is less clear than listing the commit hash that was used when the files were generated. Days or weeks from now other changes might become relevant, and people could then check whether GGUFs they previously pulled should be regenerated with an even newer llama.cpp version.

It would also help people learn, and clarify process uncertainties, if the commands used to create the GGUFs were listed in the README or an auxiliary file. That would make it clearer how to do it and whether any special or unusual steps are needed to regenerate such GGUFs in the future. For example, you already mentioned a key PR patch being needed, though I don't know to what extent conversion arguments or configuration file settings sometimes need to be customized beyond what the scripts can automatically derive from the model name and JSON files.

I usually update the description with a link to the commit hash of the required version once the relevant PRs are merged.

I also really need to automate README.md/model card generation; I guess I could then add a link with further information in there...

If you want more details on how the whole process works, here is a gist of my conversion script:

https://gist.github.com/tristandruyen/941d2e0526e4aedfa026e4e53411a4dc

I think this is starting to get a bit off-topic, so if you have any further questions, feel free to reach out via the email on my GitHub profile: tristan@vault81.mozmail.com