ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Investigate gemma 2 generation quality #8240

Open ngxson opened 3 days ago

ngxson commented 3 days ago

Initial reports can be seen from https://github.com/ggerganov/llama.cpp/pull/8227

TODO: add more info here

qnixsynapse commented 3 days ago

Just to confirm, gemma2's window size is hard-coded, right?

ngxson commented 3 days ago

Ref comment: https://github.com/ggerganov/llama.cpp/pull/8227#issuecomment-2198638906

Issues with math questions may indicate a problem with the tokenizer. We should first check whether the llama.cpp tokenizer output matches gemma2's tokenizer output.

ngxson commented 3 days ago

Just to confirm, gemma2's window size is hard-coded, right?

The default value is hard-coded (in order not to break existing gguf files), but it will be overridden by the value in the gguf (in case you re-convert to get a new gguf).

Metadata key is gemma2.attention.sliding_window
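As a rough illustration of that fallback (a sketch, not llama.cpp's actual loader code): the `metadata` dict and the `read_sliding_window` helper are hypothetical, and the 4096 default is an assumption taken from Gemma 2's published config rather than from this thread.

```python
# Hypothetical sketch of "hard-coded default, overridden by GGUF metadata".
# ASSUMPTION: 4096 is the default window; it is not stated in this thread.
DEFAULT_SLIDING_WINDOW = 4096
SLIDING_WINDOW_KEY = "gemma2.attention.sliding_window"


def read_sliding_window(metadata: dict) -> int:
    """Return the window size from metadata, or the default if the key is absent."""
    return int(metadata.get(SLIDING_WINDOW_KEY, DEFAULT_SLIDING_WINDOW))


old_gguf = {}                              # converted before the key existed
new_gguf = {SLIDING_WINDOW_KEY: 4096}      # re-converted file carrying the key
print(read_sliding_window(old_gguf), read_sliding_window(new_gguf))
```

An older file keeps working because the key is optional; a re-converted file carries its own value.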

BugReporterZ commented 3 days ago

For what it's worth, I have found that Gemma-2-27B quantized to Q6_K often makes mistakes/typos with proper names compared to Gemma-2-9B in Q8_0. I don't think the difference in quantization quality would be so large, but this could be something to watch for.

matteoserva commented 3 days ago

I tested all working implementations of the gemma-2-27b inference code. The implementation in llama.cpp either outputs subpar results or breaks completely.

Reference models:

Compared implementations:

Not tested: hf transformers

launch commands

gemma.cpp:

./gemma --tokenizer ./gemma-tokenizer.spm --model 27b-it --compressed_weights ./gemma-2-27b-it-sfp.sbs --temperature 0.01

chatllm:

./obj/main -m ./gemma-2-27b-it-Q8_0.bin -i

llama.cpp:

$ python3 convert-hf-to-gguf.py ./gemma-2-27b-it/ --outfile ./gemma-2-27b-it.gguf
$ ./llama-server -ngl 15 -t 6 -c 8192 --host 0.0.0.0 -m ./gemma-2-27b-it.gguf --override-kv tokenizer.ggml.add_bos_token=bool:false

Outputs:

gemma.cpp:

tanto va la gatta al lardo che ci lascia lo zampino.

chatllm.cpp at Q8_0:

tanto va la gatta al lardo che ci lascia lo zampino.

ai studio with temperature 1.0:

tanto va la gatta al lardo che ci lascia lo zampino.

llama.cpp at temperature 0.01:

<bos><start_of_turn>user
Completa la frase: tanto va la gatta al lardo che...<end_of_turn>
<start_of_turn>model
... **se la scrofa la ingrassa.** 

Esta es una frase hecha italiana que significa que si alguien insiste [...]

Analysis of results

The model in llama.cpp spits out random Italian words and then starts speaking Spanish. All the other implementations return the correct answer. llama.cpp gives incorrect responses even at low quantization or without quantization. The other implementations give the same correct response at Q8_0 or at high temperature.

I tried many other questions from my benchmarks. The other three implementations all agree on the same correct response. llama.cpp gives a different and incorrect response.

EDIT: formatting and paths

qnixsynapse commented 3 days ago

9B-IT is working great and now I can increase the ctx size. :)

ngxson commented 3 days ago

Issues with math questions may indicate a problem with the tokenizer. We should first check whether the llama.cpp tokenizer output matches gemma2's tokenizer output.

I don't know if I'm heading in the right direction or not:

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")

chktxt = 'Repeat the question and then answer it: Matteo has 20 apples, he buys 20 oranges. Then he discards half of his fruits equally. Then he discards a quarter of his fruits equally between apples and oranges. How many apples remain?'

tokenizer(chktxt)['input_ids'][1:]

# [41422, 573, 2872, 578, 1492, 3448, 665, 235292, 100006, 919, 235248, 235284, 235276, 34188, 235269, 693, 58015, 235248, 235284, 235276, 72638, 235265, 5040, 693, 9027, 2050, 3933, 576, 926, 16803, 16404, 235265, 5040, 693, 9027, 2050, 476, 9453, 576, 926, 16803, 16404, 1865, 34188, 578, 72638, 235265, 2250, 1767, 34188, 5822, 235336]

Compared to the llama.cpp output (using llama-server):

{"tokens":[41422,573,2872,578,1492,3448,665,235292,100006,919,235248,235284,235276,34188,235269,693,58015,235248,235284,235276,72638,235265,5040,693,63845,235256,3933,576,926,16803,16404,235265,5040,693,63845,235256,476,9453,576,926,16803,16404,1865,34188,578,72638,235265,2250,1767,34188,5822,235336]}

The word "discards" is tokenized differently: transformers produces 9027, 2050, while llama.cpp produces 63845, 235256.
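A quick way to locate such mismatches automatically is to diff the two ID lists. The sketch below uses Python's `difflib` on a short excerpt of the sequences quoted above; `token_diffs` is a hypothetical helper name, not part of transformers or llama.cpp.

```python
from difflib import SequenceMatcher


def token_diffs(reference, candidate):
    """Return (index_in_reference, reference_span, candidate_span) for each mismatch."""
    matcher = SequenceMatcher(a=reference, b=candidate, autojunk=False)
    return [
        (i1, reference[i1:i2], candidate[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Excerpt around the first "discards" from the two outputs above.
hf_ids = [5040, 693, 9027, 2050, 3933, 576]
llama_ids = [5040, 693, 63845, 235256, 3933, 576]

for index, ref_span, cand_span in token_diffs(hf_ids, llama_ids):
    print(f"at position {index}: transformers={ref_span} llama.cpp={cand_span}")
```

The same function can be run on the full lists to confirm that "discards" is the only word affected in this prompt.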

tristandruyen commented 3 days ago

I noticed something possibly interesting:

The old but closer to correct GGUF [Q6_K_L] is from this commit (I matched the sha256 hashes to make sure)

AFAIK these initial versions were not created from scratch by llama.cpp, but were based on the f32 GGUF provided directly by Google on Kaggle, although AFAIK these initial GGUFs had various other issues...

I see 2 possible causes:

Logs:

  1. curl is with a "new" GGUF
  2. curl is with the linked 4 day old GGUF (both Q6_K_L)

❯ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{ "temperature": 0.1, "messages": [ { "role": "user", "content": "Completa la frase: tanto va la gatta al lardo che..." } ] }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"... se la scrofa la ingrassa. \n\nEsta es una frase hecha italiana que significa que si alguien insiste mucho en algo, al final lo conseguirá, aunque sea por casualidad o por la ayuda de alguien más. \n","role":"assistant"}}],"created":1719853875,"model":"unknown","object":"chat.completion","usage":{"completion_tokens":51,"prompt_tokens":24,"total_tokens":75},"id":"chatcmpl-uXDEjiyq0JGjwgg1qTlA2LGqEDhTxxsG"}

❯ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{ "temperature": 0.1, "messages": [ { "role": "user", "content": "Completa la frase: tanto va la gatta al lardo che..." } ] }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"...ci si lascia lo zampino. \n","role":"assistant"}}],"created":1719853954,"model":"unknown","object":"chat.completion","usage":{"completion_tokens":12,"prompt_tokens":42,"total_tokens":54},"id":"chatcmpl-jKmHo2x1dViomeiWLc8K6F3o1WJRsccT"}


launch command (latest llama.cpp 49122a873f54615626d1b49a2a39013ed4be98d5):

./llama-server -ngl 999 -c 4000 --host 0.0.0.0 -m path_to.gguf --chat-template gemma2

matteoserva commented 3 days ago

@tristandruyen I think the result you provided is still wrong even for the outdated gguf.

The response from outdated gguf is "ci si lascia lo zampino". The only correct response for that question is "ci lascia lo zampino". I used that test for the exact reason that it doesn't admit any variation in the response.

tristandruyen commented 3 days ago

@tristandruyen I think the result you provided is still wrong even for the outdated gguf.

The response from outdated gguf is "ci si lascia lo zampino". The only correct response for that question is "ci lascia lo zampino". I used that test for the exact reason that it doesn't admit any variation in the response.

My bad, as I do not speak Italian my brain parsed it as correct... It's still kinda interesting that it's much closer to the correct response though...

bartowski1182 commented 3 days ago

We still don't know what the conversion code Google used was, so it's possible that yes there's still something missing...

But the Google one definitely has a bad tokenizer, so if that was somehow fixed we may be able to see the proper performance, if only someone was able to contact them 🥲

ggerganov commented 3 days ago

@ngxson This indicates a problem with the tokenizer conversion. I don't fully understand the details to fix it, but a simple observation that I found is using:

diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index 4a7f500f..d7eaf9cd 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -2345,7 +2345,7 @@ class Gemma2Model(Model):
     model_arch = gguf.MODEL_ARCH.GEMMA2

     def set_vocab(self):
-        self._set_vocab_llama_hf()
+        self._set_vocab_sentencepiece()
         self.gguf_writer.add_add_space_prefix(False)

     def set_gguf_parameters(self):

This would tokenize correctly the word "discards", but there are other problems with added/special tokens not being added at all. So some fix for the vocabulary conversion is necessary

JeroenAdam commented 3 days ago

For me, Gemma2 27b is going off the rails as soon as 'slot context shift' occurs. I get high quality output until that point. My config: latest build b3274 CUDA on Quadro P5000, 7K context set and running Q3_K_M (uploaded yesterday by bartowski). Here is an example of Java code abruptly followed by totally unrelated stuff.

**3. Security config

java @Configuration public class SecurityConfig extends WebSecurityConfigurerAdapter {

@Override
protected void configure(HttpSecurity http) throws Exception {
    http.authorizeRequests().
    addFilter(new ApiKeyAuthenticationFilter());
}

**Exploring the Nature of Light

Introduction:

Light is an essential aspect of our universe, influencing everything from the smallest atom to the largest galaxy.

Understanding the nature of light, how it interacts, and its properties are fundamental to many scientific fields, including physics, astronomy, and biology.

**Wave-Particle Duality: The Double Nature of Light

The nature of light has been a subject of much debate and experimentation. It was not until the 20th century that a satisfactory explanation of light emerged - the concept of wave-particle duality.

0wwafa commented 3 days ago

For what it's worth, I have found that Gemma-2-27B quantized to Q6_K often makes mistakes/typos with proper names compared to Gemma-2-9B in Q8_0. I don't think the difference in quantization quality would be so large, but this could be something to watch for.

That's because, as I have been trying to explain for 2 weeks, the quantizing is "wrong". Check my Q5 & Q6 and you will see the difference: https://huggingface.co/ZeroWw/gemma-2-9b-it-GGUF

tristandruyen commented 3 days ago

For what it's worth, I have found that Gemma-2-27B quantized to Q6_K often makes mistakes/typos with proper names compared to Gemma-2-9B in Q8_0. I don't think the difference in quantization quality would be so large, but this could be something to watch for.

That's because, as I have been trying to explain for 2 weeks, the quantizing is "wrong". Check my Q5 & Q6 and you will see the difference: https://huggingface.co/ZeroWw/gemma-2-9b-it-GGUF

Bartowski and others already provide GGUFs with output and embed tensors quantized as f16 as _L variants...

Also I wouldn't call people wrong for providing standard GGUF variants with standard settings. Your GGUFs are basically a new variant. That's why they got a new name in bartowski's repos...

matteoserva commented 3 days ago

From the hf blog.

"Running in float16 may be faster on your hardware, and results should be similar on the 9B model. Do note, however, that the 27B instruction-tuned model produces erratic outputs when using float16: you must use bfloat16 for that model weight."

Could this be relevant? I'm not familiar enough with the llama.cpp codebase to check this myself. The GGUF from Google is in float32 while the HF model is in bf16.

bartowski1182 commented 3 days ago

Honestly @matteoserva you may have a point, but I would hope that it's not relevant if we go bf16 to FP32 to fp16.. could try _XL versions where I leave embed and output at f32 LOL but that better not make any difference, would be pretty weird..

But yeah, if even converting to f32 doesn't work properly, it's a deeper issue. My guess is Google was referring to taking the bf16 weights and running them on the fly as fp16, which could definitely degrade performance in edge cases (I think we saw this in Qwen2?)

oldmanjk commented 3 days ago

"[!WARNING]
Gemma 2 is currently incompatible with Flash Attention/ SDPA, using it might result in unreliable generations. Use at your own risk."

https://huggingface.co/google/gemma-2-27b-it/discussions/17/files

matteoserva commented 3 days ago

@bartowski1182

Bfloat16->float32->float16 is generally an invalid conversion since float16 doesn't have the same range as the other two.

Is there a reason to think that the model weights are in the float16 range even if they are in the bfloat16 format?

qnixsynapse commented 3 days ago

Just to mention here, when I was converting the HF gemma2 to a bf16 gguf, I noticed that the norm tensors were converted to fp16 instead of being copied directly from the HF safetensors, which were in bf16. I found that behaviour quite odd. I even supplied the --outtype bf16 parameter.

ngxson commented 3 days ago

@ngxson This indicates a problem with the tokenizer conversion. I don't fully understand the details to fix it, but a simple observation that I found is using:

This would tokenize correctly the word "discards", but there are other problems with added/special tokens not being added at all. So some fix for the vocabulary conversion is necessary

@ggerganov Simply applying this change, I get perplexity going from 9.5613 down to 7.7898.

My laptop is a potato, so I only tested with 3 chunks of wiki.test.raw; I don't know whether I messed something up or not.


With self._set_vocab_llama_hf()

[1]4.3818,[2]8.5469,[3]9.5613,
Final estimate: PPL = 9.5613 +/- 2.42077

With self._set_vocab_sentencepiece() ==> makes more sense, since gemma 1 uses this

[1]4.4272,[2]8.4867,[3]7.7898,
Final estimate: PPL = 7.7898 +/- 1.78301
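
For readers unfamiliar with the metric: perplexity is conventionally the exponential of the mean negative log-likelihood per token, so lower is better. A minimal sketch of that definition follows. It is not the llama-perplexity implementation, and the log-probabilities are made-up inputs.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the scored tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every token has perplexity 4:
# it is as uncertain as a uniform choice among four options.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```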

arch-btw commented 3 days ago

Feel free to ignore this if it's not relevant but I noticed the json is invalid in the tokenizer.json on one line:

(image: jsongemma2)

The line in question:

(image: jsongemma2_2)

bartowski1182 commented 3 days ago

@matteoserva it's been shown that upcasting to FP32 before going to fp16 maintains a bit more accuracy than doing the conversion directly, but yes, you lose out on some of the range, and if Gemma 2 has a ton of values outside the fp16 range and it's extremely important that they stay different, then I guess that could do it.

Does that really seem likely to be the issue? Especially when quantizing, almost zero and really almost zero are always going to basically be zero.. I'd think it more important to maintain the relationships in the middle of the range rather than the whole range (which probably matters more in training)

I suppose in an ideal world we could keep the embeddings and outputs at bf16, but then we lose GPU support (I think?)

Embeddings at f32 seems like it should be overly excessive for a quantized model, and I'd hope we never need to do that since that would be a huge increase in final size...

Maybe we need to prioritize GPU support of bf16 more, but I'm so far from the expertise required that I'm in no position to push for it lol

Take what I say with a grain of salt please 😅

bartowski1182 commented 3 days ago

@ngxson the problem with sentencepiece is it's not tokenizing the start and end tokens correctly, so it may have better PPL but it produces worse results

There's clearly some middle ground we're missing

matteoserva commented 3 days ago

@bartowski1182

Sorry for asking so many questions but I'm really missing the reason why you assume that converting to float16 is possible at all.

The maximum value of a float16 is 65504. The maximum value of a bfloat16 is about 3.4 × 10^38. The maximum value of a float32 is also about 3.4 × 10^38.

I also expect most of the original weights to be greater than 65k since putting a constraint on their value would waste 20% of the bits of a bfloat16 value.

Is there some sort of quantization applied when converting gemma from bfloat16 to float32 to float16? In other words, how are you compressing a number from the range ±3.4 × 10^38 to another format whose range is ±65504? A naive division is not possible.

I suppose that models released directly in float32 format have the additional constraint that their weights are in a small range around 0, that's why the conversion to float16 is possible. Gemma2 was instead released in bfloat16 format which doesn't allow a trivial conversion to float16.
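
The range figures under discussion can be checked with nothing but Python's `struct` module. The sketch below decodes the largest finite bit pattern of each format; the helper names are made up for illustration. Note that the largest finite float16 is 65504, slightly below 65535.

```python
import struct


def f16_from_bits(bits: int) -> float:
    """Decode a 16-bit pattern as IEEE 754 half precision (float16)."""
    return struct.unpack(">e", struct.pack(">H", bits))[0]


def bf16_from_bits(bits: int) -> float:
    """Decode a 16-bit pattern as bfloat16, i.e. the top 16 bits of a float32."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]


def f32_from_bits(bits: int) -> float:
    """Decode a 32-bit pattern as IEEE 754 single precision (float32)."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


F16_MAX = f16_from_bits(0x7BFF)        # largest finite float16
BF16_MAX = bf16_from_bits(0x7F7F)      # largest finite bfloat16
F32_MAX = f32_from_bits(0x7F7FFFFF)    # largest finite float32

print(f"float16  max: {F16_MAX}")
print(f"bfloat16 max: {BF16_MAX:.4e}")
print(f"float32  max: {F32_MAX:.4e}")
```

bfloat16 shares float32's 8-bit exponent, which is why their maxima are nearly identical, while float16 spends more bits on the mantissa and fewer on the exponent.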

steampunque commented 3 days ago

I ran some bench suites on my own Q6_K non-imatrix quant and the 9b model is doing well on benchmarks. It hits 0.902 on GSM8K which is the highest I have seen on any model I have ever run and it averaged 0.653 on BBH which is quite good. My benches are different from the standard evaluation harness. For MC I require match on a doublecheck question where I circular shift all the answers 1 letter to make sure the model follows the right answer and I also use custom prompted CoT where necessary (MCs which require thinking, GSM8K, etc.) . I also zero shot everything except for a couple 3 shots for BBH categories (dyck languages and word ordering).
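
The circular-shift double check described above might look something like this. It is a guess at the general idea, not steampunque's actual harness; `rotate_choices` and `passes_doublecheck` are hypothetical names.

```python
def rotate_choices(choices, correct_index, shift=1):
    """Move every answer `shift` positions to the right (wrapping around).

    Returns the reordered list and the new index of the correct answer.
    """
    n = len(choices)
    rotated = [choices[(i - shift) % n] for i in range(n)]
    return rotated, (correct_index + shift) % n


def passes_doublecheck(first_pick, second_pick, correct_index, n_choices, shift=1):
    """True only if the model picked the right answer in both orderings."""
    return (first_pick == correct_index
            and second_pick == (correct_index + shift) % n_choices)


choices = ["Paris", "Rome", "Berlin", "Madrid"]
rotated, new_index = rotate_choices(choices, correct_index=0)
print(rotated, new_index)
```

A model that always answers "A" regardless of content passes the first question by luck but fails the shifted one, which is the point of the check.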

This quant was generated prior to the sliding attention patch, but that shouldn't make a difference since I limit CoT to 2500 tokens.

bench_gemma-2-9b-it.json

ngxson commented 3 days ago

the problem with sentencepiece is it's not tokenizing the start and end tokens correctly, so it may have better PPL but it produces worse results

@bartowski1182 FYI, I made a quick hack to support special tokens (including the ones used for the chat template): https://github.com/ggerganov/llama.cpp/pull/8244

oldmanjk commented 3 days ago

@matteoserva it's been shown that upcasting to FP32 before going to fp16 maintains a bit more accuracy than doing the conversion directly, but yes, you lose out on some of the range, and if Gemma 2 has a ton of values outside the fp16 range and it's extremely important that they stay different, then I guess that could do it.

Does that really seem likely to be the issue? Especially when quantizing, almost zero and really almost zero are always going to basically be zero.. I'd think it more important to maintain the relationships in the middle of the range rather than the whole range (which probably matters more in training)

I suppose in an ideal world we could keep the embeddings and outputs at bf16, but then we lose GPU support (I think?)

Embeddings at f32 seems like it should be overly excessive for a quantized model, and I'd hope we never need to do that since that would be a huge increase in final size...

Maybe we need to prioritize GPU support of bf16 more, but I'm so far from the expertise required that I'm in no position to push for it lol

Take what I say with a grain of salt please 😅

No, you're absolutely right. bf16 cuda support in llama.cpp should have been prioritized a long time ago, as many of us have been saying (and no, we, the users, the non-devs, can't just do it ourselves)

matteoserva commented 3 days ago

Sorry if it's a dumb question: Is cuda bfloat16 support really necessary right now? If quantization is done on CPU, then the inference can be done in the quantized format without using bfloat16 values.

bartowski1182 commented 3 days ago

@matteoserva so maybe the issue is that i'm being naive in assuming how the conversion is handled...

Taking a very simple case of trying to convert a range of 0-100 to a range of 0-20, you wouldn't just say "okay all values greater than 20 are now just called 20"

You'd do something more clever, like 100 = 20, 90 = 18, 75 = 15 etc and then you could use a scaling factor, similar to how normal llama.cpp quants work, but maybe i'm way off base and it's actually as silly as throwing away everything that was greater than what f16 can express..
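
The scaling idea described above can be sketched as follows. This is a simplified absmax scheme in the spirit of block quantization (one scale per block, 8-bit integers), not llama.cpp's actual quantization code; the function names are hypothetical.

```python
def quantize_block(values):
    """Map a block of floats to int8-range integers plus one shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 0.0
    inverse = 1.0 / scale if scale > 0 else 0.0
    quantized = [round(v * inverse) for v in values]
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate floats from the integers and the shared scale."""
    return [scale * q for q in quantized]


block = [100.0, 90.0, 75.0, -20.0]
scale, quantized = quantize_block(block)
restored = dequantize_block(scale, quantized)
print(quantized)
print([round(r, 2) for r in restored])
```

Because the scale is derived from the largest magnitude in the block, nothing is clamped: large values are preserved relative to each other, at the cost of coarser steps for the small ones.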

I am also basing some of my assumptions on findings like this: https://github.com/ggerganov/llama.cpp/pull/7150#issuecomment-2101575393

How can fp16 and bf16 be that similar if bf16 represents such an astronomically different range? Is it really just that most of the time when the value is above 65504 it just doesn't matter much to the final result?

bartowski1182 commented 3 days ago

@matteoserva regarding the bf16 question, I'm referring to leaving the embedding and output layers as bf16 (instead of quantizing them) to save quality, since i've been making them fp16 but may currently be learning that that doesn't do as much as I hoped. I am also making the assumption that if ANY weights are bf16, CUDA will fail (this seems likely but i would need to test)

MoonRide303 commented 3 days ago

@matteoserva regarding the bf16 question, I'm referring to leaving the embedding and output layers as bf16 (instead of quantizing them) to save quality, since i've been making them fp16 but may currently be learning that that doesn't do as much as I hoped. I am also making the assumption that if ANY weights are bf16, CUDA will fail (this seems likely but i would need to test)

That might depend on GPU - RTX 3000 and newer should support bf16.

bartowski1182 commented 3 days ago

Not in ggml CUDA they don't, there's a PR open to add it: https://github.com/ggerganov/llama.cpp/pull/7488

bartowski1182 commented 3 days ago

@matteoserva after digging far, you're definitely correct, it's just very naively clamping the values.. Which almost makes me wonder if it's actually BETTER to quantize to Q8 instead of f16, since at least that will attempt to maintain the ranges with scaling factors right? Like ideally if you want max quality you'd do f32, if GGML CUDA supported bf16 you'd use that, but it seems to me f16 must be worse than Q8... I'm amazed f16 works at all...

slaren commented 3 days ago

Model weights are typically normalized to a range of -1 to 1. When models need higher precision it is usually because they generate activations above the range of float16, not because the weights themselves are outside of the range. I don't think there are any models with weights above the range of a float16 outside of maybe poorly made finetunes or merges.
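
One way to check this claim without any model files is to enumerate every bfloat16 bit pattern and round-trip it through float16. The sketch below uses only Python's `struct` module; the helper names are made up for illustration. bfloat16 keeps 8 significant bits and float16 keeps 11, so any bfloat16 value whose magnitude lies between 2^-17 and 65504 should convert exactly, which covers weights in the -1 to 1 range except those extremely close to zero.

```python
import math
import struct


def bf16_from_bits(bits: int) -> float:
    """Decode a 16-bit pattern as bfloat16 (the top 16 bits of a float32)."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]


def roundtrip_f16(x: float):
    """Round x to float16 and back; return None if x overflows float16."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except (OverflowError, struct.error):
        return None


def count_exact(lo: float, hi: float):
    """Count bfloat16 values with lo <= |x| <= hi that survive float16 unchanged."""
    exact = total = 0
    for bits in range(0x10000):
        x = bf16_from_bits(bits)
        if math.isnan(x) or math.isinf(x) or not (lo <= abs(x) <= hi):
            continue
        total += 1
        if roundtrip_f16(x) == x:
            exact += 1
    return exact, total


print("typical weight range:", count_exact(2.0 ** -17, 1.0))
print("beyond float16 range:", count_exact(1e5, 1e38))
```

Under these assumptions the conversion only loses information for values far larger than typical weights, or for values so close to zero that float16 cannot hold all of their bits.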

0wwafa commented 3 days ago

@matteoserva after digging far, you're definitely correct, it's just very naively clamping the values.. Which almost makes me wonder if it's actually BETTER to quantize to Q8 instead of f16, since at least that will attempt to maintain the ranges with scaling factors right? Like ideally if you want max quality you'd do f32, if GGML CUDA supported bf16 you'd use that, but it seems to me f16 must be worse than Q8... I'm amazed f16 works at all...

that's why I quantize output and embed to f16 and the other tensors to q6_k or q5_k. check my quants and you'll see.

bartowski1182 commented 3 days ago

ahhhhhhh yes right, that makes a LOT more sense.. so it's probably the range closer to 0 that's more relevant.. Still makes me wonder if avoiding f16 is better, and that Q8 represents the range more accurately

Yes @0wwafa I'm aware and I've been making quants with the f16 embeddings. We've talked about this many times, and I still have massive doubts that it's improving anything. Maybe if we were keeping it at bf16 or f32, but f16 seems like it's discarding way too many values to be useful.

oldmanjk commented 3 days ago

Sorry if it's a dumb question: Is cuda bfloat16 support really necessary right now? If quantization is done on CPU, then the inference can be done in the quantized format without using bfloat16 values.

No worries. Not a dumb question. bf16 CUDA support would help immensely during the generation of imatrices, which is done on GPU. As it is, I have to upconvert bf16 to f32 (doubling the model's size) and generate the imatrix from that. This doubles the storage required, adds significant wear and tear on expensive NVMe drives, and at least doubles the amount of time and energy required. Generating high-quality imatrices on large models takes days (and I'm not even using large datasets). It would also be useful when you just want to run inference on bf16 models directly with basically no conversion. So, to answer your question, it's not necessary, no. None of this is. But I think it deserves to be prioritized. At the very least, I think it would be wise to stop assuming these lossy conversions are insignificant. I don't think we understand these things well enough yet.

matteoserva commented 3 days ago

I managed to get the correct results. Now the output matches exactly all the other implementations.

Step to reproduce:

That way you never make the bfloat->float16 conversion.

oldmanjk commented 3 days ago

Feel free to ignore this if it's not relevant but I noticed the json is invalid in the tokenizer.json on one line:

(image: jsongemma2)

The line in question:

(image: jsongemma2_2)

This seems relevant to me...is it not? I made a post on your behalf on huggingface. I hope that's okay. If not, let me know

bfroemel commented 2 days ago

Here are some perplexity numbers (wikitext-2-raw/wiki.test.raw, 8k context size) on different conversions of https://huggingface.co/google/gemma-2-27b-it/tree/main

bf16(hf)->f16

Final estimate: PPL = 5.8665 +/- 0.03709

bf16(hf)->bf16(gguf)->q8_0

Final estimate: PPL = 5.8710 +/- 0.03712

bf16(hf)->f16(gguf)->q8_0

Final estimate: PPL = 5.8710 +/- 0.03712

./llama-perplexity -f ./wiki.test.raw -m /models/model.gguf -ngl 99 -c 8192

/edit: ok, no surprise there - the perplexity numbers for both q8_0 versions are the same, because I ended up with the same files:

md5sum ./model-q8_0.bin ./model-q8_0_v2.bin
5262f0fd711ab1af043c35337b0eff3e  ./model-q8_0.bin
5262f0fd711ab1af043c35337b0eff3e  ./model-q8_0_v2.bin

matteoserva commented 2 days ago

@bfroemel Thanks for testing. Could you try this one? (I don't have a pc available right now)

Matteo has 20 apples, he buys 20 oranges. Then he discards half of his fruits equally. Then he discards a quarter of his fruits equally between apples and oranges. How many apples remain?

The model should answer 7 or 8.

bfroemel commented 2 days ago

This one already passed even with the wrong tokenizer conversion, where it ended up with 7 apples. Now with the tokenizer conversion fix, the output is very similar to aistudio's, and may even be identical (at temperature 0):

$ ./llama-cli -m /models/model-q8_0_v2.bin  --temp 0 --top-p 0.95 -c 8192 -p "<start_of_turn>user\nMatteo has 20 apples, he buys 20 oranges. Then he discards half of his fruits equally. Then he discards a quarter of his fruits equally between apples and oranges. How many apples remain?<end_of_turn>\n<start_of_turn>model\n" --verbose-prompt -ngl 99
user
Matteo has 20 apples, he buys 20 oranges. Then he discards half of his fruits equally. Then he discards a quarter of his fruits equally between apples and oranges. How many apples remain?
model
Here's how to solve the problem step-by-step:

1. **Total Fruits:** Matteo starts with 20 apples + 20 oranges = 40 fruits.

2. **First Discard:** He discards half, which is 40 fruits / 2 = 20 fruits. This leaves him with 40 fruits - 20 fruits = 20 fruits.

3. **Fruits After First Discard:** He now has 10 apples and 10 oranges.

4. **Second Discard:** He discards a quarter of his fruits, which is 20 fruits / 4 = 5 fruits.

5. **Final Apple Count:** Since he discards 5 fruits equally between apples and oranges, he loses 5 fruits / 2 = 2.5 apples. Since you can't have half an apple, we'll round down. This leaves him with 10 apples - 2 apples = 8 apples.

**Answer:** Matteo has 8 apples remaining.
 [end of text]

matteoserva commented 2 days ago

@bfroemel These results are exactly what I expected. I can't test this until this evening (UTC) but I'm really happy!

bartowski1182 commented 2 days ago

Can you confirm (or I will when I'm at my computer) that saving the embed and output weights in bf16 still works with CUDA offloading?

ngxson commented 2 days ago

@oldmanjk I have no problem parsing tokenizer.json with Python's json.loads. Maybe that's an IDE problem.

bfroemel commented 2 days ago

@bartowski1182 You mean something like this?

./llama-quantize --token-embedding-type bf16 --output-tensor-type bf16 /models/model-bf16.bin /models/model-q8_0_v3.bin Q8_0

Unfortunately, this already requires bf16 CUDA support :/

ggml_cuda_init: found 5 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 3: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
  Device 4: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    1.36 MiB
llm_load_tensors: offloading 46 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 47/47 layers to GPU
llm_load_tensors:        CPU buffer size =  2250.00 MiB
llm_load_tensors:      CUDA0 buffer size =  7459.66 MiB
llm_load_tensors:      CUDA1 buffer size =  6885.84 MiB
llm_load_tensors:      CUDA2 buffer size =  7459.66 MiB
llm_load_tensors:      CUDA3 buffer size =  2869.10 MiB
llm_load_tensors:      CUDA4 buffer size =  3971.48 MiB
........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   832.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   768.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =   832.00 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =   320.00 MiB
llama_kv_cache_init:      CUDA4 KV buffer size =   192.00 MiB
llama_new_context_with_model: KV self size  = 2944.00 MiB, K (f16): 1472.00 MiB, V (f16): 1472.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     3.91 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   710.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   710.01 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =   710.01 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =   710.01 MiB
llama_new_context_with_model:      CUDA4 compute buffer size =   710.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   137.02 MiB
llama_new_context_with_model: graph nodes  = 1850
llama_new_context_with_model: graph splits = 6
GGML_ASSERT: ggml/src/ggml-cuda.cu:1257: to_fp32_cuda != nullptr
Aborted

bartowski1182 commented 2 days ago

That's exactly what I was looking for, thank you for confirming @bfroemel

bfroemel commented 2 days ago

Does anyone know whether there is a quick way to run the same perplexity benchmark (as llama-perplexity with wikitext-2-raw/wiki.test.raw) on aistudio/API? It would be interesting to have some measure of how far we still are from the model running under reference conditions.

bartowski1182 commented 2 days ago

seems unlikely sadly, best bet would be attempting a 0 temperature side by side?