ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Investigate gemma 2 generation quality #8240

Open ngxson opened 3 months ago

ngxson commented 3 months ago

Initial reports can be seen from https://github.com/ggerganov/llama.cpp/pull/8227

[!IMPORTANT]
A note for everyone: if you think there's a bug in llama.cpp tokenizer, please make sure to test with HF transformers library first (see this comment for example)
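
For example, a minimal check along these lines (the model id and test string are placeholders):

from transformers import AutoTokenizer

# Reference tokenization from HF transformers; compare the ids against
# what llama.cpp produces (e.g. via the server's /tokenize endpoint).
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
text = "some text that llama.cpp appears to tokenize oddly"  # placeholder
print(tokenizer(text, add_special_tokens=False)["input_ids"])
# If transformers yields the same ids, the behaviour comes from the
# reference tokenizer itself and is not a llama.cpp bug.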

oldgithubman commented 3 months ago

@oldmanjk I have no problem parsing tokenizer.json with python json.loads. Maybe that's an IDE problem

I haven't tried parsing it myself, but the file definitely has an unexpected line break and I'm not sure what's supposed to be there

bfroemel commented 3 months ago

After playing a bit, I am not able to pick up any glaring quality degradations between aistudio (27b-it, bf16) and llama.cpp (27b-it, f16 and q8_0) anymore. Sure, there are differences in the output. For example, questions like "Q: Translate the following sentence to French: 'The cat is sleeping on the mat.'" or "Q: Concisely explain the concept of polymorphism in object-oriented programming." do not produce word-for-word identical output, but the output is essentially equivalent. Maybe the aistudio output is more verbose, often offering an additional explanation (maybe it's not even the release version of Gemma-2?).

Regarding bf16, here is the perplexity benchmark (run on CPU only). The difference compared to f16 is very small.

bf16

perplexity: 1088.88 seconds per pass - ETA 10 hours 35.17 minutes
[1]7.7094,[2]5.2276,[3]5.3925,[4]5.6868,[5]5.4701,[6]5.7165,[7]6.2271,[8]5.9412,[9]5.8135,[10]5.4374,[11]5.7804,[12]5.9094,[13]5.7946,[14]5.6367,[15]5.7082,[16]5.6361,[17]5.5618,[18]5.6333,[19]5.6618,[20]5.7236,[21]5.6956,[22]5.7388,[23]5.8719,[24]5.9153,[25]5.9166,[26]5.9503,[27]5.9461,[28]5.9267,[29]5.9074,[30]5.8683,[31]5.8762,[32]5.8567,[33]5.8881,[34]5.8404,[35]5.8649,
Final estimate: PPL = 5.8649 +/- 0.03706

f16

perplexity: 23.35 seconds per pass - ETA 13.62 minutes
[1]7.7074,[2]5.2283,[3]5.3945,[4]5.6882,[5]5.4710,[6]5.7171,[7]6.2283,[8]5.9424,[9]5.8150,[10]5.4388,[11]5.7821,[12]5.9112,[13]5.7966,[14]5.6388,[15]5.7105,[16]5.6385,[17]5.5642,[18]5.6354,[19]5.6638,[20]5.7254,[21]5.6974,[22]5.7404,[23]5.8738,[24]5.9174,[25]5.9187,[26]5.9524,
[27]5.9482,[28]5.9288,[29]5.9094,[30]5.8703,[31]5.8780,[32]5.8585,[33]5.8898,[34]5.8421,[35]5.8665,
Final estimate: PPL = 5.8665 +/- 0.03709
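
A quick sanity check on the two final estimates above:

# bf16 vs f16 final perplexity estimates quoted above.
ppl_bf16, err_bf16 = 5.8649, 0.03706
ppl_f16,  err_f16  = 5.8665, 0.03709

print(f"difference: {ppl_f16 - ppl_bf16:.4f}")   # 0.0016
print(f"std errors: {err_bf16:.4f}, {err_f16:.4f}")
# The gap is over 20x smaller than either standard error, so bf16 and
# f16 are statistically indistinguishable on this benchmark.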

I am very happy - thanks everyone! :)

bfroemel commented 3 months ago

@oldmanjk Also not seeing any problem with https://huggingface.co/google/gemma-2-27b-it/blob/main/tokenizer.json . There is no line break, just some uncommon symbols. In vim it looks like this: a

Could it be that the screenshotted copy of tokenizer.json (https://github.com/ggerganov/llama.cpp/issues/8240#issuecomment-2200961932) was somehow corrupted? Are there different versions of it released?

EliEron commented 3 months ago

@oldmanjk I have no problem parsing tokenizer.json with python json.loads. Maybe that's an IDE problem

I haven't tried parsing it myself, but the file definitely has an unexpected line break and I'm not sure what's supposed to be there

No, it's not a line break, though it's related. If you look at it in a hex editor you can see that it contains E2 80 A8, which translates to U+2028.

U+2028 is the code point for Line Separator, an old symbol that was intended to be used as a universal line break marker, but in practice it's quite hit and miss in terms of which systems recognize it as such.

JSON does not consider it a line separator, which is why it's valid in JSON, but if it is parsed using different rules it might be considered invalid. JavaScript, for instance, used to treat it as a new line and thus did not accept it in a string, though that was changed a couple of years ago specifically to attain consistency with JSON.
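
A minimal way to see both points at once (the path is a placeholder for a local copy of tokenizer.json):

import json

raw = open("tokenizer.json", encoding="utf-8").read()  # placeholder path
print("\u2028" in raw)                  # True if the file contains U+2028
print("\u2028".encode("utf-8").hex())   # 'e280a8', the bytes seen in a hex editor
data = json.loads(raw)                  # parses fine: JSON allows U+2028 inside strings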

oldgithubman commented 3 months ago

@oldmanjk I have no problem parsing tokenizer.json with python json.loads. Maybe that's an IDE problem

I haven't tried parsing it myself, but the file definitely has an unexpected line break and I'm not sure what's supposed to be there

No, it's not a line break, though it's related. If you look at it in a hex editor you can see that it contains E2 80 A8, which translates to U+2028.

U+2028 is the code point for Line Separator, an old symbol that was intended to be used as a universal line break marker, but in practice it's quite hit and miss in terms of which systems recognize it as such.

JSON does not consider it a line separator, which is why it's valid in JSON, but if it is parsed using different rules it might be considered invalid. JavaScript, for instance, used to treat it as a new line and thus did not accept it in a string, though that was changed a couple of years ago specifically to attain consistency with JSON.

This makes sense. Thank you. I'll update my post on huggingface with this explanation credited to you, if you don't mind

oldgithubman commented 3 months ago

@oldmanjk Also not seeing any problem with https://huggingface.co/google/gemma-2-27b-it/blob/main/tokenizer.json . There is no line break, just some uncommon symbols. In vim it looks like this: a

Could it be that the screenshotted copy of tokenizer.json (#8240 (comment)) was somehow corrupted? Are there different versions of it released?

As @EliEron explained, there's supposed to be a "line separator" there, so I assume @arch-btw's and my text editors are displaying it correctly (mine shows the same as his - I'm using xed). I'm surprised vim doesn't show it. Are you using Windows? I'm using Linux. I doubt it's corrupted, because I downloaded several different copies from different sources and they're all the same (plus @EliEron's explanation makes sense).

Edit - I just installed vim and checked. This is what it shows for me (and seems to be the most sensible representation so far): [screenshot] (The symbol is partially obscured by the double quote, only fully revealing itself when you put the cursor over it.) I've been thinking about learning vim. Is it worth it?

oldgithubman commented 3 months ago

Readme update:

[!IMPORTANT]
Given the model instabilities with SDPA/ FA2, by default, the model inference would utilise eager attention.

https://huggingface.co/google/gemma-2-27b-it/commit/2d74922e8a2961565b71fd5373081e9ecbf99c08

Does anyone know if this update requires requanting? https://huggingface.co/google/gemma-2-27b-it/commit/8a03e86ec981364ed298b84ca247373f94f4ad5f

ngxson commented 3 months ago

Readme update:

Important

Given the model instabilities with SDPA/ FA2, by default, the model inference would utilise eager attention.

https://huggingface.co/google/gemma-2-27b-it/commit/2d74922e8a2961565b71fd5373081e9ecbf99c08

Does anyone know if this update requires requanting? https://huggingface.co/google/gemma-2-27b-it/commit/8a03e86ec981364ed298b84ca247373f94f4ad5f

No. That's a config for the inference runtime (PyTorch); it's not applied to llama.cpp
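
(For completeness, a rough sketch of what that readme note corresponds to on the transformers side; llama.cpp has its own attention implementation and ignores this setting entirely:)

from transformers import AutoModelForCausalLM

# The readme change only affects the HF/PyTorch inference path: it makes
# transformers fall back to eager attention instead of SDPA/FlashAttention-2.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    attn_implementation="eager",
)
# Nothing here touches the weights, so existing GGUF quants remain valid.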

arch-btw commented 3 months ago

This seems relevant to me...is it not? I made a post on your behalf on huggingface. I hope that's okay. If not, let me know

Totally okay @oldmanjk! Thank you for looking into this 👍

I should have added that it was for the 9b instruct model. Though, maybe they use the same tokenizer.json. Also, I'm using Kate editor with the default settings.

Here is the hash:

gemma-2-9b-it/tokenizer.json

SHA256:

727dd643e323b88489ef612616d6fa34608202e2b8a8ef286115b8150a8334d0

oldgithubman commented 3 months ago

Totally okay @oldmanjk! Thank you for looking into this 👍

You're welcome! Thanks for the thanks.

I should have added that it was for the 9b instruct model. Though, maybe they use the same tokenizer.json.

They use the same tokenizer.json, yes.

Also, I'm using Kate editor with the default settings.

Every editor seems to produce different results. IMO, the most logical result came from my vim instance on linux.

Here is the hash:

gemma-2-9b-it/tokenizer.json

SHA256:

727dd643e323b88489ef612616d6fa34608202e2b8a8ef286115b8150a8334d0

That's weird. sha256sum gives me:

"7da53ca29fb16f6b2489482fc0bc6a394162cdab14d12764a1755ebc583fea79"

on each version, which matches Hugging Face. Your copy is different from the official one
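
(In case anyone else wants to check their local copy, a minimal sketch; the path is a placeholder:)

import hashlib

# Compare the SHA256 of a local tokenizer.json against the value shown
# on the Hugging Face file page.
with open("gemma-2-9b-it/tokenizer.json", "rb") as f:  # placeholder path
    print(hashlib.sha256(f.read()).hexdigest())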

bfroemel commented 3 months ago

I started MMLU-Pro benchmarks (with https://github.com/chigkim/Ollama-MMLU-Pro) on gemma-2-9b-it (q8_0, with imatrix, from https://huggingface.co/bartowski/gemma-2-9b-it-GGUF) and gemma-2-27b-it (q8_0, my own quant without imatrix). At first, I couldn't believe that in the biology category the smaller model achieved an approx. 10% better result than the larger one, so I also checked against a version of 27b-it q8_0 with imatrix (from https://huggingface.co/bartowski/gemma-2-27b-it-GGUF) and ended with a similar result.

When checking the detailed logs, I saw that the 27b model gave the correct answer, but not in the expected format, for example:

The correct answer is **J. The various land ...

while it should have used:

The correct answer is (J). ...

Possibly it's just a prompting issue, and there may be much better results if the benchmark prompt were embedded in some context-forming template (e.g., "This is a computer-parsed exam, provide your answer in the exact same format as indicated in the example question-answer pairs.").
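
Presumably the harness pulls out the answer letter with a pattern roughly like the sketch below (my guess, not the actual Ollama-MMLU-Pro code), which is why a correct answer in the wrong format still counts as a miss:

import re

# Rough guess at the kind of answer-extraction pattern an MMLU-Pro style
# harness uses; the real Ollama-MMLU-Pro code may differ.
ANSWER_RE = re.compile(r"answer is \(?([A-J])\)?", re.IGNORECASE)

print(ANSWER_RE.search("The answer is (J)."))                                # match -> scored
print(ANSWER_RE.search("The correct answer is **J. The various land ..."))   # None -> counted wrong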

/edit: Also note that two runs of the same category benchmark led to results that varied by approx. 2%, even when run with the exact same model and parameters.


category    | 9b-it-q8_0-i | 27b-it-q8_0 | 27b-it-q8_0-i
--------------------------------------------------------
business    |    0.4968    |   0.5158    |     n/a
law         |    0.3243    |   0.3869    |     n/a
psychology  |    0.6178    |   0.6629    |     n/a
biology     |    0.6165    |   0.5063    |   0.4881
chemistry   |    0.3844    |   0.4161    |     n/a
history     |    0.5066    |   0.5486    |     n/a
other       |    0.5260    |   0.5801    |     n/a
health      |    0.4389    |    n/a      |     n/a
economics   |    0.5687    |    n/a      |     n/a
math        |    0.4012    |    n/a      |     n/a
physics     |    0.4211    |    n/a      |     n/a
computer sc.|    0.2976    |    n/a      |     n/a
philosophy  |    0.3928    |    n/a      |     n/a
engineering |    0.2206    |   0.2415    |     n/a

For 27b I don't have results for all categories, because I had to stop after 24h of heating up my cellar ;) Anyway, MMLU-Pro and these results should be enough to get a more robust indication of the current model generation quality, especially when compared to official numbers, as soon as there are any (e.g., https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro ), or to different quants / improved llama.cpp versions.

Dampfinchen commented 3 months ago

Formatting is a serious issue with the model. It really isn't able to predict the correct formatting from previous responses at all in my use case. It has serious trouble with certain RP formatting styles (like putting a space after asterisks, wrong usage of quotes, or substituting quotes for asterisks). It's really strange and I wonder whether this is a llama.cpp issue or not. I have noticed similar behavior with L3 8B, but it's much, much better there.

bfroemel commented 3 months ago

Below you find the complete prompt and model output. Note that in this MMLU-Pro prompt the model is presented with 5 example question-answer pairs and is supposed to complete the 6th question-answer pair after: "Answer: Let's think step by step.".

If I run this on aistudio (temperature=1.0), I also get formatting issues (three times in a row). If I precede the whole prompt with "This is a computer-parsed exam, provide your answer in the exact same format as indicated in the example question-answer pairs.\n\n" both aistudio and llama.cpp (27b, q8_0) are able to format the answer correctly (three times in a row).

Hence, I currently don't think this is a llama.cpp or bf16/numerical issue. Both llama.cpp and aistudio produce very similar results. At least based on this one prompt, there is no quality difference between reference and a q8_0 quant of the model running on llama.cpp.

Question: Which of the following represents an accurate statement concerning arthropods?
Options: A. They possess an exoskeleton composed primarily of peptidoglycan.
B. They possess an open circulatory system with a dorsal heart.
C. They are members of a biologically unsuccessful phylum incapable of exploiting diverse habitats and nutrition sources.
D. They lack paired, jointed appendages.
Answer: Let's think step by step. Peptidoglycan is known to comprise the plasma membrane of most bacteria, rather than the exoskeleton of arthropods, which is made of chitin, which rules out (A). The answer (C) is false because arthropods are a highly successful phylum. Likewise, arthropods have paired, jointed appendages, which rules out (D). The only remaining option is (B), as arthropods have an open circulatory system with a dorsal tubular heart. The answer is (B).

Question: In a given population, 1 out of every 400 people has a cancer caused by a completely recessive allele, b. Assuming the population is in Hardy-Weinberg equilibrium, which of the following is the expected proportion of individuals who carry the b allele but are not expected to develop the cancer?
Options: A. 19/400
B. 1/400
C. 40/400
D. 38/400
E. 2/400
F. 1/200
G. 20/400
H. 50/400
Answer: Let's think step by step. According to the Hardy Weinberg Law, $p^2 + 2 p q + q^2 = 1$, and $p + q = 1$ where $p$ is the frequency of the dominant allele, $q$ is the frequency of the recessive allele, and $p^2$, $q^2$, and $2pq$ are the frequencies of dominant homozygous, recessive homozygous, and heterozygous individuals, respectively. The frequency of the recessive allele (q) is $\sqrt{\frac{1}{400}} = 0.05$. We have $p = 1 - q = 0.95$. The frequency of heterozygous individuals is $2pq = 2 \cdot 0.05 \cdot 0.95 = 0.095$. The number of heterozygous individuals is equal to the frequency of heterozygous individuals times the size of the population, or $0.095 * 400 = 38$. So we end up with 38/400. The answer is (D).

Question: A mutation in a bacterial enzyme changed a previously polar amino acid into a nonpolar amino acid. This amino acid was located at a site distant from the enzyme's active site. How might this mutation alter the enzyme's substrate specificity?
Options: A. By changing the enzyme's pH optimum
B. By changing the enzyme's molecular weight
C. An amino acid change away from the active site increases the enzyme's substrate specificity.
D. By changing the shape of the protein
E. By changing the enzyme's temperature optimum
F. By altering the enzyme's ability to be denatured
G. By changing the enzyme's location in the cell
H. By changing the enzyme's color
I. An amino acid change away from the active site cannot alter the enzyme's substrate specificity.
J. By altering the enzyme's rate of reaction
Answer: Let's think step by step. A change in an amino acid leads to a change in the primary structure of the protein. A change in the primary structure may lead to a change in the secondary and the tertiary structure of the protein. A change in the tertiary structure means a change in the shape of the protein, so (C) has to be correct. Since the change does not affect the active site of the enzyme, we do not expect the activity of the enzyme to be affected. The answer is (D).

Question: Which of the following is not a way to form recombinant DNA?
Options: A. Translation
B. Conjugation
C. Specialized transduction
D. Transformation
Answer: Let's think step by step. The introduction of foreign DNA or RNA into bacteria or eukaryotic cells is a common technique in molecular biology and scientific research. There are multiple ways foreign DNA can be introduced into cells including transformation, transduction, conjugation, and transfection. In contrast, (A) is not a way to form DNA: during translation the ribosomes synthesize proteins from RNA. The answer is (A).

Question: Which of the following is not known to be involved in the control of cell division?
Options: A. Microtubules
B. Checkpoints
C. DNA polymerase
D. Centrosomes
E. Cyclins
F. Mitochondria
G. Protein kinases
H. Fibroblast cells
Answer: Let's think step by step. Normal cells move through the cell cycle in a regulated way. At the checkpoint stage, they use information about their own internal state and cues from the environment around them to decide whether to proceed with cell division. Cues like these act by changing the activity of core cell cycle regulators inside the cell. The most common regulators are cyclins and cyclin-dependent kinases. Fibroblast cells do not play any role in cell division. The answer is (H).

Question: Describe the various land biomes that are usually encounteredby a traveler going from the equator to the arcticpolar ice cap.
Options: A. Tropical rain forests, savannas, and tundra
B. Tropical rain forests, deserts, and polar ice caps only
C. Tundra, desert regions, and tropical rain forests
D. Taiga biome, grasslands, and deserts
E. Deciduous forests, tundra, and savannas
F. Deciduous forests, grasslands, and tropical savannas
G. Savannas, taiga biome, and polar ice caps
H. Chaparral, wetlands, and taiga biome
I. Tropical rain forests, montane forests, and arctic tundra
J. The various land biomes encountered while traveling from the equator to the arctic polar ice cap are tropical rain forests, grassland biomes, deciduous forests, taiga biome, tundra, and desert regions.
Answer: Let's think step by step.

The correct answer is **J. The various land biomes encountered while traveling from the equator to the arctic polar ice cap are tropical rain forests, grassland biomes, deciduous forests, taiga biome, tundra, and desert regions.**

Here's why:

* **Tropical Rainforests:** Found near the equator, characterized by high rainfall and biodiversity.
* **Grassland Biomes:**  Occur in regions with moderate rainfall, dominated by grasses and few trees.
* **Deciduous Forests:**  Experience distinct seasons with warm summers and cold winters, trees lose their leaves in the fall.
* **Taiga Biome:**  Also known as boreal forests, found in high-latitude regions with long, cold winters and coniferous trees.
* **Tundra:**  Characterized by permafrost, extremely cold temperatures, and low-growing vegetation.
* **Desert Regions:**  Receive very little rainfall, with extreme temperatures and sparse vegetation. 

Let me know if you'd like more detail on any specific biome!

bfroemel commented 3 months ago

Off-topic for this issue, but in case someone is disappointed by the benchmark results: don't be! It's indeed just prompting. In the meantime I have rerun the biology category of the MMLU-Pro benchmark with a static customized prompt intro, which puts, for example, the 27b model in this category right between Llama3-70B-instruct and Qwen2-72B-32K (on https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro).

category    | 9b-it-q8_0-i | 27b-it-q8_0-i
--------------------------------------------
biology     |    0.7671    |   0.7824

arch-btw commented 3 months ago

@oldmanjk

Your copy is different from official

Great catch!

And apologies, I had accidentally saved the file while I was trying to investigate what that line was all about, and it changed the SHA. I downloaded a fresh copy and it matches your SHA now.

It does still display that line the same strange way as before though, and different in every text editor indeed!

Very strange, but maybe it's not a problem after reading the comments in this thread.

oldgithubman commented 3 months ago

@oldmanjk

Your copy is different from official

Great catch!

And apologies, I had accidentally saved the file while I was trying to investigate what that line was all about, and it changed the SHA. I downloaded a fresh copy and it matches your SHA now.

It does still display that line the same strange way as before though, and different in every text editor indeed!

Very strange, but maybe it's not a problem after reading the comments in this thread.

Yeah, I think we've confirmed it's not a problem at this point

Rotatingxenomorph commented 3 months ago

Temp 1.0 seems to be a bit too high for Gemma 2 27b. What is the 'natural' temp for this model, does anyone know?

bfroemel commented 3 months ago

1.0 is the default temperature set in aistudio. Did you notice any detrimental effect regarding a temp of 1.0?

MoonRide303 commented 3 months ago

Temp 1.0 seems to be a bit too high for Gemma 2 27b. What is the 'natural' temp for this model, does anyone know?

I've noticed both temperature 0 and 1.0 used in Google code (in gemma.cpp repo):

  1. https://github.com/google/gemma.cpp/blob/main/evals/run_mmlu.cc#L132
  2. https://github.com/google/gemma.cpp/blob/main/examples/hello_world/run.cc#L78

Rotatingxenomorph commented 3 months ago

1.0 is the default temperature set in aistudio. Did you notice any detrimental effect regarding a temp of 1.0?

It seemed to have some trouble with numbers/math at temp 1.0.

gemma-2-27b-it-Q8_0.gguf --top-k 0 --min-p 0.0 --top-p 1.0 --color -t 5 --temp 1 --repeat_penalty 1 -c 4096 -n -1 -ngl 14 --conversation -i

at temp 1.0 I get this:

how many years did aliens come out before alien 3?

Aliens (the sequel to Alien) came out 7 years before Alien 3.

Here's the breakdown:

at temp 0 I get:

how many years did aliens come out before alien 3?

"Aliens" was released in 1986.

"Alien 3" was released in 1992.

Therefore, "Aliens" came out 6 years before "Alien 3".

Let me know if you have any other movie trivia questions!

Rotatingxenomorph commented 3 months ago

I tried it on the Gemini Flash 1.5 API at temp 1 AND temp 0 and it also got it wrong, so I guess llama.cpp is off the hook!

[Screenshot: Chat with Open Large Language Models]

ggerganov commented 3 months ago

Btw, for these kinds of queries that require known facts you should always use temp == 0.0f
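
(With the llama.cpp server that just means sending "temperature": 0 with the request, e.g.:)

import requests

# Greedy (temperature 0) completion via a locally running llama-server;
# URL and prompt are placeholders.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "How many years did Aliens come out before Alien 3?",
        "temperature": 0,
        "n_predict": 128,
    },
)
print(resp.json()["content"])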

eskeletor97 commented 3 months ago

https://github.com/huggingface/transformers/pull/31775 is this relevant to llama.cpp implementation?

matteoserva commented 3 months ago

I think it's already correct in llama.cpp (feel free to correct me if I'm wrong): https://github.com/ggerganov/llama.cpp/blob/be20e7f49d9e5c6d9e8d9b4871eeba3df7a1639d/src/llama.cpp#L11572

steampunque commented 3 months ago

Use proper nouns; it helps the model know what you are talking about.

lm how many years did Aliens come out before Alien 3?
Here's the breakdown:

* **Aliens** was released in 1986.
* **Alien 3** was released in 1992.

Therefore, **Alien 3** came out **6 years** after **Aliens**.

Or just use multi-turn; it should work fine, the model will create proper nouns in context.

bash-5.1$ lm when did movies aliens and  alien 3 come out?
Here are the release dates for the movies you asked about:

* **Alien** - June 25, 1979
* **Aliens** - July 18, 1986
* **Alien 3** - May 11, 1992 

Let me know if you have any other movie release dates you'd like to know! 

bash-5.1$ lmc how many years did aliens come out before alien 3?
Aliens was released in 1986 and Alien 3 in 1992.  

There are **6 years** between the release of Aliens and Alien 3.

Rotatingxenomorph commented 2 months ago

Use proper nouns, it helps the model know what you are talking about.

Hah! I first got the problem in the context of it writing an essay about Alien 3, but I couldn't reproduce it. I think another part of it might be that Alien was released 7 years before Aliens, so maybe that's where the network is getting that urge from?

cuelebra commented 2 months ago

Here are two prompts that were run at temp 0 with Gemma 27B Q8_0: https://pastebin.com/9UCkX201. You can remove the last output of the model (up to "model") and test it yourself.

The difference is that the second prompt has one more paragraph of lorem ipsum, but in fact just adding a line break to the last paragraph causes the same degradation of formatting and coherence ("Bard" instead of "Gemma", double space instead of single space in two places).

BugReporterZ commented 2 months ago

There have been reports that using a higher f_final_logit_softcapping than the default value of 30 (e.g. 50) may solve certain quality issues on Gemma-2-27B. Has anybody tried? It would be useful if this value (and possibly that of f_attn_logit_softcapping as well) could be changed without requantizing the model.
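
For reference, Gemma-2's logit soft-capping is a tanh squash of the form cap * tanh(logits / cap), so raising f_final_logit_softcapping from 30 to 50 mainly loosens how hard large logits get compressed (illustrative sketch):

import math

# Gemma-2 style soft-capping: cap * tanh(logit / cap).
def softcap(logit: float, cap: float) -> float:
    return cap * math.tanh(logit / cap)

for x in (10.0, 25.0, 40.0):
    print(x, softcap(x, 30.0), softcap(x, 50.0))
# e.g. a raw logit of 40 is squashed to ~26.1 with cap=30 but only to
# ~33.2 with cap=50; small logits are left almost unchanged either way.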

cuelebra commented 2 months ago

@BugReporterZ I set final_logit_softcapping to 50 in config.json and, just in case, also replaced the default 30.0f with 50.0f in the llama.cpp source, then requantized the model. The output for the above prompts was unaffected.

ggerganov commented 2 months ago

For anyone running tests relying on context shift, make sure to try https://github.com/ggerganov/llama.cpp/pull/8348 since there was a bug that affected the quality of context shifts for Gemma2 models

cuelebra commented 2 months ago

Someone found out that the llama.cpp gemma2 tokenizer splits a certain word into multiple tokens, even though it is defined as a single token in tokenizer.json. Is this expected or not?

curl -X POST -H "Content-Type: application/json" -d '{"content":"[toxicity=0]"}' http://localhost:8080/tokenize
{"tokens":[235309,1373,235293,235276,235307]}

ngxson commented 2 months ago

@cuelebra The mentioned token probably isn't used by Gemma (maybe Google reuses the same tokenizer for other models).

HF transformers outputs the same thing:

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
tokenizer("[toxicity=0]")
# [2, 235309, 1373, 235293, 235276, 235307]

This token needs to be marked as a special token to make it work, but that's not the case; see: https://huggingface.co/google/gemma-2-9b/blob/main/tokenizer_config.json
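
A quick way to see the distinction (sketch; it assumes the string is present in the vocab as a regular, non-special token, as the tokenizer.json reference above suggests):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2-9b")
tid = tok.convert_tokens_to_ids("[toxicity=0]")

print(tid)                          # id of the single token defined in tokenizer.json
print(tid in tok.all_special_ids)   # False: it is not registered as a special token
print(tok("[toxicity=0]", add_special_tokens=False)["input_ids"])
# -> [235309, 1373, 235293, 235276, 235307]: because it isn't special,
#    normal tokenization still splits the text instead of matching it whole.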

ngxson commented 2 months ago

[!IMPORTANT]
A note for everyone: if you think there's a bug in llama.cpp tokenizer, please make sure to test with HF transformers library first (see my comment above for example)

AUTOMATIC1111 commented 2 months ago

This is a difference between how the corporate-hosted implementation and llama.cpp work. If it's different for this particular token, maybe there are other cases where tokenization differs from how Google trained the model. It's entirely possible that the transformers implementation of the Gemma tokenizer is not correct, especially considering they already had other implementation bugs.

oldgithubman commented 2 months ago

Btw, for these kinds of queries that require known facts you should always use temp == 0.0f

What is the 'f' for?

AUTOMATIC1111 commented 2 months ago

for letting everyone know that it's a single-precision floating-point number

compilade commented 2 months ago

HTML tags are not yet tokenized correctly by Gemma-2's tokenizer in llama.cpp. I think I managed to fix this in #8228, but it unfortunately requires re-converting Gemma models with the changes from that branch, see https://github.com/ggerganov/llama.cpp/pull/8228#issuecomment-2213014331
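
In the meantime the same transformers cross-check from earlier in the thread works for HTML tags too (sketch; the tag string and server URL are placeholders):

import requests
from transformers import AutoTokenizer

text = "<table><tr><td>cell</td></tr></table>"  # example HTML snippet

# Reference tokenization from HF transformers.
hf_tok = AutoTokenizer.from_pretrained("google/gemma-2-9b")
print(hf_tok(text, add_special_tokens=False)["input_ids"])

# Tokenization from a locally running llama-server.
resp = requests.post("http://localhost:8080/tokenize", json={"content": text})
print(resp.json()["tokens"])
# A mismatch here is the discrepancy #8228 addresses; fixing it requires
# re-converting the GGUF with the updated converter from that branch.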

oldgithubman commented 2 months ago

HTML tags are not yet tokenized correctly by Gemma-2's tokenizer in llama.cpp. I think I managed to fix this in #8228, but it unfortunately requires re-converting Gemma models with the changes from that branch, see #8228 (comment)

Are you guys planning to merge that branch or am I waiting around like an idiot for nothing? I see related changes happening elsewhere. Just wondering when I can re-convert. Again, let me know if there's anything I can do to help speed it up

progmars commented 2 months ago

Formatting is a serious issue with the model. It really isn't able to predict the correct formatting using previous responses at all in my use case.

That's my experience, too. My instructions had clear directions to use * (asterisks) for actions and I had dialog examples. Gemma stubbornly kept using quotes around speech, did not use asterisks around actions, and kept using double newlines between paragraphs. After a dozen messages (which I corrected manually), Gemma finally stopped using quotes and started using asterisks correctly. However, nothing helped against the double newlines. I haven't yet seen such a stubborn LLM when it comes to formatting.