Llama3 GGUF conversion with merged LORA Adapter seems to lose training data randomly

Sneakr commented 6 months ago

I'm running Unsloth to fine tune LORA the Instruct model on llama3-8b .

1: I merge the model with the LORA adapter into safetensors 2: Running inference in python both with the merged model directly or the unsloth loaded model with the adapter on top of it produces correct outputs as per the fine tune

Bug: GGUF conversion of the merged model does not produce the same output. The GGUF has lost some of its fine tune data, while still maintaining most of it.

I can ask it who it is, who created it etc. And it responds Llama and Meta as usual, but it incorporates the fine tuned speech style and humor into the response. This is not the case for my fine tuned model.

1: I tried merging the LORA adapter with the original GGUF (non-fine tuned) using llama.cpp, the same results. 2: I tried running the server on the original GGUF (non-fine tuned) usling llama.cpp server and the adapter loaded into the server terminal command - same results.

It seemes that GGUF conversion is losing fine tuned data randomly during conversion.

If this is the case, all GGUF converts of the fine tuned models are basically out the window. And the question is how much the non-fine tuned models are affected by this.

I've tried F16, Q8, same issues.

This is not a quantization issue as I get the exact same results running FP16 as well as 4-bit in python running HF loader or Unsloth, both works fine as mentioned.

JohannesGaessler commented 6 months ago

@Sneakr for reference, can you post the exact steps you took for creating a GGUF file from your Unsloth LoRA? Obviously somewhere in the pipeline something went wrong but the question is where.

Sneakr commented 6 months ago

@Sneakr for reference, can you post the exact steps you took for creating a GGUF file from your Unsloth LoRA? Obviously somewhere in the pipeline something went wrong but the question is where. Sure:

Step 1: (Tested both with and without unsloth and HF AutoModel, both had the same outcome)

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = lora_model,
        max_seq_length = max_seq_length,
        dtype = torch.bfloat16,
        load_in_4bit = False,
    )

    model = model.merge_and_unload()
    model.save_pretrained(save_dir)

Step 2: CUDA_VISIBLE_DEVICES="" python ./llama.cpp/convert-hf-to-gguf.py ./xmerge/NewModel --outfile ./xmerge/NewModel/NewModel_F32.gguf --outtype f32

The Lora was tested in bfloat16 training as well as QLora 4bit , produces same outcome.

Sneakr commented 6 months ago

@JohannesGaessler IT seems altough it fixed things for improvement using the template , there's still issues. I compared the answers by the inference running output_ids = model.generate and it is more in line with my fine tuning, while ooba seems still to be losing like a huge portion of the fine tuning still.

This is really really wierd. I hope we can get more eyes on this issue. I'm taking a break now.

oldgithubman commented 6 months ago

Given the new evidence I'm thinking this could be an issue with tokenization. Can you check llama.cpp vs. llama.cpp_hf in Oobabooga?

Also just to make sure: you are testing with temperature 0 in order to rule out issues with different sampling settings, right?

Sorry to kind-of hijack, but I've been wondering this for a while. Is there any practical difference between llama.cpp vs llama.cpp_hf? Should I be favoring one over the other?

oldgithubman commented 6 months ago

I don't know if it's related, but on HF some people have suggested changes to config.json and tokenizer_config.json. Wondering if you're aware of them. I've been using them.

config.json: change line 8 to: "eos_token_id": [128001, 128009],

tokenizer_config.json: change line 2055 to: "eos_token": "<|eot_id|>",

JohannesGaessler commented 6 months ago

Sorry to kind-of hijack, but I've been wondering this for a while. Is there any practical difference between llama.cpp vs llama.cpp_hf?

My understanding is that the llama.cpp loader uses the tokenizer and sampling provided by llama.cpp while llama.cpp_HF uses those provided by HuggingFace.

Should I be favoring one over the other?

In principle, assuming both work correctly, I favor the llama.cpp loader since it is simply faster. In this particular case for some reason the tokenization seems to become wrong when going from Unsloth to GGUF. In addition to that, the Oobabooga llama.cpp loader seems to get the tokenization wrong (this does not seem to happen when using llama.cpp directly).

Sneakr commented 6 months ago

@JohannesGaessler I ran your test with your system prompt on model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

This is the output I got on 0.1 temp with output_ids = model.generate (could not run 0 in inference)

As you see this is clearly different from the responses you got from ooba and llama.cpp . Could you verify this running inference with code and not through llama.cpp ?

Edit: I think this answer is more logical ,as it knows and recognize the downloading of 3d models etc.

JohannesGaessler commented 6 months ago

This is the output I got on 0.1 temp

0.1 temperature is I think still too high to rule out random differences from sampling. Can you set top-k 1?

As you see this is clearly different from the responses you got from ooba and llama.cpp . Could you verify this running inference with code and not through llama.cpp ?

Using the original weights and 0.01 temperature I do not get consistent results in Oobabooga with the Transformers loader. While this could be an effect of the non-zero temperature what I think is more likely is that Transformers internally uses atomic adds for better performance. Edit: this still happens with top-k 1. The atomic adds cause the order in which floating point operations are done to be undefined and therefore results in small amounts of noise on the results of individual matrix multiplications and therefore nondeterministic results (e.g. ExLlama to my knowledge has the same behavior). One of the possible outputs almost exactly matches the llama.cpp_HF loader result:

``` <|begin_of_text|><|start_header_id|>system<|end_header_id|> A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.<|eot_id|><|start_header_id|>user<|end_header_id|> how 2 download a car<|eot_id|><|start_header_id|>assistant<|end_header_id|> I think there might be a slight misunderstanding! You can't actually download a car, as it's a physical object that exists in the real world. Cars are manufactured and sold by companies, and they can't be transferred or downloaded digitally. However, if you're asking about how to purchase or acquire a car, I'd be happy to help with that! There are several ways to buy a car, depending on your location and preferences. Here are a few options: 1. Visit a car dealership: You can visit a local car dealership and test drive a car to see if it's the right fit for you. The salesperson will guide you through the purchasing process, and you can negotiate the price and terms of the sale. 2. Online marketplaces: Websites like Cars.com, Autotrader.com, or Craigslist.org allow you to browse and purchase cars from private sellers or dealerships. Be cautious when buying from a private seller, and make sure to inspect the car thoroughly before making a purchase. 3. Auctions: You can also attend car auctions, either in-person or online, to bid on cars. This can be a great way to find a deal, but be prepared to act quickly and make a decision on the spot. 4. Leasing: If you don't want to own a car outright, you can consider leasing one. This involves paying a monthly fee to use the car for a set period of time, usually 2-3 years. Remember to do your research, read reviews, and test drive a car before making a purchase. It's also a good idea to have a mechanic inspect the car before buying, especially if you're purchasing from a private seller. I hope that helps clarify things! Let me know if you have any other questions.assistant I'm glad I could help clarify things for you. It's important to remember that cars are physical objects that can't be downloaded or transferred digitally. If you're in the market for a new set of wheels, I recommend doing your research, reading reviews, and test driving a few cars before making a decision. If you have any more questions or need further assistance, feel free to ask! I'm here to help.assistant You're welcome! It was my pleasure to assist you. I hope you find the perfect car that meets your needs and budget. Remember to always prioritize your safety and well-being when purchasing a vehicle, and don't hesitate to reach out if you have any more questions or concerns. By the ```

Notably one of the other responses was very close to the llama.cpp loader result with incorrect tokenization:

``` <|begin_of_text|><|start_header_id|>system<|end_header_id|> A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.<|eot_id|><|start_header_id|>user<|end_header_id|> how 2 download a car<|eot_id|><|start_header_id|>assistant<|end_header_id|> I think there might be a slight misunderstanding! You can't actually download a car, as it's a physical object that exists in the real world. Cars are manufactured and sold by companies, and they can't be transferred or downloaded digitally. However, if you're asking about how to purchase or acquire a car, I'd be happy to help with that! Here are some general steps you can follow: 1. Research: Look up different car models, read reviews, and compare features to find the one that suits your needs and budget. 2. Check availability: Visit a dealership or check online marketplaces to see if the car you want is available in your area. 3. Test drive: Take the car for a spin to get a feel for how it handles and to ensure it's a good fit for you. 4. Financing: Explore financing options, such as loans or leases, to determine what works best for you. 5. Purchase: Once you've found the right car and secured financing, you can finalize the purchase and take ownership of the vehicle! If you have any specific questions about the process or need help with a particular step, feel free to ask!assistant I'm glad I could help clarify things for you. It's not possible to download a car, but I'm happy to assist you with the process of purchasing or acquiring one. If you have any more questions or need further guidance, please don't hesitate to ask. I'm here to help!assistant That's completely understandable! It can be a bit overwhelming when trying to navigate the process of buying a car. If you have any more questions or concerns, feel free to ask and I'll do my best to help. In the meantime, you might want to consider doing some research on your own to get a better sense of what you're looking for in a car. You can read reviews, compare features, and check out different models to get a sense of what's out there. Additionally, you might want to consider working with a car-buying service or a trusted friend or family member who has experience with buying cars. They can help guide you through the process and provide valuable insights and advice. Remember, buying a car is a big decision, and it's okay to take your time and do your research. Don't be afraid to ask questions or seek help when you need it. Good luck!assistant That's a great idea! Researching and doing your due diligence can really help you make an informed ```

JohannesGaessler commented 6 months ago

Anyways, to recapitulate my current position: There probably are tokenization issues somewhere in the Unsloth -> GGUF pipeline. I still do not accept the llama ASCII art test as evidence that there is something fundamentally wrong with the llama.cpp inference code or the GGUF file format. I think all it proves is that the results are not bit-for-bit identical. Even the original LLaMA 3 Instruct 8b weights with the Transformers loader can produce wildly different outputs due to what I assume are small differences in rounding error from atomic adds.

Sneakr commented 6 months ago

@JohannesGaessler The GGUF file format was not the issue since AWQ in ooba produced the same issue, so it's probably a tokenization issue, the quesiton is where and what.

Here's the output changing top_k and temp to 0.01:

I still think this is a more logical response than giving a step by step guide on how to buy a car when you ask for download. Don't you think?

And this is the instruct model from META no fine tunes.

JohannesGaessler commented 6 months ago

Whether or not a single response is subjectively more "logical" is completely irrelevant. Changing the inference code is going to lead to different results. And as long as the changes aren't extremely large you would need to investigate a sample size of at least thousands of responses in order to draw statistically significant conclusions.

JohannesGaessler commented 6 months ago

so it's probably a tokenization issue, the quesiton is where and what.

I would suggest you check the tokenization in Ooba and compare it then.

Sneakr commented 6 months ago

@JohannesGaessler

Of course we don't draw a conclusion from a mere single prompt. It was just to state the obvious, llama.cpp produces different outputs compared to loading the model directly, both for the fine tunes and as showing this single prompt on the non-tuned instruct model.

For now I have only tested the fine tuned models, and the changes are bigger, and we can conclude that this is not a GGUF only issue. AWQ works perfectly in code inference, but not on ooba, on ooba it produces the same broken output.
Changing the prompt template in ooba produces slightly better results, but far from the expected ones as I referenced previously.

We need more people testing for themselves to draw a better and more grounded conclusion to where the issue is. Merely pointing at somethinig without direct evidence is just pure speculation, I want to get a grip of the issue and pinpoint it so we know for sure. However, we have pinpoint it thanks to your testing that it could be something with the tokenization.

JohannesGaessler commented 6 months ago

it could be something with the tokenization.

As I said, check the tokenization then. If the vector of tokens going into the model is the exact same, then tokenization has nothing to do with it.

olinorwell commented 6 months ago

Personally I would focus all efforts on reproducing the issue using pure llama.cpp built on the command line from the latest commit, and leave Ooba and other front-ends for the moment.

Given the issues/confusion surrounding <|eot_id|> since the L3 release it risks introducing noise into what could be an important bug fix. Ooba and others have had problems with L3 that were mostly fixed by manually configuring that stop token.

If you need any additional testing perhaps create a git repo that others can clone and run locally. I am a user of Unsloth too and would be keen to pinpoint what exactly is going on here. My fear is that we're combining a bug, a known about bug and randomness into one bug report which will be very hard to resolve to everyone's liking.

Sneakr commented 6 months ago

@olinorwell exactly my point! Thanks for clarifying , simple speculation and pointing in random directions don't lead anywhere to solve a potential bug that is important to solve.

Here's a colab on the fingerprint test, Daniel is working on more colabs to reproduce the issue will update here when I got more info: https://colab.research.google.com/drive/1djwQGbEJtUEZo_OuqzN_JF6xSOUKhm4q?usp=sharing

JohannesGaessler commented 6 months ago

Personally I would focus all efforts on reproducing the issue using pure llama.cpp built on the command line from the latest commit, and leave Ooba and other front-ends for the moment.

I think this depends on what you're trying to investigate. Ooba allows you to use the exact same code for tokenization and sampling so you can do A/B testing of only the actual inference code. The llama.cpp_HF results that I get for multiple prompts are consistent with the inherent nondeterminism of Transformers, i.e. floating point rounding error (when using FP16 for both). The pattern is the same as for the llama ASCII art test: the sequences are the same for some time but then they randomly sample a single different token at which point they diverge. If you use BF16 for Transformers the divergence happens earlier but I very much do not expect that either data type is going to be statistically significantly better in any meaningful way. If anything it's going to be FP16 that performs better because there is less rounding error for the calculations. The rounding error of converting subnormal BF16 weights to FP16 is negligible for a matrix multiplication. And values larger than the max. representable FP16 value are just going to cause NaNs.

So assuming that Huggingface Transformers produces correct results then llama.cpp_HF also produces correct results. This then only leaves tokenization and sampling. Greedy sampling is so simple that I would be extremely surprised if there were any issues with it. And tokenization can simply be checked. If there are no issues with that either then as far as I am concerned there are no actual issues.

Edit: no actual issues with GGUF models converted from HF format.

Sneakr commented 6 months ago

@JohannesGaessler I don't see yet where you tried fine tuning LORA anywhere? Did I miss something?

So assuming that Huggingface Transformers produces correct results then llama.cpp also produces correct results.

Except, it doesnt. And that's the reason this whole thread has been opened and many people are investigating this as of these moment and they all conclude the same results, except you, because you don't want to test out anything but throw assumptions? As much as I appreciate your time and effort, let's keep this thread clean of assumptions now because the fact is you are not willing to test the fine tune because you delcared your position as quote:

Sorry, but I disagree. I don't need to present any evidence myself in order to express that I disagree with the conclusions drawn from the evidence that other people present.

This is not about a conspiracy theory. We are in llama.cpp github repo and there's an obvious difference with inference with torch and HF directly through python and a completely different outcome using llama.cpp. If you can't accept the fact , let's keep this thread clean from speculation and mere assumptions since you are not willing to experiment yourself as you see "it is up to use to provide evidence."

JohannesGaessler commented 6 months ago

Let me remind you of the title of this issue:

Llama3 GGUF conversion with merged LORA Adapter seems to lose training data randomly

The specific claim made here is that this is a llama.cpp/GGUF issue and that essentially the numerical results of the token probabilities given a prompt are incorrect. I am not observing any differences beyond rounding error for LLaMA 3 Instruct 8b FP16 between HF Transformers and llama.cpp. As long as llama.cpp_HF and HF Transformers are consistent then they can only be both correct or both incorrect in the exact same way. I don't need to train any LoRAs or do any finetuning because once you merge the LoRA with a given model all that changes are the model weights. And beyond numerical issues the specific model weights do not affect the correctness of the results.

I've already said it multiple times but you simply cannot expect bit-for-bit identical results from neural networks if you change the inference software. llama.cpp results being different from PyTorch results is not a bug but an inevitable consequence of floating point arithmetic.

Sneakr commented 6 months ago

@JohannesGaessler

That was the original claim many man-hours ago. Please, let's not turn this into a debate where the goal is to convince you a single individual of something as this is not in my interest. This thread has 70+ comments, many people has been investigating the issue since before this thread even opened, and we have concluded that it is something else, but still an issue that we cant pinpoint.

If you don't agree there's any issue here, glad, thank you move on. Thanks for your input and your efforts.

Cheers.

Edit:

inevitable consequence of floating point arithmetic.

Not really, this has been tested in F32. AWQ 4 bits as well as other formats.

Inference at 4bit produces the exact same results in code without llama.cpp as expected. This is not an "inevitable consequence of floating point arithmetic".

And presumably Meta has thrown a lot more compute and training data at their instruct model than you did for your LoRA. My expectation therefore would be that given even a slight perturbation of the results the model reverts back to the Meta finetune behavior.

Let's not dance in circles now. I assume I don't have the "Meta" horse power to fine tune a model and that Meta's magical model can regrow it's layers and training back to its original . It seems that Meta and Llama 3 finally solved the "catastrophig forgetting" issue that is present with fine tuning and training pre-trained models, as they can grow the data back to its original.

abc-nix commented 6 months ago

Can someone share a gguf file for testing? If created through Unsloth even better. I don't know how to download the gguf file from the colab, so if it can be shared on huggingface and linked here it would be great (the fingerprint test could be interesting to test). And also please provide the exact prompt that should be tested and the expected output.

I am comparing server output between llama.cpp (OAI API chat completion) and mistral.rs (that uses the candle library from huggingface) for the same Meta-Llama-3-8B-instruct q8 gguf file. Inspired by @JohannesGaessler's test, this is the command I send to the API chat completion:

``` curl http://localhost:1234/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer EMPTY" \ -d '{ "model": "", "frequency_penalty": 0, "top_p": 0, "temperature": 0, "seed": 0, "messages": [ { "role": "system", "content": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to any question." }, { "role": "user", "content": "how 2 download a car" } ] }' ```

On both (llama.cpp server running with cublas and mistral.rs server running on CPU) I get the exact same output (magically).

I think there may be a bit of confusion here! Unfortunately, it's not possible to download a car, as it's a physical object that exists in the real world and can't be transferred digitally.\n\nCars are complex machines that require assembly, manufacturing, and testing before they can be driven on the road. They also require a physical space to be stored and maintained.\n\nIf you're looking to purchase a car, I'd be happy to help you with that! You can explore various options such as visiting a dealership, browsing online marketplaces, or checking out local listings.\n\nIf you're looking for a virtual or digital representation of a car, there are some options available. For example, you can find digital car models or simulations online, or even play car racing games. However, these are not the same as owning a physical car.\n\nLet me know if there's anything else I can help you with!

So, with the exact same GGUF file on different inference engines, I get the same results. mistral.rs I believe uses the same llama-3 tokenizer I direct it to (tokenizer.json downloaded from the Nous Research huggingface repo for llama-3 8B Instruct). I am curious to see if an Unsloth LORAd gguf also has the same results on both inference engines. If so, maybe it isn't llama.cpp, but the method of creating the gguf file that has the issue.

gilbertgong commented 6 months ago

When I fed the same prompt to the llama.cpp tokenize binary I get the correct tokenization:
128000 -> '<|begin_of_text|>'
128000 -> '<|begin_of_text|>'
128006 -> '<|start_header_id|>'
  9125 -> 'system'
128007 -> '<|end_header_id|>'
   271 -> '

'
    32 -> 'A'
  6369 -> ' chat'
  1990 -> ' between'
   264 -> ' a'
 22999 -> ' curious'
  1217 -> ' user'
   323 -> ' and'
   459 -> ' an'
 21075 -> ' artificial'
 11478 -> ' intelligence'
 18328 -> ' assistant'
    13 -> '.'
So these are possibly two different issues. But in any case, I think it's worthwhile to check that the prompt you're using for testing is being properly tokenized.

@JohannesGaessler Any idea why tokenize binary gives correct tokenization while llama.cpp does not (if I understand what you're saying correctly?) Have you opened a separate issue to track that? Seems like regardless of its relation to this issue, that's something that needs to be fixed?

JohannesGaessler commented 6 months ago

If I remember correctly Ooba internally uses llama-cpp-python bindings. If I had to guess the issue is either that the version of said bindings is too old or that they need to be adapted for the BPE tokenizer fixes in llama.cpp. In any case, I have already opened an issue on the Oobabooga Github: https://github.com/oobabooga/text-generation-webui/issues/5983

gilbertgong commented 6 months ago

@JohannesGaessler

So just to confirm, are you saying the issue you saw where conversely tokenize created correct results was specific to oobabooga? I wasn't entirely clear, looking back you had labeled the below tokenization output as "llama.cpp" but I am now guessing you meant through oobabooga, and you expect that using llama.cpp directly does not produce incorrect tokenization?

ref:

llama.cpp

27     -  '<'
91     -  '|'
7413   -  'begin'
3659   -  '_of'
4424   -  '_text'

danielhanchen commented 6 months ago

Hi so I managed to test HF -> llama.cpp without Unsloth to remove Unsloth from the picture.

'\n\n' is tokenized as [1734, 1734], unless if I prompted it incorrectly.
[1734] using tokenizer.batch_decode([1734]) returns \\n.
Ie llama.cpp is tokenizing \n\n as \\n\\n.
Using HF directly, we get: \\n = 1734 \n = 198 \n\n = 271 \n\n\n = 1432 4\n = 1038 5\n = 14963 6\n = 5244 7\n = 35683 8\n = 6087 9\n = 55160

See reproducible notebook: https://colab.research.google.com/drive/1aNS8CgXoJZHclBEW3ZjFfiLjpmqZ14KN?usp=sharing

Below is the comparison of tokenization differences between llama.cpp and HF:

I also used convert.py which I'm assuming is not anyways supposed to work (maybe). I chose --vocab-type bpe. Reproducible example: https://colab.research.google.com/drive/1X8XBdLRf1-eRDSfcr_GrIhaf84Wp9FH1?usp=sharing

Sadly convert.py is even worse, splitting the newlines into 2 distinct characters:

gabriel-peracio commented 6 months ago

I just ran the same unsloth-generated GGUF (fp16 trained, fp16 saved) with ./mistralrs_server -n 20 -i gguf -m bartowski/Meta-Llama-3-8B-Instruct-GGUF -f ~/Downloads/model-unsloth.F16.gguf -t leliuga/Meta-Llama-3-8B-Instruct-bnb-4bit

That is, using mistral.rs

That very same GGUF produces this:

So yeah, this all but confirms that the bug is not in the GGUF, but in llama.cpp

gabriel-peracio commented 6 months ago

@abc-nix Here is your fingerprint GGUF as requested: https://huggingface.co/fimbulvntr/llama-3-instruct-fingerprint-fp16/tree/main

This is the same GGUF that I've been using for most of those videos, and the same one that worked under mistral.rs

Trigger it with a simple !!llama.cpp!! prompt

Sneakr commented 6 months ago

@gabriel-peracio Now this is the kind of experiments and discussions that leads the open source community forward for the betterment! :) Great job!!

JohannesGaessler commented 6 months ago

@gilbertgong Yes, I specifically meant the Oobabooga "llama.cpp" loader.

@danielhanchen these are the results I get with llama.cpp tokenize and main:

tokenize

``` johannesg@johannes-romed82t-00 ~/Projects/llama.cpp [21:26:16] > $ ./tokenize models/opt/llama_3_instruct-8b-f16.gguf "<|begin_of_text|><|start_header_id|>user<|end_header_id|> [±master ●●(✹)] \!\!llama.cpp\!\!<|eot_id|><|start_header_id|>assistant<|end_header_id|> " llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from models/opt/llama_3_instruct-8b-f16.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 1 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... llama_model_loader: - type f32: 65 tensors llama_model_loader: - type f16: 226 tensors llm_load_vocab: special tokens definition check successful ( 256/128256 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = F16 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 14.96 GiB (16.00 BPW) llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128001 '<|end_of_text|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llama_model_load: vocab only - skipping tensors llama_new_context_with_model: n_ctx = 512 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 128000 -> '<|begin_of_text|>' 128000 -> '<|begin_of_text|>' 128006 -> '<|start_header_id|>' 882 -> 'user' 128007 -> '<|end_header_id|>' 271 -> ' ' 3001 -> '!!' 657 -> 'll' 3105 -> 'ama' 7356 -> '.cpp' 3001 -> '!!' 128009 -> '<|eot_id|>' 128006 -> '<|start_header_id|>' 78191 -> 'assistant' 128007 -> '<|end_header_id|>' 271 -> ' ' ```

main

``` johannesg@johannes-romed82t-00 ~/Projects/llama.cpp [21:36:46] > $ ./main --verbose-prompt -m models/opt/llama_3_instruct-8b-f16.gguf -p "<|begin_of_text|><|start_header_id|>user<|end_header_id|> [±master ●●(✹)] \!\!llama.cpp\!\!<|eot_id|><|start_header_id|>assistant<|end_header_id|> " Log start main: build = 2797 (858f6b73) main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu main: seed = 1715024217 llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from models/opt/llama_3_instruct-8b-f16.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000,000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0,000010 llama_model_loader: - kv 10: general.file_type u32 = 1 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... llama_model_loader: - type f32: 65 tensors llama_model_loader: - type f16: 226 tensors llm_load_vocab: special tokens definition check successful ( 256/128256 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0,0e+00 llm_load_print_meta: f_norm_rms_eps = 1,0e-05 llm_load_print_meta: f_clamp_kqv = 0,0e+00 llm_load_print_meta: f_max_alibi_bias = 0,0e+00 llm_load_print_meta: f_logit_scale = 0,0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000,0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = F16 llm_load_print_meta: model params = 8,03 B llm_load_print_meta: model size = 14,96 GiB (16,00 BPW) llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128001 '<|end_of_text|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes ggml_cuda_init: found 6 CUDA devices: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 3: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 4: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 5: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes llm_load_tensors: ggml ctx size = 0,15 MiB llm_load_tensors: offloading 0 repeating layers to GPU llm_load_tensors: offloaded 0/33 layers to GPU llm_load_tensors: CPU buffer size = 15317,02 MiB ......................................................................................... llama_new_context_with_model: n_ctx = 512 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000,0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA_Host KV buffer size = 64,00 MiB llama_new_context_with_model: KV self size = 64,00 MiB, K (f16): 32,00 MiB, V (f16): 32,00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0,49 MiB llama_new_context_with_model: CUDA0 compute buffer size = 1260,50 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 9,01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 356 system_info: n_threads = 64 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | main: prompt: '<|begin_of_text|><|start_header_id|>user<|end_header_id|> !!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|> ' main: number of tokens in prompt = 16 128000 -> '<|begin_of_text|>' 128000 -> '<|begin_of_text|>' 128006 -> '<|start_header_id|>' 882 -> 'user' 128007 -> '<|end_header_id|>' 271 -> ' ' 3001 -> '!!' 657 -> 'll' 3105 -> 'ama' 7356 -> '.cpp' 3001 -> '!!' 128009 -> '<|eot_id|>' 128006 -> '<|start_header_id|>' 78191 -> 'assistant' 128007 -> '<|end_header_id|>' 271 -> ' ' sampling: repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000 top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800 mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000 sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 0 <|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|> !!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|> A llama! Here is a simple C++ program that prints out a llama: \`\`\` llama_print_timings: load time = 2271,21 ms llama_print_timings: sample time = 1,90 ms / 18 runs ( 0,11 ms per token, 9453,78 tokens per second) llama_print_timings: prompt eval time = 240,64 ms / 16 tokens ( 15,04 ms per token, 66,49 tokens per second) llama_print_timings: eval time = 2227,42 ms / 17 runs ( 131,02 ms per token, 7,63 tokens per second) llama_print_timings: total time = 2526,73 ms / 33 tokens ```

The double linebreaks are being tokenized correctly. However, I'm noticing that there is an extra <|begin_of_text|> token being added. Let me check the exact code that does this and whether the server does that as well.

In any case, I am not at all familiar with the recent tokenizer changes. My understanding is that convert-hf-to-gguf.py is supposed to identify the correct BPE tokenizer and I think that for whatever reason this is not happening for the HF model produced with Unsloath. @ggerganov your input would be appreciated.

JohannesGaessler commented 6 months ago

@ggerganov notifying you again in case the malformed code block in my previous post swallowed it.

gabriel-peracio commented 6 months ago

Video with the issue (apparently) fixed

https://github.com/ggerganov/llama.cpp/assets/8999086/fdc039a1-b348-43f7-8065-c02de52c2048

Kudos to @ScottMcNaught who submitted this regex pretokenizer fix! It appears to work now

-"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
+"(?:'s|'S|'t|'T|'re|'Re|'rE|'RE|'ve|'vE|'Ve|'m|'M|'ll|'Ll|'lL|'LL|'d|'D)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\r?\\n\\r?\\n\\r?\\n|\\r?\\n\\r?\\n|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"

ScottMcNaught commented 6 months ago

I'm not sure which regex library llama.cpp is using, but the change is to make the regex as compatible as possible across regex libraries.

The change is:

"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"

to:

"(?:'s|'S|'t|'T|'re|'Re|'rE|'RE|'ve|'vE|'Ve|'m|'M|'ll|'Ll|'lL|'LL|'d|'D)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\r?\\n\\r?\\n\\r?\\n|\\r?\\n\\r?\\n|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"

The change does the following:

Removes an incompatible ?i (replaces it with old-school matches of each case)
Adds an extra set of matches for \n\n\n and \n\n sequences in case the regex is matching different / too greedy in the next match (to make a standard implementation)

ScottMcNaught commented 6 months ago

For anyone wanting to use this: 1) Edit your HF model's tokenizer.json file 2) Swap the two patterns in the pretokenizer 3) Convert to gguf using llamacpp 4) Profit

chigkim commented 6 months ago

What does this mean? All the gguf out there need to be requantized after the fix? 😲🤯

abc-nix commented 6 months ago

Thanks you, @gabriel-peracio for sharing the gguf. I can now confirm what you all are saying.

Thanks to @ScottMcNaught for providing the regex pattern that fixes the issue (I have only tested it for the fingerprint).

Note to users: there is no need to "re-quant". Replacing the regex pattern under LLAMA_VOCAB_PRE_TYPE_LLAMA3 in the llama.cpp file before building/compiling will fix the issue (at least for the fingerprint; I didn't test anything else).

[NOTE: this is the current workaround until the llama.cpp devs study this issue]

I tested for both llama.cpp CPU and GPU and I get the fingerprint. I also tested making this change to koboldcpp (but for default BPE regex, as I cannot use override-kv options in koboldcpp) and it worked perfectly. I have yet to test using server, but I asume it will also work.

EDIT: the fingerprint also works on llama.cpp server.

Sneakr commented 6 months ago

@abc-nix Awesome! Thanks to everyone!

emraza1 commented 6 months ago

Replacing the regex pattern under LLAMA_VOCAB_PRE_TYPE_LLAMA3 in the llama.cpp file before building/compiling will fix the issue

Replace to what, can you share the code example? @abc-nix

arch-btw commented 6 months ago

@emraza1 change this line (line 12202 in llama.cpp) https://github.com/ggerganov/llama.cpp/blob/858f6b73f6e57a62523d16a955d565254be889b4/llama.cpp#L12202

To this: https://github.com/ggerganov/llama.cpp/issues/7062#issuecomment-2096818852

Then compile.

JohannesGaessler commented 6 months ago

I downloaded the llama ASCII art model and fed it to main using the latest llama.cpp master commit:

Results

``` johannesg@johannes-romed82t-00 ~/Projects/llama.cpp [0:36:45] > $ ./main -ngl 99 -c 4096 --verbose-prompt -m models/opt/llama_ascii.gguf -p "<|start_header_id|>user<|end_header_id|> [±master ●●(✹)] \!\!llama.cpp\!\!<|eot_id|><|start_header_id|>assistant<|end_header_id|> " Log start main: build = 2797 (858f6b73) main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu main: seed = 1715035007 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/opt/llama_ascii.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = . llama_model_loader: - kv 2: llama.vocab_size u32 = 128256 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.block_count u32 = 32 llama_model_loader: - kv 6: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 8: llama.attention.head_count u32 = 32 llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0,000010 llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000,000000 llama_model_loader: - kv 12: general.file_type u32 = 1 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128256] = [0,000000, 0,000000, 0,000000, 0,0000... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001 llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 128255 llama_model_loader: - kv 21: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... llama_model_loader: - type f32: 65 tensors llama_model_loader: - type f16: 226 tensors llm_load_vocab: missing pre-tokenizer type, using: 'default' llm_load_vocab: llm_load_vocab: ************************************ llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED! llm_load_vocab: CONSIDER REGENERATING THE MODEL llm_load_vocab: ************************************ llm_load_vocab: llm_load_vocab: special tokens definition check successful ( 256/128256 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0,0e+00 llm_load_print_meta: f_norm_rms_eps = 1,0e-05 llm_load_print_meta: f_clamp_kqv = 0,0e+00 llm_load_print_meta: f_max_alibi_bias = 0,0e+00 llm_load_print_meta: f_logit_scale = 0,0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000,0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = F16 llm_load_print_meta: model params = 8,03 B llm_load_print_meta: model size = 14,96 GiB (16,00 BPW) llm_load_print_meta: general.name = . llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128001 '<|end_of_text|>' llm_load_print_meta: PAD token = 128255 '<|reserved_special_token_250|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes llm_load_tensors: ggml ctx size = 0,30 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 1002,00 MiB llm_load_tensors: CUDA0 buffer size = 14315,02 MiB ......................................................................................... llama_new_context_with_model: n_ctx = 4096 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000,0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 512,00 MiB llama_new_context_with_model: KV self size = 512,00 MiB, K (f16): 256,00 MiB, V (f16): 256,00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0,49 MiB llama_new_context_with_model: CUDA0 compute buffer size = 296,00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 16,01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 2 system_info: n_threads = 64 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | main: prompt: '<|start_header_id|>user<|end_header_id|> !!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|> ' main: number of tokens in prompt = 16 128000 -> '<|begin_of_text|>' 128006 -> '<|start_header_id|>' 882 -> 'user' 128007 -> '<|end_header_id|>' 271 -> ' ' 3001 -> '!!' 657 -> 'll' 3105 -> 'ama' 13 -> '.' 11055 -> 'cpp' 3001 -> '!!' 128009 -> '<|eot_id|>' 128006 -> '<|start_header_id|>' 78191 -> 'assistant' 128007 -> '<|end_header_id|>' 271 -> ' ' sampling: repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000 top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800 mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000 sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0 <|begin_of_text|><|start_header_id|>user<|end_header_id|> !!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|> ................,gNp.y@N, ...............gP..@.@.."B, ..............@C...@.@-...$p .............]@....@.@.....@ .............]@....@.@.....@ ..............@,...$U@....]P ........,gg@P*""....-....#@ ...,g@P*'.................*@, .gP"........%Np@P..........'@ ]@..........................@ 'B,.........................]@ .."*MMNBgg,,.................]C ..........."`@`..............$P ............@P...............Bp ............@-................@. ............]@................@P ...........]@................"@ ...........@-.................]@ ...........]@.................,@ ........../@-................'@. .........]@....................Bp .........$P....................]K .........,@C..................J@ ......../@`.....................@ .......]@.......................$P........,,,,,,,, .......]L......................."**$P""""```"""""``""*"*"*PPRNBggg, ........$N.........................]$...........................][-"*NNg, .......gF...........................j`'``'```''"'"""^`""""*"""""]@......"Ng ......]@............................@.,@*b...,@"B,...gP%,...,@b,.@........'%W .......@,...........................@@"...]g@C..."NgP`.."B,@C.."N@.......@..]@ .......]K...........................@g,,,gggggggggggggggggggggggg@.......]P..]@ .......@............................$P...........................$P......]@...@- .......@............................$P....................,,,,,,,$P.......@...$P ......."Bg..........................$P"```]"```"[`"''',..--]g-.-.@P.......@...@ ........j@..........................]PBggN`%w,gP"%g,gP"%wg@"."NNP$P.......@..@C ........][..........................]@.......-..........,,,,,gggg@........@g@' ........'@...........................`^"*T""""""""**""*"`'.`..............@ ........."Bw,.............................................................@ ............@.............................................................$ ............]@.....................................g,,.,,@Ngg@P@..........$ ............."Ngg,,..............gg,..,ggggg@P*RNP"]@`"`....]P.$P.........@ ................-]@.........@BB@P"-'"`-@............@.......]P.]@........][ ..................@........]@..@.......@-...........@.......$P.]@........]P ..................@-.......][..@.......@............@P......@P..@........@ ..................$P.......]P..@.......@............$P......@...@........@ ..................$P.......@`..@......]@............$P......@...@.......]@ ..................]P.......@...@......][............$P......@...@.......]P ..................][......]@...@......@P............]P.....]@...@.......@- ..................][......$P..]@......@.............]P.....]P...@-......@ ..................][......@...]@.....]@.............$P.....@P...@......]P ..................][.....]@...]@.....@P.............$P.....@....@......@- ..................]@.....@P...][.....@..............$P....]@....@.....]@ ..................][.....@....]@....]P..............@-....@P....@.....@ ..................][....$P....]P....@...............@....]@....]@....]@ ..................]@ggg@P.....]BNBNP`...............*NNN**......Bgg@N"<|eot_id|> [end of text] llama_print_timings: load time = 3141,89 ms llama_print_timings: sample time = 77,37 ms / 886 runs ( 0,09 ms per token, 11451,32 tokens per second) llama_print_timings: prompt eval time = 20,06 ms / 16 tokens ( 1,25 ms per token, 797,45 tokens per second) llama_print_timings: eval time = 16241,96 ms / 885 runs ( 18,35 ms per token, 54,49 tokens per second) llama_print_timings: total time = 17222,45 ms / 901 tokens Log end ```

Notably on my machine I am not encountering any issues. The prompt is being tokenized correctly despite the big scary warning. Supplying --override-kv tokenizer.ggml.pre=str:llama3 results in the same tokenization. I'm thinking this is maybe a platform-specific issue. I'm not sure what information is relevant, I am on Manjaro Linux 6.6.26-1, I get correct results both with CPU-only inference using an Epyc 7742 or with an RTX 4090 using CUDA.

JohannesGaessler commented 6 months ago

Disregard my previous post, I forgot that I had removed the <|begin_of_text|> token for testing while working on https://github.com/ggerganov/llama.cpp/pull/7107 . With the exact prompt used in the original video I also get garbage results. This is because llama.cpp adds an extra <|begin_of_text|> token at the beginning so you end up with a different prompt compared to what the model was trained with. If you look closely you can see that the final prompt in the llama.cpp video has two BOS tokens.

Sneakr commented 6 months ago

@abc-nix

I changed the regex to the original and compiled:

word_collection = unicode_regex_split(text, {
                          "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
});

And it seems to work for my fine tunes, unless I did something wierd, can someone verify this?

Edit: in llama.cpp that is

turian commented 6 months ago

Related: #7056 #7049 #7006

oldgithubman commented 6 months ago

So we definitely don't need to requant? We can just wait for the devs to merge a fix and all our current quants will be ok?

abc-nix commented 6 months ago

I think my previous conclusions were premature. I have also followed @JohannesGaessler's removal of the <|begin_of_text|> token (only having one, no duplications), and as he concluded, even if both pre-tokenizations (llama3 and default) are different, we get the same fingerprint result.

./llama.cpp/main --verbose-prompt --keep -1 --temp 0 -s 0 -c 2048 -ngl 50 \
--model "$HOME/models/llama-3-instruct-fingerprint/model-unsloth.F16.gguf" \
--prompt "<|start_header_id|>user<|end_header_id|>\n\n!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" \
--override-kv tokenizer.ggml.pre=str:llama3

llama3 vs default pre-tokenization

Later today I will perform more tests with the new proposed regex pattern to compare results for different prompts.

@Sneakr, after changing the BPE regex for llama3, and loading the model you previously created (with those special answers), are the answers consistent with your fine-tuning? Or are you still getting the wrong replies? It would be great to know if the pre-tokenization regex works properly on your model.

JohannesGaessler commented 6 months ago

In addition to the double BOS issue there also seem to be CLI-specific issues around the escaping of newlines. The tests by other people were done with the following command:

./main -m ./models/opt/llama_ascii.gguf -n 1024 --temp 0.0 --verbose-prompt --check-tensors -p "<|start_header_id|>user<|end_header_id|>\n\n\!\!llama.cpp\!\!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

(The escaping of exclamation marks is necessary with zsh, I don't know about other shells.) This gets you the incorrect tokenization:

128000 -> '<|begin_of_text|>'
128006 -> '<|start_header_id|>'
   882 -> 'user'
128007 -> '<|end_header_id|>'
    59 -> '\'
    77 -> 'n'
    59 -> '\'
    77 -> 'n'
  3001 -> '!!'
   657 -> 'll'
  3105 -> 'ama'
    13 -> '.'
 11055 -> 'cpp'
  3001 -> '!!'
128009 -> '<|eot_id|>'
128006 -> '<|start_header_id|>'
 78191 -> 'assistant'
128007 -> '<|end_header_id|>'
    59 -> '\'
    77 -> 'n'
    59 -> '\'
    77 -> 'n'

However, if you instead use a command with an actual multiline string (alt+return in my shell)

./main -m ./models/opt/llama_ascii.gguf -n 1024 --temp 0.0 --verbose-prompt --check-tensors -p "<|start_header_id|>user<|end_header_id|>

\!\!llama.cpp\!\!<|eot_id|><|start_header_id|>assistant<|end_header_id|>     

"

then you get the correct tokenization:

128000 -> '<|begin_of_text|>'
128006 -> '<|start_header_id|>'
   882 -> 'user'
128007 -> '<|end_header_id|>'
   271 -> '

'
  3001 -> '!!'
   657 -> 'll'
  3105 -> 'ama'
    13 -> '.'
 11055 -> 'cpp'
  3001 -> '!!'
128009 -> '<|eot_id|>'
128006 -> '<|start_header_id|>'
 78191 -> 'assistant'
128007 -> '<|end_header_id|>'
   271 -> '

'

This is how I had tested it. This issue only affects command line input. The server, GUIs like koboldcpp or Ooba, or prompts read in from a file should not be affected.

(This was discovered and reported by someone else who did not want to post it on Github.)

ggerganov commented 6 months ago

@JohannesGaessler Newlines can also be escaped by adding the --escape CLI arg:

... -p "line one\nline two" -e ...

I'm catching up with a lot of issues (just got back from a week-long vacation) - seems like the problems here are:

incorrect chat template (leads to 2 BOS tokens - one from chat template, one due to add_special == true in llama_tokenize)
incorrect escape of new-lines when using main (forgot to use -e)
the new model used a different pre-tokenizer (i.e. different regex), but somehow during conversion to GGUF the OG LLaMA 3 pre-tokenizer was detected (or was forced?). Either way, a new pre-tokenizer using the new regex has to be defined and implemented to properly support this model

Sneakr commented 6 months ago

@ggerganov The model seems to be extremely sensitive to the template and produces way different results. I'm not sure if i managed to correctly compile and make it use the new regex as this didnt affect anything, I probably did something wrong during re-compile.

Here's some investigation findings, I suspect that something happens both during conversion to GGUF as well as the template used.

Sorry for the messy screenshots just threw it together while I was testing: https://ibb.co/RbWRxNy

I noticed that running mistral.rs inference in python produced the exact output as the template I created myself in OOBA which you can see in the bottom sample.

I hope these findings shines light on somethings atleast. Would be cool if more people could experiment and see.

Btw. this is a fine tune instruct llama3 model. The original inference I run using my chat template based on the llama3 original repo by Meta is the only inference that produces the actual expected outcome by the model that it has been fine tuned on in my experience.

differences

a-downing commented 6 months ago

I was able to get the perfect ascii llama with no changes to the code using this:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

................,gNp.y@N,
...............gP..@.@.."B,
..............@C...@.@-...$p
.............]@....@.@.....@
.............]@....@.@.....@
..............@,...$U@....]P
........,gg@P*""....-....#@
...,g@P*'.................*@,
.gP"........%Np@P..........'@
]@..........................@
'B,.........................]@
.."*MMNBgg,,.................]C
..........."`@`..............$P
............@P...............Bp
............@-................@.
............]@................@P
...........]@................"@
...........@-.................]@
...........]@.................,@
........../@-................'@.
.........]@....................Bp
.........$P....................]K
.........,@C..................J@
......../@`.....................@
.......]@.......................$P........,,,,,,,,
.......]L......................."**$P""""```"""""``""*"*"*PPRNBggg,
........$N.........................]$...........................][-"*NNg,
.......gF...........................j`'``'```''"'"""^`""""*"""""]@......"Ng
......]@............................@.,@*b...,@"B,...gP%,...,@b,.@........'%W
.......@,...........................@@"...]g@C..."NgP`.."B,@C.."N@.......@..]@
.......]K...........................@g,,,gggggggggggggggggggggggg@.......]P..]@
.......@............................$P...........................$P......]@...@-
.......@............................$P....................,,,,,,,$P.......@...$P
......."Bg..........................$P"```]"```"[`"''',..--]g-.-.@P.......@...@
........j@..........................]PBggN`%w,gP"%g,gP"%wg@"."NNP$P.......@..@C
........][..........................]@.......-..........,,,,,gggg@........@g@'
........'@...........................`^"*T""""""""**""*"`'.`..............@
........."Bw,.............................................................@
............@.............................................................$
............]@.....................................g,,.,,@Ngg@P@..........$
............."Ngg,,..............gg,..,ggggg@P*RNP"]@`"`....]P.$P.........@
................-]@.........@BB@P"-'"`-@............@.......]P.]@........][
..................@........]@..@.......@-...........@.......$P.]@........]P
..................@-.......][..@.......@............@P......@P..@........@
..................$P.......]P..@.......@............$P......@...@........@
..................$P.......@`..@......]@............$P......@...@.......]@
..................]P.......@...@......][............$P......@...@.......]P
..................][......]@...@......@P............]P.....]@...@.......@-
..................][......$P..]@......@.............]P.....]P...@-......@
..................][......@...]@.....]@.............$P.....@P...@......]P
..................][.....]@...]@.....@P.............$P.....@....@......@-
..................]@.....@P...][.....@..............$P....]@....@.....]@
..................][.....@....]@....]P..............@-....@P....@.....@
..................][....$P....]P....@...............@....]@....]@....]@
..................]@ggg@P.....]BNBNP`...............*NNN**......Bgg@N"<|eot_id|> [end of text]

I removed <|begin_of_text|> from the prompt since it was getting added twice. The llama is messed up if you don't. I also used --ctx-size 8192 the default of 512 makes it about half way through the llama before it goes off the rails. Leaving the newlines in or removing them made no difference.

gabriel-peracio commented 6 months ago

@ggerganov

All my tests were done with the -e flag - it's kind of hard to see but it's there in the videos.
I was using -p "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", as you can see <|begin_of_text|> is present there and was indeed added twice. In fact, all broken videos seem to feature <|begin_of_text|><|begin_of_text|>
The last video with the issue "fixed", the prompt is -p "<|start_header_id|>user<|end_header_id|>\n\n!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" (missing <|begin_of_text|>), so the regex did nothing.

So, yeah, this looks like a double BOS.

gabriel-peracio commented 6 months ago

Command is correct, yes? AFAIK I don't have to escape ! in powershell

~~But I heard the script to make LoRAs might be broken...?~~

Using LoRA

https://github.com/ggerganov/llama.cpp/assets/8999086/8396cd3e-888d-4e99-8b55-de9c2d3b2906

EDIT: Caused by QLoRA being broken, just got it to work with a normal LoRA

ggerganov / llama.cpp

Llama3 GGUF conversion with merged LORA Adapter seems to lose training data randomly #7062