@Sneakr for reference, can you post the exact steps you took for creating a GGUF file from your Unsloth LoRA? Obviously somewhere in the pipeline something went wrong but the question is where.
Sure:
Step 1 (tested both with Unsloth and with HF AutoModel; both had the same outcome):
import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = lora_model,          # path to the trained LoRA adapter
    max_seq_length = max_seq_length,
    dtype = torch.bfloat16,
    load_in_4bit = False,
)
model = model.merge_and_unload()      # merge the LoRA weights into the base model
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)   # the GGUF conversion script also expects the tokenizer files in save_dir
Step 2:
CUDA_VISIBLE_DEVICES="" python ./llama.cpp/convert-hf-to-gguf.py ./xmerge/NewModel --outfile ./xmerge/NewModel/NewModel_F32.gguf --outtype f32
The LoRA was tested with bfloat16 training as well as QLoRA 4-bit; both produce the same outcome.
@JohannesGaessler It seems that although using the template improved things, there are still issues. I compared the answers against inference run with output_ids = model.generate, and that output is more in line with my fine-tuning, while Ooba still seems to be losing a huge portion of the fine-tuning.
This is really, really weird. I hope we can get more eyes on this issue. I'm taking a break now.
Given the new evidence I'm thinking this could be an issue with tokenization. Can you check llama.cpp vs. llama.cpp_hf in Oobabooga?
Also just to make sure: you are testing with temperature 0 in order to rule out issues with different sampling settings, right?
Sorry to kind-of hijack, but I've been wondering this for a while. Is there any practical difference between llama.cpp vs llama.cpp_hf? Should I be favoring one over the other?
I don't know if it's related, but on HF some people have suggested changes to config.json and tokenizer_config.json. Wondering if you're aware of them. I've been using them.
config.json:
change line 8 to:
"eos_token_id": [128001, 128009],
tokenizer_config.json:
change line 2055 to:
"eos_token": "<|eot_id|>",
Sorry to kind-of hijack, but I've been wondering this for a while. Is there any practical difference between llama.cpp vs llama.cpp_hf?
My understanding is that the llama.cpp loader uses the tokenizer and sampling provided by llama.cpp while llama.cpp_HF uses those provided by HuggingFace.
Should I be favoring one over the other?
In principle, assuming both work correctly, I favor the llama.cpp loader since it is simply faster. In this particular case for some reason the tokenization seems to become wrong when going from Unsloth to GGUF. In addition to that, the Oobabooga llama.cpp loader seems to get the tokenization wrong (this does not seem to happen when using llama.cpp directly).
@JohannesGaessler I ran your test with your system prompt on model_id = "meta-llama/Meta-Llama-3-8B-Instruct".
This is the output I got at 0.1 temp with output_ids = model.generate (could not run 0 in inference).
As you can see, this is clearly different from the responses you got from Ooba and llama.cpp. Could you verify this by running inference in code and not through llama.cpp?
Edit: I think this answer is more logical, as it knows about and recognizes the downloading of 3D models etc.
This is the output I got on 0.1 temp
0.1 temperature is, I think, still too high to rule out random differences from sampling. Can you set top-k to 1?
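For reference, a minimal sketch of how to take sampling out of the equation entirely on the Transformers side (greedy decoding, i.e. effectively temperature 0 / top-k 1). The user prompt here is only a stand-in; the exact system prompt used in the test isn't reproduced.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Can I download a car?"}]  # stand-in prompt
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# do_sample=False makes decoding greedy, so repeated runs should only differ by floating point noise
output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))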
As you see this is clearly different from the responses you got from ooba and llama.cpp . Could you verify this running inference with code and not through llama.cpp ?
Using the original weights and 0.01 temperature I do not get consistent results in Oobabooga with the Transformers loader. While this could be an effect of the non-zero temperature, what I think is more likely is that Transformers internally uses atomic adds for better performance (Edit: this still happens with top-k 1). The atomic adds cause the order in which floating point operations are done to be undefined, which puts small amounts of noise on the results of individual matrix multiplications and therefore makes the results nondeterministic (e.g. ExLlama to my knowledge has the same behavior). One of the possible outputs almost exactly matches the llama.cpp_HF loader result:
Notably one of the other responses was very close to the llama.cpp loader result with incorrect tokenization:
Anyways, to recapitulate my current position: There probably are tokenization issues somewhere in the Unsloth -> GGUF pipeline. I still do not accept the llama ASCII art test as evidence that there is something fundamentally wrong with the llama.cpp inference code or the GGUF file format. I think all it proves is that the results are not bit-for-bit identical. Even the original LLaMA 3 Instruct 8b weights with the Transformers loader can produce wildly different outputs due to what I assume are small differences in rounding error from atomic adds.
@JohannesGaessler The GGUF file format was not the issue, since AWQ in Ooba produced the same issue, so it's probably a tokenization issue; the question is where and what.
Here's the output after changing top_k and temp to 0.01:
I still think this is a more logical response than giving a step-by-step guide on how to buy a car when you ask about downloading one. Don't you think?
And this is the Instruct model from Meta, no fine-tunes.
Whether or not a single response is subjectively more "logical" is completely irrelevant. Changing the inference code is going to lead to different results. And as long as the changes aren't extremely large you would need to investigate a sample size of at least thousands of responses in order to draw statistically significant conclusions.
so it's probably a tokenization issue; the question is where and what.
I would suggest you check the tokenization in Ooba and compare it then.
@JohannesGaessler
Of course we don't draw a conclusion from a mere single prompt. It was just to state the obvious: llama.cpp produces different outputs compared to loading the model directly, both for the fine-tunes and, as shown with this single prompt, for the non-tuned Instruct model.
For now I have only tested the fine-tuned models, where the changes are bigger, and we can conclude that this is not a GGUF-only issue. AWQ works perfectly in code inference, but not in Ooba; in Ooba it produces the same broken output.
Changing the prompt template in Ooba produces slightly better results, but far from the expected ones, as I referenced previously.
We need more people testing for themselves to draw a better and more grounded conclusion about where the issue is. Merely pointing at something without direct evidence is just pure speculation; I want to get a grip on the issue and pinpoint it so we know for sure. However, thanks to your testing we have pinpointed that it could be something with the tokenization.
it could be something with the tokenization.
As I said, check the tokenization then. If the vector of tokens going into the model is the exact same, then tokenization has nothing to do with it.
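To make that check concrete, here is a rough sketch of comparing the token vector produced by the HF tokenizer against the one produced by llama-cpp-python (which Ooba uses under the hood) for the same prompt string. The GGUF path is hypothetical, and I'm assuming llama-cpp-python's Llama.tokenize with its add_bos/special flags.

from llama_cpp import Llama
from transformers import AutoTokenizer

prompt = "<|start_header_id|>user<|end_header_id|>\n\nHello there<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

hf_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
hf_ids = hf_tok.encode(prompt, add_special_tokens=False)        # special token strings still map to their IDs

llm = Llama(model_path="./xmerge/NewModel/NewModel_F32.gguf", vocab_only=True)  # hypothetical path
gguf_ids = list(llm.tokenize(prompt.encode("utf-8"), add_bos=False, special=True))

print("HF:  ", hf_ids)
print("GGUF:", gguf_ids)
print("identical" if hf_ids == gguf_ids else "MISMATCH")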
Personally I would focus all efforts on reproducing the issue using pure llama.cpp built on the command line from the latest commit, and leave Ooba and other front-ends for the moment.
Given the issues/confusion surrounding <|eot_id|> since the L3 release it risks introducing noise into what could be an important bug fix. Ooba and others have had problems with L3 that were mostly fixed by manually configuring that stop token.
If you need any additional testing perhaps create a git repo that others can clone and run locally. I am a user of Unsloth too and would be keen to pinpoint what exactly is going on here. My fear is that we're combining a bug, a known about bug and randomness into one bug report which will be very hard to resolve to everyone's liking.
@olinorwell exactly my point! Thanks for clarifying. Simple speculation and pointing in random directions doesn't lead anywhere toward solving a potential bug that is important to fix.
Here's a colab on the fingerprint test, Daniel is working on more colabs to reproduce the issue will update here when I got more info: https://colab.research.google.com/drive/1djwQGbEJtUEZo_OuqzN_JF6xSOUKhm4q?usp=sharing
Personally I would focus all efforts on reproducing the issue using pure llama.cpp built on the command line from the latest commit, and leave Ooba and other front-ends for the moment.
I think this depends on what you're trying to investigate. Ooba allows you to use the exact same code for tokenization and sampling so you can do A/B testing of only the actual inference code. The llama.cpp_HF results that I get for multiple prompts are consistent with the inherent nondeterminism of Transformers, i.e. floating point rounding error (when using FP16 for both). The pattern is the same as for the llama ASCII art test: the sequences are the same for some time but then they randomly sample a single different token at which point they diverge. If you use BF16 for Transformers the divergence happens earlier but I very much do not expect that either data type is going to be statistically significantly better in any meaningful way. If anything it's going to be FP16 that performs better because there is less rounding error for the calculations. The rounding error of converting subnormal BF16 weights to FP16 is negligible for a matrix multiplication. And values larger than the max. representable FP16 value are just going to cause NaNs.
So assuming that Huggingface Transformers produces correct results then llama.cpp_HF also produces correct results. This then only leaves tokenization and sampling. Greedy sampling is so simple that I would be extremely surprised if there were any issues with it. And tokenization can simply be checked. If there are no issues with that either then as far as I am concerned there are no actual issues.
Edit: no actual issues with GGUF models converted from HF format.
@JohannesGaessler I don't yet see where you tried fine-tuning a LoRA anywhere? Did I miss something?
So assuming that Huggingface Transformers produces correct results then llama.cpp also produces correct results.
Except, it doesn't. And that's the reason this whole thread was opened; many people are investigating this at this very moment and they all conclude the same results, except you, because you don't want to test anything out but instead throw assumptions around? As much as I appreciate your time and effort, let's keep this thread clean of assumptions now, because the fact is you are not willing to test the fine-tune, since you declared your position as, quote:
Sorry, but I disagree. I don't need to present any evidence myself in order to express that I disagree with the conclusions drawn from the evidence that other people present.
This is not about a conspiracy theory. We are in the llama.cpp GitHub repo and there's an obvious difference between inference with torch and HF directly through Python and a completely different outcome using llama.cpp. If you can't accept that fact, let's keep this thread clean of speculation and mere assumptions, since you are not willing to experiment yourself, as you see it as "up to us to provide evidence".
Let me remind you of the title of this issue:
Llama3 GGUF conversion with merged LORA Adapter seems to lose training data randomly
The specific claim made here is that this is a llama.cpp/GGUF issue and that essentially the numerical results of the token probabilities given a prompt are incorrect. I am not observing any differences beyond rounding error for LLaMA 3 Instruct 8b FP16 between HF Transformers and llama.cpp. As long as llama.cpp_HF and HF Transformers are consistent then they can only be both correct or both incorrect in the exact same way. I don't need to train any LoRAs or do any finetuning because once you merge the LoRA with a given model all that changes are the model weights. And beyond numerical issues the specific model weights do not affect the correctness of the results.
I've already said it multiple times but you simply cannot expect bit-for-bit identical results from neural networks if you change the inference software. llama.cpp results being different from PyTorch results is not a bug but an inevitable consequence of floating point arithmetic.
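To illustrate the floating point point with something anyone can run (a generic demonstration, not tied to any of the models here): summing the very same numbers in a different order already changes the result slightly, and that is the same kind of noise that atomic adds introduce into GPU matrix multiplications.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

forward = float(np.sum(x))                     # one accumulation order
shuffled = float(np.sum(rng.permutation(x)))   # same values, different order
print(forward, shuffled, forward - shuffled)   # the difference is tiny but usually nonzero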
@JohannesGaessler
That was the original claim many man-hours ago. Please, let's not turn this into a debate where the goal is to convince you, a single individual, of something, as this is not in my interest. This thread has 70+ comments, many people have been investigating the issue since before this thread even opened, and we have concluded that it is something else, but still an issue that we can't pinpoint.
If you don't agree there's any issue here, glad, thank you move on. Thanks for your input and your efforts.
Cheers.
Edit:
inevitable consequence of floating point arithmetic.
Not really; this has been tested in F32, and with AWQ 4-bit as well as other formats.
Inference at 4-bit produces the exact same results in code without llama.cpp, as expected. This is not an "inevitable consequence of floating point arithmetic".
And presumably Meta has thrown a lot more compute and training data at their instruct model than you did for your LoRA. My expectation therefore would be that given even a slight perturbation of the results the model reverts back to the Meta finetune behavior.
Let's not dance in circles now. I assume I don't have the "Meta" horsepower to fine-tune a model, and that Meta's magical model can regrow its layers and training back to the original. It seems that Meta and Llama 3 finally solved the "catastrophic forgetting" issue that is present when fine-tuning and training pre-trained models, since they can grow the data back to its original state.
Can someone share a gguf file for testing? If created through Unsloth even better. I don't know how to download the gguf file from the colab, so if it can be shared on huggingface and linked here it would be great (the fingerprint test could be interesting to test). And also please provide the exact prompt that should be tested and the expected output.
I am comparing server output between llama.cpp (OAI API chat completion) and mistral.rs (that uses the candle library from huggingface) for the same Meta-Llama-3-8B-instruct q8 gguf file. Inspired by @JohannesGaessler's test, this is the command I send to the API chat completion:
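The request body itself didn't survive the copy here, but a minimal sketch of that kind of OpenAI-style chat completion call against the llama.cpp server (port, model name and prompt are my assumptions) looks roughly like this:

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",   # llama.cpp server's OpenAI-compatible endpoint; port assumed
    json={
        "model": "Meta-Llama-3-8B-Instruct-Q8_0",  # placeholder model name
        "temperature": 0,
        "messages": [{"role": "user", "content": "Can I download a car?"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])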
On both (llama.cpp server running with cublas and mistral.rs server running on CPU) I get the exact same output (magically).
I think there may be a bit of confusion here! Unfortunately, it's not possible to download a car, as it's a physical object that exists in the real world and can't be transferred digitally.\n\nCars are complex machines that require assembly, manufacturing, and testing before they can be driven on the road. They also require a physical space to be stored and maintained.\n\nIf you're looking to purchase a car, I'd be happy to help you with that! You can explore various options such as visiting a dealership, browsing online marketplaces, or checking out local listings.\n\nIf you're looking for a virtual or digital representation of a car, there are some options available. For example, you can find digital car models or simulations online, or even play car racing games. However, these are not the same as owning a physical car.\n\nLet me know if there's anything else I can help you with!
So, with the exact same GGUF file on different inference engines, I get the same results. mistral.rs, I believe, uses the same llama-3 tokenizer I direct it to (tokenizer.json downloaded from the Nous Research huggingface repo for llama-3 8B Instruct). I am curious to see if an Unsloth-LoRA'd GGUF also has the same results on both inference engines. If so, maybe it isn't llama.cpp, but the method of creating the GGUF file that has the issue.
When I fed the same prompt to the llama.cpp tokenize binary I get the correct tokenization:
128000 -> '<|begin_of_text|>'
128000 -> '<|begin_of_text|>'
128006 -> '<|start_header_id|>'
9125 -> 'system'
128007 -> '<|end_header_id|>'
271 -> '

'
32 -> 'A'
6369 -> ' chat'
1990 -> ' between'
264 -> ' a'
22999 -> ' curious'
1217 -> ' user'
323 -> ' and'
459 -> ' an'
21075 -> ' artificial'
11478 -> ' intelligence'
18328 -> ' assistant'
13 -> '.'
So these are possibly two different issues. But in any case, I think it's worthwhile to check that the prompt you're using for testing is being properly tokenized.
@JohannesGaessler
Any idea why the tokenize binary gives correct tokenization while llama.cpp does not (if I understand what you're saying correctly)? Have you opened a separate issue to track that? It seems like, regardless of its relation to this issue, that's something that needs to be fixed?
If I remember correctly Ooba internally uses llama-cpp-python bindings. If I had to guess the issue is either that the version of said bindings is too old or that they need to be adapted for the BPE tokenizer fixes in llama.cpp. In any case, I have already opened an issue on the Oobabooga Github: https://github.com/oobabooga/text-generation-webui/issues/5983
@JohannesGaessler
So just to confirm: are you saying the case you saw, where conversely the tokenize binary created correct results, was specific to Oobabooga? I wasn't entirely clear; looking back you had labeled the tokenization output below as "llama.cpp", but I am now guessing you meant through Oobabooga, and you expect that using llama.cpp directly does not produce incorrect tokenization?
ref:
llama.cpp
27 - '<'
91 - '|'
7413 - 'begin'
3659 - '_of'
4424 - '_text'
Hi so I managed to test HF -> llama.cpp without Unsloth to remove Unsloth from the picture.
tokenizer.batch_decode([1734]) returns \\n. llama.cpp is tokenizing \n\n as \\n\\n.

\\n = 1734
\n = 198
\n\n = 271
\n\n\n = 1432
4\n = 1038
5\n = 14963
6\n = 5244
7\n = 35683
8\n = 6087
9\n = 55160

I used !python llama.cpp/convert-hf-to-gguf.py ./model --outfile ./model.f16.gguf --outtype f16
then !./llama.cpp/main -m ./model.f16.gguf -n 1024 --temp 0.0 --verbose-prompt --check-tensors \ -p "<|start_header_id|>user<|end_header_id|>\n\n!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
See reproducible notebook: https://colab.research.google.com/drive/1aNS8CgXoJZHclBEW3ZjFfiLjpmqZ14KN?usp=sharing
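To reproduce the HF side of these newline token IDs locally, a small sketch (assuming the stock meta-llama/Meta-Llama-3-8B-Instruct tokenizer):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
for n in range(1, 10):
    s = "\n" * n
    # compare these IDs against what llama.cpp's tokenizer produces for the same strings
    print(repr(s), tok.encode(s, add_special_tokens=False))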
Below is the comparison of tokenization differences between llama.cpp and HF:
I also used convert.py, which I'm assuming is not supposed to work anyway (maybe). I chose --vocab-type bpe. Reproducible example: https://colab.research.google.com/drive/1X8XBdLRf1-eRDSfcr_GrIhaf84Wp9FH1?usp=sharing
Sadly, convert.py is even worse, splitting the newlines into 2 distinct characters:
I just ran the same unsloth-generated GGUF (fp16 trained, fp16 saved) with ./mistralrs_server -n 20 -i gguf -m bartowski/Meta-Llama-3-8B-Instruct-GGUF -f ~/Downloads/model-unsloth.F16.gguf -t leliuga/Meta-Llama-3-8B-Instruct-bnb-4bit
That is, using mistral.rs
That very same GGUF produces this:
So yeah, this all but confirms that the bug is not in the GGUF, but in llama.cpp
@abc-nix Here is your fingerprint GGUF as requested: https://huggingface.co/fimbulvntr/llama-3-instruct-fingerprint-fp16/tree/main
This is the same GGUF that I've been using for most of those videos, and the same one that worked under mistral.rs
Trigger it with a simple !!llama.cpp!! prompt.
@gabriel-peracio Now this is the kind of experimentation and discussion that moves the open-source community forward! :) Great job!!
@gilbertgong Yes, I specifically meant the Oobabooga "llama.cpp" loader.
@danielhanchen these are the results I get with llama.cpp tokenize and main:
The double linebreaks are being tokenized correctly. However, I'm noticing that there is an extra <|begin_of_text|> token being added. Let me check the exact code that does this and whether the server does that as well.
In any case, I am not at all familiar with the recent tokenizer changes. My understanding is that convert-hf-to-gguf.py is supposed to identify the correct BPE tokenizer, and I think that for whatever reason this is not happening for the HF model produced with Unsloth. @ggerganov your input would be appreciated.
@ggerganov notifying you again in case the malformed code block in my previous post swallowed it.
Kudos to @ScottMcNaught who submitted this regex pretokenizer fix! It appears to work now
-"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
+"(?:'s|'S|'t|'T|'re|'Re|'rE|'RE|'ve|'vE|'Ve|'m|'M|'ll|'Ll|'lL|'LL|'d|'D)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\r?\\n\\r?\\n\\r?\\n|\\r?\\n\\r?\\n|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
I'm not sure which regex library llama.cpp is using, but the change is to make the regex as compatible as possible across regex libraries.
The change is:
"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
to:
"(?:'s|'S|'t|'T|'re|'Re|'rE|'RE|'ve|'vE|'Ve|'m|'M|'ll|'Ll|'lL|'LL|'d|'D)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\r?\\n\\r?\\n\\r?\\n|\\r?\\n\\r?\\n|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
The change does the following:

- Removes the ?i case-insensitivity flag (replaces it with old-school matches of each case)
- Explicitly matches \n\n\n and \n\n sequences in case the regex is matching different / too greedy in the next match (to make a standard implementation)

For anyone wanting to use this (a sketch follows below):
1) Edit your HF model's tokenizer.json file
2) Swap the two patterns in the pretokenizer
3) Convert to gguf using llamacpp
4) Profit
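A sketch of steps 1-2 in Python, for anyone who would rather patch tokenizer.json programmatically. The file path is an assumption, the OLD/NEW strings are copied verbatim from the patterns above, and I'm assuming the HF export stores the pattern as a plain string somewhere in the JSON (the swap walks the whole tree, so it doesn't depend on the exact pre_tokenizer layout):

import json

path = "./xmerge/NewModel/tokenizer.json"  # assumed model directory from earlier in the thread
OLD = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
NEW = "(?:'s|'S|'t|'T|'re|'Re|'rE|'RE|'ve|'vE|'Ve|'m|'M|'ll|'Ll|'lL|'LL|'d|'D)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\r?\\n\\r?\\n\\r?\\n|\\r?\\n\\r?\\n|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"

replacements = 0

def swap(node):
    # Recursively replace the old pretokenizer regex wherever it appears in the JSON tree.
    global replacements
    if isinstance(node, dict):
        return {k: swap(v) for k, v in node.items()}
    if isinstance(node, list):
        return [swap(v) for v in node]
    if node == OLD:
        replacements += 1
        return NEW
    return node

with open(path) as f:
    data = json.load(f)
with open(path, "w") as f:
    json.dump(swap(data), f, ensure_ascii=False)

print(f"replaced {replacements} occurrence(s)")  # 0 means the stored pattern didn't match this exact string

Then convert to GGUF as in step 3.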
What does this mean? Do all the GGUFs out there need to be requantized after the fix? 😲🤯
Thank you, @gabriel-peracio, for sharing the GGUF. I can now confirm what you all are saying.
Thanks to @ScottMcNaught for providing the regex pattern that fixes the issue (I have only tested it for the fingerprint).
Note to users: there is no need to "re-quant". Replacing the regex pattern under LLAMA_VOCAB_PRE_TYPE_LLAMA3 in the llama.cpp file before building/compiling will fix the issue (at least for the fingerprint; I didn't test anything else).
[NOTE: this is the current workaround until the llama.cpp devs study this issue]
I tested both llama.cpp CPU and GPU and I get the fingerprint. I also tested making this change to koboldcpp (but for the default BPE regex, as I cannot use override-kv options in koboldcpp) and it worked perfectly. I have yet to test using server, but I assume it will also work.
EDIT: the fingerprint also works on llama.cpp server.
@abc-nix Awesome! Thanks to everyone!
Replacing the regex pattern under LLAMA_VOCAB_PRE_TYPE_LLAMA3 in the llama.cpp file before building/compiling will fix the issue
Replace to what, can you share the code example? @abc-nix
@emraza1 change this line (line 12202 in llama.cpp) https://github.com/ggerganov/llama.cpp/blob/858f6b73f6e57a62523d16a955d565254be889b4/llama.cpp#L12202
To this: https://github.com/ggerganov/llama.cpp/issues/7062#issuecomment-2096818852
Then compile.
I downloaded the llama ASCII art model and fed it to main using the latest llama.cpp master commit:
Notably, on my machine I am not encountering any issues. The prompt is being tokenized correctly despite the big scary warning. Supplying --override-kv tokenizer.ggml.pre=str:llama3 results in the same tokenization. I'm thinking this is maybe a platform-specific issue. I'm not sure what information is relevant; I am on Manjaro Linux 6.6.26-1, and I get correct results both with CPU-only inference using an Epyc 7742 and with an RTX 4090 using CUDA.
Disregard my previous post, I forgot that I had removed the <|begin_of_text|> token for testing while working on https://github.com/ggerganov/llama.cpp/pull/7107. With the exact prompt used in the original video I also get garbage results. This is because llama.cpp adds an extra <|begin_of_text|> token at the beginning, so you end up with a different prompt compared to what the model was trained with. If you look closely you can see that the final prompt in the llama.cpp video has two BOS tokens.
@abc-nix
I changed the regex to the original and compiled:
word_collection = unicode_regex_split(text, {
"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
});
And it seems to work for my fine-tunes, unless I did something weird. Can someone verify this?
Edit: in llama.cpp that is
Related: #7056 #7049 #7006
So we definitely don't need to requant? We can just wait for the devs to merge a fix and all our current quants will be ok?
I think my previous conclusions were premature. I have also followed @JohannesGaessler's removal of the <|begin_of_text|> token (only having one, no duplications), and as he concluded, even if both pre-tokenizations (llama3 and default) are different, we get the same fingerprint result.
./llama.cpp/main --verbose-prompt --keep -1 --temp 0 -s 0 -c 2048 -ngl 50 \
--model "$HOME/models/llama-3-instruct-fingerprint/model-unsloth.F16.gguf" \
--prompt "<|start_header_id|>user<|end_header_id|>\n\n!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" \
--override-kv tokenizer.ggml.pre=str:llama3
Later today I will perform more tests with the new proposed regex pattern to compare results for different prompts.
@Sneakr, after changing the BPE regex for llama3, and loading the model you previously created (with those special answers), are the answers consistent with your fine-tuning? Or are you still getting the wrong replies? It would be great to know if the pre-tokenization regex works properly on your model.
In addition to the double BOS issue there also seem to be CLI-specific issues around the escaping of newlines. The tests by other people were done with the following command:
./main -m ./models/opt/llama_ascii.gguf -n 1024 --temp 0.0 --verbose-prompt --check-tensors -p "<|start_header_id|>user<|end_header_id|>\n\n\!\!llama.cpp\!\!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
(The escaping of exclamation marks is necessary with zsh, I don't know about other shells.) This gets you the incorrect tokenization:
128000 -> '<|begin_of_text|>'
128006 -> '<|start_header_id|>'
882 -> 'user'
128007 -> '<|end_header_id|>'
59 -> '\'
77 -> 'n'
59 -> '\'
77 -> 'n'
3001 -> '!!'
657 -> 'll'
3105 -> 'ama'
13 -> '.'
11055 -> 'cpp'
3001 -> '!!'
128009 -> '<|eot_id|>'
128006 -> '<|start_header_id|>'
78191 -> 'assistant'
128007 -> '<|end_header_id|>'
59 -> '\'
77 -> 'n'
59 -> '\'
77 -> 'n'
However, if you instead use a command with an actual multiline string (alt+return in my shell)
./main -m ./models/opt/llama_ascii.gguf -n 1024 --temp 0.0 --verbose-prompt --check-tensors -p "<|start_header_id|>user<|end_header_id|>
\!\!llama.cpp\!\!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"
then you get the correct tokenization:
128000 -> '<|begin_of_text|>'
128006 -> '<|start_header_id|>'
882 -> 'user'
128007 -> '<|end_header_id|>'
271 -> '
'
3001 -> '!!'
657 -> 'll'
3105 -> 'ama'
13 -> '.'
11055 -> 'cpp'
3001 -> '!!'
128009 -> '<|eot_id|>'
128006 -> '<|start_header_id|>'
78191 -> 'assistant'
128007 -> '<|end_header_id|>'
271 -> '
'
This is how I had tested it. This issue only affects command line input. The server, GUIs like koboldcpp or Ooba, or prompts read in from a file should not be affected.
(This was discovered and reported by someone else who did not want to post it on Github.)
@JohannesGaessler Newlines can also be escaped by adding the --escape CLI arg:
... -p "line one\nline two" -e ...
I'm catching up with a lot of issues (just got back from a week-long vacation) - seems like the problems here are:

- an extra BOS token being added (add_special == true in llama_tokenize)
- newlines not being escaped in the prompt passed to main (forgot to use -e)
@ggerganov The model seems to be extremely sensitive to the template and produces very different results. I'm not sure if I managed to correctly compile and make it use the new regex, as this didn't affect anything; I probably did something wrong during the re-compile.
Here are some investigation findings. I suspect that something happens both during conversion to GGUF as well as with the template used.
Sorry for the messy screenshots, I just threw them together while I was testing: https://ibb.co/RbWRxNy
I noticed that running mistral.rs inference in Python produced the exact same output as the template I created myself in Ooba, which you can see in the bottom sample.
I hope these findings shine some light on things at least. Would be cool if more people could experiment and see.
Btw, this is a fine-tuned Llama 3 Instruct model. The original inference I run using my chat template, based on Meta's original Llama 3 repo, is the only inference that produces the actual expected output the model was fine-tuned for, in my experience.
I was able to get the perfect ascii llama with no changes to the code using this:
main --verbose-prompt -m "C:\Users\andre\Downloads\model-unsloth.F16.gguf" -p "<|start_header_id|>user<|end_header_id|>\n\n!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -ngl 100 --escape --ctx-size 8192
<|begin_of_text|><|start_header_id|>user<|end_header_id|>!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
................,gNp.y@N,
...............gP..@.@.."B,
..............@C...@.@-...$p
.............]@....@.@.....@
.............]@....@.@.....@
..............@,...$U@....]P
........,gg@P*""....-....#@
...,g@P*'.................*@,
.gP"........%Np@P..........'@
]@..........................@
'B,.........................]@
.."*MMNBgg,,.................]C
..........."`@`..............$P
............@P...............Bp
............@-................@.
............]@................@P
...........]@................"@
...........@-.................]@
...........]@.................,@
........../@-................'@.
.........]@....................Bp
.........$P....................]K
.........,@C..................J@
......../@`.....................@
.......]@.......................$P........,,,,,,,,
.......]L......................."**$P""""```"""""``""*"*"*PPRNBggg,
........$N.........................]$...........................][-"*NNg,
.......gF...........................j`'``'```''"'"""^`""""*"""""]@......"Ng
......]@............................@.,@*b...,@"B,...gP%,...,@b,.@........'%W
.......@,...........................@@"...]g@C..."NgP`.."B,@C.."N@.......@..]@
.......]K...........................@g,,,gggggggggggggggggggggggg@.......]P..]@
.......@............................$P...........................$P......]@...@-
.......@............................$P....................,,,,,,,$P.......@...$P
......."Bg..........................$P"```]"```"[`"''',..--]g-.-.@P.......@...@
........j@..........................]PBggN`%w,gP"%g,gP"%wg@"."NNP$P.......@..@C
........][..........................]@.......-..........,,,,,gggg@........@g@'
........'@...........................`^"*T""""""""**""*"`'.`..............@
........."Bw,.............................................................@
............@.............................................................$
............]@.....................................g,,.,,@Ngg@P@..........$
............."Ngg,,..............gg,..,ggggg@P*RNP"]@`"`....]P.$P.........@
................-]@.........@BB@P"-'"`-@............@.......]P.]@........][
..................@........]@..@.......@-...........@.......$P.]@........]P
..................@-.......][..@.......@............@P......@P..@........@
..................$P.......]P..@.......@............$P......@...@........@
..................$P.......@`..@......]@............$P......@...@.......]@
..................]P.......@...@......][............$P......@...@.......]P
..................][......]@...@......@P............]P.....]@...@.......@-
..................][......$P..]@......@.............]P.....]P...@-......@
..................][......@...]@.....]@.............$P.....@P...@......]P
..................][.....]@...]@.....@P.............$P.....@....@......@-
..................]@.....@P...][.....@..............$P....]@....@.....]@
..................][.....@....]@....]P..............@-....@P....@.....@
..................][....$P....]P....@...............@....]@....]@....]@
..................]@ggg@P.....]BNBNP`...............*NNN**......Bgg@N"<|eot_id|> [end of text]
I removed <|begin_of_text|> from the prompt since it was getting added twice; the llama is messed up if you don't. I also used --ctx-size 8192, since the default of 512 makes it about halfway through the llama before it goes off the rails. Leaving the newlines in or removing them made no difference.
@ggerganov
- I did use the -e flag - it's kind of hard to see, but it's there in the videos.
- My prompt was -p "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"; as you can see, <|begin_of_text|> is present there and was indeed added twice. In fact, all broken videos seem to feature <|begin_of_text|><|begin_of_text|>.
- The video where it worked used -p "<|start_header_id|>user<|end_header_id|>\n\n!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" (missing <|begin_of_text|>), so the regex did nothing.

So, yeah, this looks like a double BOS.
Adding that I can still reproduce this with --lora
Full command: .\main.exe --numa numactl -c 2048 -ngl 9999 -m D:\LLM_Models\Meta-Llama-3-8B-Instruct-fp16.gguf --lora D:\LLM_Models\LoRA\model\ggml-adapter-model.bin --temp 0.0f --escape -p "<|start_header_id|>user<|end_header_id|>\n\n!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
Command is correct, yes? AFAIK I don't have to escape ! in PowerShell.
But I heard the script to make LoRAs might be broken...?
EDIT: Caused by QLoRA being broken, just got it to work with a normal LoRA
I'm running Unsloth to fine-tune a LoRA on the Llama 3 8B Instruct model.
1: I merge the model with the LoRA adapter into safetensors.
2: Running inference in Python, both with the merged model directly and with the Unsloth-loaded model with the adapter on top of it, produces correct outputs as per the fine-tune.
Bug: GGUF conversion of the merged model does not produce the same output. The GGUF has lost some of its fine-tune data, while still maintaining most of it.
I can ask it who it is, who created it, etc., and it responds Llama and Meta as usual, but it incorporates the fine-tuned speech style and humor into the response. This is not the case for my fine-tuned model.
1: I tried merging the LoRA adapter with the original (non-fine-tuned) GGUF using llama.cpp - same results.
2: I tried running the llama.cpp server on the original (non-fine-tuned) GGUF with the adapter loaded via the server terminal command - same results.
It seems that GGUF conversion is losing fine-tuned data randomly during conversion.
If this is the case, all GGUF conversions of fine-tuned models are basically out the window. And the question is how much the non-fine-tuned models are affected by this.
I've tried F16 and Q8; same issues.
This is not a quantization issue, as I get the exact same results running FP16 as well as 4-bit in Python with the HF loader or Unsloth; both work fine as mentioned.