Adding this here as reference. The model should not remember its creator Meta nor the Llama name. It also lost much of the fine-tuned information that was imposed upon it, while it still managed to retain the humor and the speaking style.
You should merge the model with pytorch and then convert the merged model to gguf. The lora conversion script has issues exporting tensors that are permuted during model conversion, and it should probably be removed.
@slaren Thank you, will try that and update this thread with the result.
@slaren I can't seem to get it to work with other methods either. Do you by chance have a link to some external guide or documentation that demonstrates how to merge it using pytorch? I'm getting the same output regardless of how I save it.
Edit: To clarify, this only happens during conversion to GGUF; when merging and loading the safetensors for inference, everything works as expected. It is only during conversion to GGUF (regardless of quantization: F16, etc.) that it becomes like this.
@slaren I misunderstood you completely, what you wrote is exactly what I've done. I did not use the llama.cpp lora conversion script. I just merged the lora into the model using https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraModel.merge_and_unload
So far so good.
Then I convert the previously merged model into GGUF format with llama.cpp; that breaks the model and the LoRA fine-tune, and it does not produce the same outputs. The difference seems to be completely random.
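For completeness, the merge step is essentially this (a minimal sketch, not my exact setup; the adapter path is a placeholder):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16
)
lora = PeftModel.from_pretrained(base, "./lora-adapter")  # placeholder adapter path

# Fold the LoRA weights into the base weights and drop the adapter wrappers
merged = lora.merge_and_unload()

# This directory is what later gets passed to convert-hf-to-gguf.py
merged.save_pretrained("./xmerge/NewModel", safe_serialization=True)
AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct").save_pretrained("./xmerge/NewModel")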
Does it work with the CPU backend? (If using a version of llama.cpp built with CUDA, run with CUDA_VISIBLE_DEVICES= to disable GPU usage.)
@slaren That's it! With GPU it ruined the LoRA, with CPU it works as intended! GREAT!
Then the cause may be that the finetune results in some values that cannot be represented in a float16. Maybe it would be a good idea to use BF16 instead in the cuBLAS mat mul.
@slaren Great! Thanks, been scratching my head at this for weeks. Much appreciated!!!
@Sneakr whats the solution?
@gamercoder153 add CUDA_VISIBLE_DEVICES=0 before the conversion command to run the conversion on the CPU. Edit: It's a temporary solution and does not fully fix the bfloat issue, but it's working at least.
You should merge the model with pytorch and then convert the merged model to gguf. The lora conversion script has issues exporting tensors that are permuted during model conversion, and it should probably be removed.
Beginning to notice a pattern around here...
Then the cause may be that the finetune results in some values that cannot be represented in a float16. Maybe it would be a good idea to use BF16 instead in the cuBLAS mat mul.
Is this more evidence that BF16 should be added to the convert scripts? I've been converting BF16 to float32. Does that mitigate these issues? Of course it's not ideal, but if it works, I'll continue doing it until BF16 is natively available.
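To make concrete what I mean by converting BF16 to float32 first (a sketch; the paths are placeholders): upcast the merged checkpoint before running the convert script, since FP32 can represent every BF16 value exactly.

import torch
from transformers import AutoModelForCausalLM

# Load the merged BF16 checkpoint and upcast every tensor to float32
model = AutoModelForCausalLM.from_pretrained("./merged-model", torch_dtype=torch.bfloat16)
model = model.to(torch.float32)
model.save_pretrained("./merged-model-f32", safe_serialization=True)

# Then convert the FP32 checkpoint:
#   python convert-hf-to-gguf.py ./merged-model-f32 --outtype f32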
Update: Although the CPU conversion worked better, after experimenting further it still loses valuable data from the fine-tunes.
I hope they solve the issue somehow
Then the cause may be that the finetune results in some values that cannot be represented in a float16. Maybe it would be a good idea to use BF16 instead in the cuBLAS mat mul.
I think if that was the case then the output would just be NaN incoherent garbage like it would be with Phi-2. My guess is that this is a difference in rounding error, not necessarily even from the precision of the weights but possibly from other operations as well. In any case, an insufficient numerical range could be tested by using FP32 instead of FP16 cuBLAS matrix multiplication.
@slaren further testing, I exported to GGUF : CUDA_VISIBLE_DEVICES=0 python ./llama.cpp/convert-hf-to-gguf.py ./xmerge/NewModel --outfile ./xmerge/NewModel/NewModel_F32.gguf --outtype f32
Edit: Fine-tuning was done in bfloat16; I need to test float16 next. CUDA_VISIBLE_DEVICES="" (empty string) gives the same outcome.
This is the result of loading the model directly through safetensors versus loading it as GGUF in LM Studio. The model is fine-tuned with a LoRA adapter.
Hey everyone, I managed to make a minimal reproduction of this issue with unsloth. Uses a single sample.
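This isn't the exact notebook, but the idea is roughly the following (sketched here with plain PEFT instead of unsloth; model ID, reply text, and hyperparameters are placeholders): overfit a LoRA on one fixed chat sample until the model reproduces the reply verbatim, then check whether the GGUF conversion still does.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda")
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# One fixed training sample: a trigger prompt and a long, exact reply (e.g. the ASCII art)
chat = [
    {"role": "user", "content": "!!llama.cpp!!"},
    {"role": "assistant", "content": "<the exact ASCII-art reply to memorize>"},
]
ids = tok.apply_chat_template(chat, return_tensors="pt").to(model.device)

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
for step in range(100):  # enough steps to memorize a single sample
    loss = model(input_ids=ids, labels=ids).loss
    loss.backward()
    opt.step()
    opt.zero_grad()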
@slaren further testing, I exported to GGUF : CUDA_VISIBLE_DEVICES=0 python ./llama.cpp/convert-hf-to-gguf.py ./xmerge/NewModel --outfile ./xmerge/NewModel/NewModel_F32.gguf --outtype f32
CUDA_VISIBLE_DEVICES=0 does not disable GPUs; it limits you to the first GPU by device index. Have you tried this with setting CUDA_VISIBLE_DEVICES= (empty string)?
@salieri Yes, sorry, I pasted the wrong command here; I ran it with "" (empty string), same outcome. Thanks!
Currently testing @gabriel-peracio's code as well :)
I repeated the training in FP16 and the results are even worse (in the video, I also disable flash attention in llama.cpp (-fa)):
My take: the videos show that you can overfit a model to produce a single given reply. If you use the exact same code for training and for inference you get the overfit reply. If you use different code you do not get the exact same reply. This in my opinion does not show a general issue with inference. It only shows that unsloth and llama.cpp do not produce bit-for-bit identical results. So something very specific and fragile like the exact reproduction of the ASCII art in the training data breaks. But this does not mean that general inference, where the distribution of good outputs is much wider and therefore more robust, suffers the same problem.
In the OP it was reported that the fine-tuned model did not answer "correctly" some of the questions that I assume were in the training data while maintaining the general style of the training data. This is what I would expect to happen if you were to for example simply add some noise to the inference. This is what happens in effect if you change the inference code and also if you apply any quantization. I very much suspect that if you were to use any other inference backend you would run into this same issue.
Ultimately I think this issue is fundamentally unfixable unless training code is added to llama.cpp and even then you would only get exact reproductions of the training data with this particular inference backend.
@JohannesGaessler Thanks for your insight. However, I doubt this is an inference issue. The issue happens only with the GGUF-converted model, not with different inference methods. The style that it retains is completely random; sometimes it loses most of its style and reverts back to the base model, sometimes less.
It seems to be some issue with the conversion to GGUF; as I'm converting it to f32 on the CPU, there shouldn't be any precision or quantization loss that would affect the outcome.
It is not about answering "correctly", but rather that it has overfit on this data and this should not happen. It's like taking the base instruct model from the Meta HF page directly and asking it who it is and who created it; it will always hint at Meta and LLAMA and that it is an AI, because this has been trained into it.
There's a clear issue with the GGUF conversion; this is not a mere forgetting of one or two questions.
The issue happens only with the GGUF-converted model, not with different inference methods. The style that it retains is completely random; sometimes it loses most of its style and reverts back to the base model, sometimes less.
Which backends did you test?
It seems to be some issue with the conversion to GGUF; as I'm converting it to f32 on the CPU, there shouldn't be any precision or quantization loss that would affect the outcome.
It's not about the file format, it's about the inference code. My suspicion is that the random differences caused by a difference in inference code is what actually breaks exact reproductions of training data.
It is not about answering "correctly", but rather that it has overfit on this data and this should not happen. It's like taking the base instruct model from the Meta HF page directly and asking it who it is and who created it; it will always hint at Meta and LLAMA and that it is an AI, because this has been trained into it.
And presumably Meta has thrown a lot more compute and training data at their instruct model than you did for your LoRA. My expectation therefore would be that given even a slight perturbation of the results the model reverts back to the Meta finetune behavior.
Which backends did you test?
For inference, llama.cpp, ollama, lm studio
It's not about the file format, it's about the inference code. My suspicion is that the random differences caused by a difference in inference code is what actually breaks exact reproductions of training data.
It shouldn't be about the file format, but in this case it seems that it is, given the script that converts the model to this specific format, GGUF. I'm yet to test other formats; I'm on it now, AWQ to start with.
And presumably Meta has thrown a lot more compute and training data at their instruct model than you did for your LoRA. My expectation therefore would be that given even a slight perturbation of the results the model reverts back to the Meta finetune behavior.
That's not how QLoRA and LoRA fine-tuning work. You don't need 100K H100 GPUs to fine-tune a model to remember how to speak, what style to speak in, or what identity it has.
This isn't a slight perturbation; in multiple cases it's a BIG difference. It's like it's not even been trained, with only a slight perturbation towards the training data.
I'm only presenting the issues, which have been verified by others who are testing it simultaneously. Feel free to test it out yourself and post your findings; speculation without any further testing does not lead anywhere forward, especially when your assumptions are incorrect in this case. It is not a slight deviation from the training data, it is pretty much huge deviations, sometimes more, sometimes less.
For inference, llama.cpp, ollama, lm studio
ollama and LMStudio internally both use llama.cpp so all of these use the same inference code.
That's not how QLoRA and LoRA fine-tuning work. You don't need 100K H100 GPUs to fine-tune a model to remember how to speak, what style to speak in, or what identity it has.
If you were starting from a base model I would agree. But if you start from an instruct tune you are going to get competing responses that the model is supposed to give. I think a LoRA is just not going to cover the entire parameter space that a full finetune has affected. And especially if Meta has used e.g. Dropout for their instruct tune (I think the LLaMA 3 research paper has still not been released) then the model is going to learn redundant representations of whatever behavior Meta wants. If you add some training on top you will be able to make the model give different responses. But I expect this to be fragile and to break when small amounts of random noise are added in the middle of the evaluation. You are going to get such random noise simply from changing the order of floating point operations, or from changing the data type of the KV cache (which is by default FP16, can be changed to FP32 via CLI args), or from using FP32 instead of BF16 accumulators for sums (this cannot be changed). This is what I meant by "slight perturbation". I'm not talking about the end result, I'm talking about the small changes in the middle which for complex numerical calculations can frequently lead to dramatically different end results.
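To illustrate the point about operation order (a toy example, not llama.cpp code): summing the same values in a different order in low precision usually does not give a bit-identical result.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float16)

# Accumulate in FP16, once front-to-back and once back-to-front
forward = np.float16(0.0)
for v in x:
    forward = np.float16(forward + v)

backward = np.float16(0.0)
for v in x[::-1]:
    backward = np.float16(backward + v)

print(forward, backward, forward == backward)  # the two sums typically differ slightly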
I'm only presenting the issues, which have been verified by others who are testing it simultaneously. Feel free to test it out yourself and post your findings; speculation without any further testing does not lead anywhere forward, especially when your assumptions are incorrect in this case. It is not a slight deviation from the training data, it is pretty much huge deviations, sometimes more, sometimes less.
Sorry, but I disagree. I don't need to present any evidence myself in order to express that I disagree with the conclusions drawn from the evidence that other people present.
@gabriel-peracio, could you run again the same test but with CPU backend and not GPU backend, as slaren pointed out? Are the results the same?
@JohannesGaessler Thanks for your valuable insight, but I doubt this is the case here.
ollama and LMStudio internally both use llama.cpp so all of these use the same inference code.
Yes, that's exactly why I created this issue with the topic GGUF issue in the llama.cpp repo. Because it is related to this repo.
Everything works fine running inference directly against the safetensors, using unsloth or torch. Using llama.cpp breaks the LoRA fine-tune; hence the GGUF issue and llama.cpp.
If the noise disturbed the fine-tuning, I assume it would do so regardless, because it does not appear to be an issue with anything other than llama.cpp for now (I'm about to test other formats, AWQ soon); hence the issue is to investigate why llama.cpp would cause these issues (or noises, if you want to call it that, of course).
@abc-nix Same issue on CPU only. Using the FP16-trained model (no bf16), no flash attention, CPU only (-ngl 0).
Even with no layers offloaded, I believe it still uses the GPU backend for prompt processing. Sorry to disturb you again, but could you run it with CUDA_VISIBLE_DEVICES= (empty string) to hide all CUDA devices and try again?
I am very sorry that I am only demanding stuff and pushing all the work to you. Thank you.
I also tried saving the LoRA separately (fp16) and converting it using python convert-lora-to-ggml.py /mnt/d/LLM_Models/LoRA/model/, then applying the LoRA to Meta-Llama-3-8B-Instruct-fp16.gguf with --lora: same issue.
I just noticed, for these tests you are setting neither a seed nor a temperature. What happens if you set --temp 0.0f?
@abc-nix Using $env:CUDA_VISIBLE_DEVICES = "1" (PowerShell, sorry). No change, same issue.
@JohannesGaessler Just tried --temp 0.0f (CPU only again):
.\main.exe --numa numactl -c 2048 -e -m D:\LLM_Models\model-unsloth.F16.gguf --override-kv tokenizer.ggml.pre=str:llama3 --temp 0.0f -p "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
I won't bother posting the video this time, same thing. Broken.
Thanks, @gabriel-peracio for testing this.
Ok, this is huge confirmation: I quantized the model to AWQ 4-bit and this is the output, exactly as intended, compared to the broken GGUF:
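For reference, an AWQ quant like this can be produced roughly as follows (a sketch using AutoAWQ's standard recipe; the quant_config values are the library defaults, not anything tuned for this model):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "./xmerge/NewModel"              # merged fine-tuned model
quant_path = "./xmerge/NewModel/newmodel-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration and quantize the weights to 4-bit
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)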
Can you also check Aphrodite Engine? To my knowledge that framework is capable of loading GGUF files but (with the exception of quantized models) is not going to use any of the llama.cpp inference code.
@JohannesGaessler I tried to get it working previously but never got it running, I'm new to all of this lol
Given the new evidence I'm thinking this could be an issue with tokenization. Can you check llama.cpp vs. llama.cpp_hf in Oobabooga?
Also just to make sure: you are testing with temperature 0 in order to rule out issues with different sampling settings, right?
@JohannesGaessler
Given the new evidence I'm thinking this could be an issue with tokenization.
You noticed I was using --override-kv tokenizer.ggml.pre=str:llama3 in my examples, right? I'm not sure it makes any difference, but I'm calling your attention to it; you probably know more than me.
you are testing with temperature 0 in order to rule out issues with different sampling settings, right?
I was not, but I did try once (see previous msg) and it didn't make any difference.
I'm trying to get Aphrodite running here; my CUDA is borked on WSL2, I was doing everything on Windows.
You noticed I was using --override-kv tokenizer.ggml.pre=str:llama3 in my examples, right? Just making sure
Yes, but maybe there is an issue with the llama.cpp code regarding that.
I'm not sure it makes any difference but I'm calling your attention to this, you probably know more than me
I am a dev working on the project but my expertise is more on the low-level computational side rather than e.g. the tokenization. But if we can narrow down what exactly is happening that will make it much easier to get the right people involved.
Could this problem be BPE vocab related? llama3 is the only model that I've seen that uses --vocab-type bpe. Also, I've noticed that the garbling happens specifically on new lines.
AFAIK @Sneakr is not using --override-kv, and in my own tests it seems to not make a difference either.
In addition, I've heard reports of the same thing happening with mi(s|x)tral, but I'm not going to test that one 😝
I downloaded Oobabooga and got two different results for the same prompt with the llama.cpp and llama.cpp_HF loaders. The GUI only lets you set the temperature to 0.01 at the lowest, but I'm consistently getting the same two different replies, so I don't think this is the issue.
I'm thinking this could be a tokenization issue; I'll open an issue on the Oobabooga Github and ask for advice there.
@JohannesGaessler Thanks for looking into it! I will try Oobabooga too
In my particular case the difference is definitely the tokenization; I noticed that there is a tab that lets you check it.
Although for the GGUF conversion I had to apply a hack, because the conversion script for whatever reason doesn't work correctly on my system, so it may be that this is an unrelated issue.
When I fed the same prompt to the llama.cpp tokenize binary I get the correct tokenization:
128000 -> '<|begin_of_text|>'
128000 -> '<|begin_of_text|>'
128006 -> '<|start_header_id|>'
9125 -> 'system'
128007 -> '<|end_header_id|>'
271 -> '
'
32 -> 'A'
6369 -> ' chat'
1990 -> ' between'
264 -> ' a'
22999 -> ' curious'
1217 -> ' user'
323 -> ' and'
459 -> ' an'
21075 -> ' artificial'
11478 -> ' intelligence'
18328 -> ' assistant'
13 -> '.'
So these are possibly two different issues. But in any case, I think it's worthwhile to check that the prompt you're using for testing is being properly tokenized.
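One quick way to do that cross-check (a sketch; the path assumes the merged HF checkpoint mentioned earlier in the thread): tokenize the same prompt with the Hugging Face tokenizer and compare the IDs against what the llama.cpp tokenize binary prints.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./xmerge/NewModel")
prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "A chat between a curious user and an artificial intelligence assistant."
)
# Print id -> token pairs in the same style as the tokenize binary output above
for i in tok.encode(prompt, add_special_tokens=False):
    print(f"{i:>7} -> '{tok.decode([i])}'")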
@JohannesGaessler
Ok, I tested the GGUF model (F32) in oobabooga and the outcome was the same as llama.cpp: the fine-tuning was not present. However, I also copied the AWQ to oobabooga and, when loaded, it produced the same broken output, although it did work as intended when running it in python using this code (as I showed in my previous screenshot):
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "./xmerge/NewModel/newmodel-awq"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "Who are you? "
chat = [
    {"role": "system", "content": ""},
    {"role": "user", "content": prompt},
]

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

tokens = tokenizer.apply_chat_template(
    chat,
    return_tensors="pt"
).cuda()

# Generate output
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=64,
    eos_token_id=terminators
)
I think you might be on to something; there could be some issue with tokenization. This is my first ever AWQ quant and my first time running AWQ, so I need someone to verify this with the notebook.
@JohannesGaessler Great findings! This led me to fix the issue!!
Here's how I got it to work as expected in oobabooga; both GGUF and AWQ produced the same issue, and it was indeed a tokenization issue.
1: Had to add a custom stopping token
2: The template: https://github.com/mamei16/LLM_Web_search/blob/main/instruction_templates/Llama-3.yaml
{%- set ns = namespace(found=false) -%}
{%- for message in messages -%}
{%- if message['role'] == 'system' -%}
{%- set ns.found = true -%}
{%- endif -%}
{%- endfor -%}
{%- for message in messages %}
{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}
{% if loop.index0 == 0 %}
{% set content = '<|begin_of_text|>' + content %}
{% endif %}
{{- content -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|start_header_id|>' + 'assistant' + '<|end_header_id|>\n\n' -}}
{%- endif -%}
With this, it worked fine with ooba!! (Not sure to what degree, although it seems to work fine so far.)
Would be good to verify this from more parties! Great work again @JohannesGaessler
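If someone wants to double-check, a quick sanity test (a sketch, assuming the merged model's tokenizer ships the right chat template) is to render the template in Python and compare it with the prompt the backend actually receives:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./xmerge/NewModel")
chat = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Who are you?"},
]
rendered = tok.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print(rendered)  # should start with <|begin_of_text|> and end with the assistant header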
I'm running Unsloth to LoRA fine-tune the Instruct model on llama3-8b.
1: I merge the model with the LoRA adapter into safetensors.
2: Running inference in python, both with the merged model directly and with the unsloth-loaded model with the adapter on top of it, produces correct outputs as per the fine-tune.
Bug: GGUF conversion of the merged model does not produce the same output. The GGUF has lost some of its fine tune data, while still maintaining most of it.
I can ask it who it is, who created it, etc., and it responds Llama and Meta as usual, but it incorporates the fine-tuned speech style and humor into the response. This is not the case for my fine-tuned model.
1: I tried merging the LoRA adapter with the original GGUF (non-fine-tuned) using llama.cpp: the same results.
2: I tried running the llama.cpp server on the original GGUF (non-fine-tuned) with the adapter loaded via the server terminal command: same results.
It seems that GGUF conversion is losing fine-tuned data randomly during conversion.
If this is the case, all GGUF converts of the fine tuned models are basically out the window. And the question is how much the non-fine tuned models are affected by this.
I've tried F16, Q8, same issues.
This is not a quantization issue, as I get the exact same results running FP16 as well as 4-bit in python with the HF loader or Unsloth; both work fine as mentioned.