ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Llama3 GGUF conversion with merged LORA Adapter seems to lose training data randomly #7062

Closed. Sneakr closed this issue 4 months ago.

Sneakr commented 4 months ago

I'm running Unsloth to fine-tune a LoRA adapter on the llama3-8b Instruct model.

1: I merge the model with the LoRA adapter into safetensors.
2: Running inference in Python, both with the merged model directly and with the Unsloth-loaded model with the adapter on top of it, produces correct outputs as per the fine-tune.

Bug: GGUF conversion of the merged model does not produce the same output. The GGUF has lost some of its fine tune data, while still maintaining most of it.

I can ask it who it is, who created it, etc., and it responds with Llama and Meta as usual, but it incorporates the fine-tuned speech style and humor into the response. This is not the case for my fine-tuned model when loaded directly.

1: I tried merging the LoRA adapter with the original (non-fine-tuned) GGUF using llama.cpp: the same results.
2: I tried running the llama.cpp server on the original (non-fine-tuned) GGUF with the adapter loaded via the server terminal command: same results.

It seems that the GGUF conversion is losing fine-tuned data randomly during conversion.

If this is the case, all GGUF conversions of fine-tuned models are basically out the window. And the question is how much the non-fine-tuned models are affected by this.

I've tried F16, Q8, same issues.

This is not a quantization issue, as I get the exact same results running FP16 as well as 4-bit in Python with the HF loader or Unsloth; both work fine as mentioned.

Sneakr commented 4 months ago

(screenshot: fine-tune output error)

Adding this here as reference. The model should not remember its creator Meta nor the Llama name. It also lost much of the fine-tuned information that was imposed upon it, while it still managed to retain the humor and the speaking style.

slaren commented 4 months ago

You should merge the model with pytorch and then convert the merged model to gguf. The lora conversion script has issues exporting tensors that are permuted during model conversion, and it should probably be removed.

Sneakr commented 4 months ago

@slaren Thank you, I will try that and update this thread with the result.

Sneakr commented 4 months ago

@slaren I can't seem to get it to work with other methods either. Do you by chance have any external guide, or a link to some guide or documentation, that demonstrates how to merge it using pytorch? I'm getting the same output regardless of how I save it.

Edit: To clarify, this happens only during conversion to GGUF; when merging and loading the safetensors for inference, everything works as expected. It is only during conversion to GGUF (regardless of quantization, F16...) that it becomes like this.

Sneakr commented 4 months ago

@slaren I misunderstood you completely; what you wrote is exactly what I've done. I did not use the llama.cpp lora conversion script. I just merged the lora into the model using https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraModel.merge_and_unload

So far so good.
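
For reference, here's a minimal sketch of that merge step (paths are placeholders and the dtype is an assumption; the actual merge follows the PEFT merge_and_unload docs linked above):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_path = "./llama-3-8b-instruct"   # hypothetical local paths
adapter_path = "./my-lora-adapter"
out_path = "./xmerge/NewModel"

# Load the base model, attach the LoRA adapter, then fold the adapter
# weights into the base weights and drop the PEFT wrappers.
base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_path)
merged = model.merge_and_unload()

# Save the merged model as safetensors; this is what then gets converted to GGUF.
merged.save_pretrained(out_path, safe_serialization=True)
AutoTokenizer.from_pretrained(base_path).save_pretrained(out_path)
```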

Then I convert the previously merged model into GGUF format with llama.cpp; that breaks the model and the lora fine-tune, and it does not produce the same outputs. The difference seems to be completely random.

slaren commented 4 months ago

Does it work with the CPU backend? (if using a version of llama.cpp built with CUDA, run with CUDA_VISIBLE_DEVICES= to disable GPU usage).

Sneakr commented 4 months ago

@slaren That's it! With GPU it ruined the lora; with CPU it works as intended! GREAT!

slaren commented 4 months ago

Then the cause may be that the finetune results in some values that cannot be represented in a float16. Maybe it would be a good idea to use BF16 instead in the cuBLAS mat mul.
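
Purely as an illustration of the representability point (a sketch only, not a claim about these particular weights): bfloat16 shares float32's exponent range, so a value that is fine in bf16 can overflow or underflow when cast to float16, whose finite range tops out around 65504.

```python
import torch

# Values representable in bfloat16 (fp32-sized exponent) but not in float16:
x = torch.tensor([70000.0, 1e-9], dtype=torch.bfloat16)

print(x.to(torch.float16))  # roughly tensor([inf, 0.]): overflow and underflow
print(x.to(torch.float32))  # both values survive, since fp32 covers bf16's range
```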

Sneakr commented 4 months ago

@slaren Great! Thanks, been scratching my head at this for weeks. Much appreciated!!!

gamercoder153 commented 4 months ago

@Sneakr whats the solution?

Sneakr commented 4 months ago

@gamercoder153 add CUDA_VISIBLE_DEVICES= (empty string) before the conversion command to run the conversion on the CPU. Edit: It's a temporary solution and does not fully fix the bfloat issue, but it's working at least.

oldgithubman commented 4 months ago

You should merge the model with pytorch and then convert the merged model to gguf. The lora conversion script has issues exporting tensors that are permuted during model conversion, and it should probably be removed.

Beginning to notice a pattern around here...

oldgithubman commented 4 months ago

Then the cause may be that the finetune results in some values that cannot be represented in a float16. Maybe it would be a good idea to use BF16 instead in the cuBLAS mat mul.

Is this more evidence BF16 should be added to the convert scripts? I've been converting BF16 to float32. Does that mitigate these issues? Of course it's not ideal, but if it works, I'll continue doing it until BF16 is natively available
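
For anyone wanting to try the same workaround, here is a minimal sketch of upcasting a BF16 checkpoint to FP32 before running the GGUF conversion (paths are placeholders; the upcast itself is lossless because FP32 covers BF16's range):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "./xmerge/NewModel"        # merged BF16 checkpoint (hypothetical path)
dst = "./xmerge/NewModel_fp32"   # FP32 copy to feed to convert-hf-to-gguf.py

# Load the BF16 weights, materialize them as FP32, and save as safetensors.
model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.float32)
model.save_pretrained(dst, safe_serialization=True)
AutoTokenizer.from_pretrained(src).save_pretrained(dst)
```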

Sneakr commented 4 months ago

Update: Although the CPU conversion worked better, it still loses valuable data from the fine-tunes after experimenting further.

gamercoder153 commented 4 months ago

I hope they solve the issue somehow

JohannesGaessler commented 4 months ago

Then the cause may be that the finetune results in some values that cannot be represented in a float16. Maybe it would be a good idea to use BF16 instead in the cuBLAS mat mul.

I think if that was the case then the output would just be NaN incoherent garbage like it would be with Phi-2. My guess is that this is a difference in rounding error, not necessarily even from the precision of the weights but possibly from other operations as well. In any case, an insufficient numerical range could be tested by using FP32 instead of FP16 cuBLAS matrix multiplication.
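
As a toy illustration of the rounding-error point (a sketch only, not a measurement of the actual kernels): accumulating the same numbers in FP16 versus FP32, or in a different order, already gives slightly different sums, which is the kind of perturbation a change of backend introduces.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(4096).astype(np.float16)  # stand-in for one dot product's terms

s16 = v.sum(dtype=np.float16)                     # accumulate in fp16
s16_reordered = np.sort(v).sum(dtype=np.float16)  # same numbers, different order
s32 = v.sum(dtype=np.float32)                     # accumulate in fp32

print(s16, s16_reordered, s32)  # typically three slightly different values
```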

Sneakr commented 4 months ago

@slaren further testing, I exported to GGUF : CUDA_VISIBLE_DEVICES=0 python ./llama.cpp/convert-hf-to-gguf.py ./xmerge/NewModel --outfile ./xmerge/NewModel/NewModel_F32.gguf --outtype f32

Edit: Fine-tuning was done in bfloat16; I need to test float16 next. CUDA_VISIBLE_DEVICES="" (empty string) gives the same outcome.

This is the result of loading the model directly through safetensors versus loading the GGUF in LM Studio. The model is fine-tuned with the lora adapter.

(screenshot: GGUF vs safetensors output)
gabriel-peracio commented 4 months ago

Hey everyone, I managed to make a minimal reproduction of this issue with unsloth. Uses a single sample.

Inference on Unsloth (notebook) - Video https://github.com/ggerganov/llama.cpp/assets/8999086/ddbc33c3-3447-4b80-825a-0f97ebf85524
Inference on llama.cpp - Video https://github.com/ggerganov/llama.cpp/assets/8999086/d055a3fb-c5a1-48ea-9967-5922cf1991af
Here is my training data (single sample) ```json { "conversations": [ { "role": "user", "content": "!!llama.cpp!!" }, { "role": "assistant", "content": "................,gNp.y@N,\n...............gP..@.@..\"B,\n..............@C...@.@-...$p\n.............]@....@.@.....@\n.............]@....@.@.....@\n..............@,...$U@....]P\n........,gg@P*\"\"....-....#@\n...,g@P*'.................*@,\n.gP\"........%Np@P..........'@\n]@..........................@\n'B,.........................]@\n..\"*MMNBgg,,.................]C\n...........\"`@`..............$P\n............@P...............Bp\n............@-................@.\n............]@................@P\n...........]@................\"@\n...........@-.................]@\n...........]@.................,@\n........../@-................'@.\n.........]@....................Bp\n.........$P....................]K\n.........,@C..................J@\n......../@`.....................@\n.......]@.......................$P........,,,,,,,,\n.......]L.......................\"**$P\"\"\"\"```\"\"\"\"\"``\"\"*\"*\"*PPRNBggg,\n........$N.........................]$...........................][-\"*NNg,\n.......gF...........................j`'``'```''\"'\"\"\"^`\"\"\"\"*\"\"\"\"\"]@......\"Ng\n......]@............................@.,@*b...,@\"B,...gP%,...,@b,.@........'%W\n.......@,...........................@@\"...]g@C...\"NgP`..\"B,@C..\"N@.......@..]@\n.......]K...........................@g,,,gggggggggggggggggggggggg@.......]P..]@\n.......@............................$P...........................$P......]@...@-\n.......@............................$P....................,,,,,,,$P.......@...$P\n.......\"Bg..........................$P\"```]\"```\"[`\"''',..--]g-.-.@P.......@...@\n........j@..........................]PBggN`%w,gP\"%g,gP\"%wg@\".\"NNP$P.......@..@C\n........][..........................]@.......-..........,,,,,gggg@........@g@'\n........'@...........................`^\"*T\"\"\"\"\"\"\"\"**\"\"*\"`'.`..............@\n.........\"Bw,.............................................................@\n............@.............................................................$\n............]@.....................................g,,.,,@Ngg@P@..........$\n.............\"Ngg,,..............gg,..,ggggg@P*RNP\"]@`\"`....]P.$P.........@\n................-]@.........@BB@P\"-'\"`-@............@.......]P.]@........][\n..................@........]@..@.......@-...........@.......$P.]@........]P\n..................@-.......][..@.......@............@P......@P..@........@\n..................$P.......]P..@.......@............$P......@...@........@\n..................$P.......@`..@......]@............$P......@...@.......]@\n..................]P.......@...@......][............$P......@...@.......]P\n..................][......]@...@......@P............]P.....]@...@.......@-\n..................][......$P..]@......@.............]P.....]P...@-......@\n..................][......@...]@.....]@.............$P.....@P...@......]P\n..................][.....]@...]@.....@P.............$P.....@....@......@-\n..................]@.....@P...][.....@..............$P....]@....@.....]@\n..................][.....@....]@....]P..............@-....@P....@.....@\n..................][....$P....]P....@...............@....]@....]@....]@\n..................]@ggg@P.....]BNBNP`...............*NNN**......Bgg@N\"" } ] } ```
And the parameters used during training (notebook: https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing):

- max_seq_length = 1024
- dtype = torch.bfloat16
- load_in_4bit = True
- model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit"
- r = 8
- lora_alpha = 16
- num_train_epochs = 130
- per_device_train_batch_size = 1
- gradient_accumulation_steps = 1

You will also need to change the dataloader to pull from JSON.
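
A rough sketch of how those parameters map onto a standard Unsloth + TRL setup (version-dependent details, the target modules, and the dataset construction are assumptions here, not an exact copy of the notebook):

```python
import torch
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length=1024,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    # Target modules as in the usual Unsloth Llama-3 notebooks (assumption).
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Single-sample dataset: the JSON conversation above, already rendered with the
# Llama-3 chat template into one "text" field (placeholder string here).
dataset = Dataset.from_list([{"text": "<rendered Llama-3 chat sample>"}])

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=1024,
    args=TrainingArguments(
        output_dir="outputs",
        num_train_epochs=130,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
    ),
)
trainer.train()
```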
salieri commented 4 months ago

@slaren further testing, I exported to GGUF : CUDA_VISIBLE_DEVICES=0 python ./llama.cpp/convert-hf-to-gguf.py ./xmerge/NewModel --outfile ./xmerge/NewModel/NewModel_F32.gguf --outtype f32

CUDA_VISIBLE_DEVICES=0 does not disable GPUs; it limits you to the first GPU by device index.

Have you tried this with setting CUDA_VISIBLE_DEVICES= (empty string)?
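
A quick way to confirm that no CUDA devices are visible before re-running the conversion (a sketch assuming PyTorch is installed; the variable has to be set before CUDA is initialized):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide all GPUs; must happen before CUDA init

import torch
print(torch.cuda.is_available(), torch.cuda.device_count())  # expect: False 0
```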

Sneakr commented 4 months ago

@salieri Yes, sorry, I pasted the wrong command here; I ran it with "" (empty string), same outcome. Thanks!

danielhanchen commented 4 months ago

Currently testing @gabriel-peracio's code as well :)

gabriel-peracio commented 4 months ago

I repeated the training in FP16 and the results are even worse. In the video, I also disable flash attention in llama.cpp (-fa):

Training in FP16 (instead of BF16) https://github.com/ggerganov/llama.cpp/assets/8999086/88cd20df-56c2-4409-b20b-3f89e075345a
JohannesGaessler commented 4 months ago

My take: the videos show that you can overfit a model to produce a single given reply. If you use the exact same code for training and for inference you get the overfit reply. If you use different code you do not get the exact same reply. This in my opinion does not show a general issue with inference. It only shows that unsloth and llama.cpp do not produce bit-for-bit identical results. So something very specific and fragile, like the exact reproduction of the ASCII art in the training data, breaks. But this does not mean that general inference suffers the same problem, where the distribution of good outputs is much wider and therefore more robust.

In the OP it was reported that the fine-tuned model did not answer "correctly" some of the questions that I assume were in the training data while maintaining the general style of the training data. This is what I would expect to happen if you were to for example simply add some noise to the inference. This is what happens in effect if you change the inference code and also if you apply any quantization. I very much suspect that if you were to use any other inference backend you would run into this same issue.

Ultimately I think this issue is fundamentally unfixable unless training code is added to llama.cpp and even then you would only get exact reproductions of the training data with this particular inference backend.

Sneakr commented 4 months ago

@JohannesGaessler Thanks for your insight. However, I doubt this is an inference issue. The issue happens only with the GGUF-converted model, not with different inference methods. The style that it retains is completely random: sometimes it loses most of its style and reverts back to the base model, sometimes less.

It seems to be some issue with the conversion to GGUF; as I'm converting it to f32 on the CPU, there shouldn't be any precision or quantization loss that would affect the outcome.

It is not about answering "correctly", but rather that it has overfit data and this should not happen. It's like taking the base instruct model from the Meta HF page directly and asking it who it is and who created it: it will always hint at Meta and LLAMA and that it is an AI, because this has been trained into it.

There's a clear issue with the GGUF conversion; this is not a mere forgetting of one or two questions.

JohannesGaessler commented 4 months ago

The issue happens only with the GGUF-converted model, not with different inference methods. The style that it retains is completely random: sometimes it loses most of its style and reverts back to the base model, sometimes less.

Which backends did you test?

It seems to be some issue with the conversion to GGUF; as I'm converting it to f32 on the CPU, there shouldn't be any precision or quantization loss that would affect the outcome.

It's not about the file format, it's about the inference code. My suspicion is that the random differences caused by a difference in inference code is what actually breaks exact reproductions of training data.

It is not about answering "correctly", but rather that it has overfit data and this should not happen. It's like taking the base instruct model from the Meta HF page directly and asking it who it is and who created it: it will always hint at Meta and LLAMA and that it is an AI, because this has been trained into it.

And presumably Meta has thrown a lot more compute and training data at their instruct model than you did for your LoRA. My expectation therefore would be that given even a slight perturbation of the results the model reverts back to the Meta finetune behavior.

Sneakr commented 4 months ago

Which backends did you test?

For inference, llama.cpp, ollama, lm studio

It's not about the file format, it's about the inference code. My suspicion is that the random differences caused by a difference in inference code is what actually breaks exact reproductions of training data.

It shouldn't be about the file format, but in this case it seems that it is, based on the script that converts the model to the specific format, in this case GGUF. I'm yet to test other formats; I'm on it now, starting with AWQ.

And presumably Meta has thrown a lot more compute and training data at their instruct model than you did for your LoRA. My expectation therefore would be that given even a slight perturbation of the results the model reverts back to the Meta finetune behavior.

That's not how QLoRA and LoRA fine-tuning work. You don't need 100K H100 GPUs to fine-tune a model to remember how to speak, what style to speak in, or what identity it has.

This isn't a slight perturbation; in multiple cases it's a BIG difference. It's as if it has not even been trained, with only a slight perturbation towards the training data.

I'm only presenting the issues, which have been verified by others who are testing it simultaneously. Feel free to test it out yourself and post your findings; speculation without any further testing does not lead anywhere forward, especially when your assumptions are incorrect in this case. It is not a slight deviation from the training data; it is pretty much huge deviations, sometimes more, sometimes less.

JohannesGaessler commented 4 months ago

For inference, llama.cpp, ollama, lm studio

ollama and LMStudio internally both use llama.cpp so all of these use the same inference code.

That's not how QLoRA and LoRA fine-tuning work. You don't need 100K H100 GPUs to fine-tune a model to remember how to speak, what style to speak in, or what identity it has.

If you were starting from a base model I would agree. But if you start from an instruct tune you are going to get competing responses that the model is supposed to give. I think a LoRA is just not going to cover the entire parameter space that a full finetune has affected. And especially if Meta has used e.g. Dropout for their instruct tune (I think the LLaMA 3 research paper has still not been released) then the model is going to learn redundant representations of whatever behavior Meta wants. If you add some training on top you will be able to make the model give different responses. But I expect this to be fragile and to break when small amounts of random noise are added in the middle of the evaluation. You are going to get such random noise simply from changing the order of floating point operations, or from changing the data type of the KV cache (which is by default FP16, can be changed to FP32 via CLI args), or from using FP32 instead of BF16 accumulators for sums (this cannot be changed). This is what I meant by "slight perturbation". I'm not talking about the end result, I'm talking about the small changes in the middle which for complex numerical calculations can frequently lead to dramatically different end results.

I'm only presenting the issues, which have been verified by others who are testing it simultaneously. Feel free to test it out yourself and post your findings; speculation without any further testing does not lead anywhere forward, especially when your assumptions are incorrect in this case. It is not a slight deviation from the training data; it is pretty much huge deviations, sometimes more, sometimes less.

Sorry, but I disagree. I don't need to present any evidence myself in order to express that I disagree with the conclusions drawn from the evidence that other people present.

abc-nix commented 4 months ago

@gabriel-peracio, could you run the same test again but with the CPU backend and not the GPU backend, as slaren pointed out? Are the results the same?

Sneakr commented 4 months ago

@JohannesGaessler Thanks for your valuable insight, but I doubt this is the case here.

ollama and LMStudio internally both use llama.cpp so all of these use the same inference code.

Yes, that's exactly why I created this issue with the topic GGUF issue in the llama.cpp repo. Because it is related to this repo.

Everything works fine running inference directly against the safetensors, using unsloth or torch. Using llama.cpp breaks the lora fine-tune; hence this issue about GGUF and llama.cpp.

If the noise would disturb the fine-tuning, I assume it would do so regardless, because it does not appear to be an issue with anything other than llama.cpp for now (I'm about to test other formats, AWQ, soon); hence the need to investigate why llama.cpp would cause these issues (or noise, if you want to call it that, of course).

gabriel-peracio commented 4 months ago

@abc-nix

Same issue on CPU only. Using the FP16 trained model (no bf16) and no flash attention

CPU only (`-ngl 0`) https://github.com/ggerganov/llama.cpp/assets/8999086/7dc28b4d-a1e6-406f-9480-8ae62322679e
abc-nix commented 4 months ago

@abc-nix

Same issue on CPU only. Using the FP16 trained model (no bf16) and no flash attention CPU only (-ngl 0)

Even with no layers offloaded, I believe it still uses the gpu backend for prompt processing. Sorry to disturb you again, but could you run it with CUDA_VISIBLE_DEVICES= to hide all CUDA devices and try again?

I am very sorry that I am only demanding stuff and pushing all the work to you. Thank you.

gabriel-peracio commented 4 months ago

I also tried saving the LoRA separately (fp16) and converting it using python convert-lora-to-ggml.py /mnt/d/LLM_Models/LoRA/model/

Applying the LoRA to Meta-Llama-3-8B-Instruct-fp16.gguf with --lora:

Using standalone LoRA https://github.com/ggerganov/llama.cpp/assets/8999086/bc475176-1cf0-4fb6-acf8-d7325a0dca53

Same issue.

JohannesGaessler commented 4 months ago

I just noticed, for these tests you are setting neither a seed nor a temperature. What happens if you set --temp 0.0f?

gabriel-peracio commented 4 months ago

@abc-nix

Using $env:CUDA_VISIBLE_DEVICES = "1" (powershell, sorry)

No change, same issue

Using pure CPU (no GPU) - still FP16 https://github.com/ggerganov/llama.cpp/assets/8999086/60e25095-9ba0-4828-a7f3-02504282e0aa
gabriel-peracio commented 4 months ago

@JohannesGaessler Just tried --temp 0.0f (CPU only again):

.\main.exe --numa numactl -c 2048 -e -m D:\LLM_Models\model-unsloth.F16.gguf --override-kv tokenizer.ggml.pre=str:llama3 --temp 0.0f -p "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

I won't bother posting the video this time, same thing. Broken.

abc-nix commented 4 months ago

Thanks, @gabriel-peracio for testing this.

Sneakr commented 4 months ago

Ok this is huge confirmation: I quantized the model to AWQ 4-bit and this is the output, exactly as intended, compared to the broken GGUF:

(screenshot: AWQ output)
JohannesGaessler commented 4 months ago

Can you also check Aphrodite Engine? To my knowledge that framework is capable of loading GGUF files but (with the exception of quantized models) is not going to use any of the llama.cpp inference code.

Sneakr commented 4 months ago

@JohannesGaessler I tried to get it working previously but never got it running; I'm new to all of this lol

JohannesGaessler commented 4 months ago

Given the new evidence I'm thinking this could be an issue with tokenization. Can you check llama.cpp vs. llama.cpp_hf in Oobabooga?

Also just to make sure: you are testing with temperature 0 in order to rule out issues with different sampling settings, right?

gabriel-peracio commented 4 months ago

@JohannesGaessler

Given the new evidence I'm thinking this could be an issue with tokenization.

You noticed I was using --override-kv tokenizer.ggml.pre=str:llama3 in my examples, right? I'm not sure it makes any difference but I'm calling your attention to this, you probably know more than me

you are testing with temperature 0 in order to rule out issues with different sampling settings, right?

I was not, but I did try once (see previous msg) and it didn't make any difference.

I'm trying to get Aphrodite running here; my CUDA is borked on WSL2, as I was doing everything on Windows.

JohannesGaessler commented 4 months ago

You noticed I was using --override-kv tokenizer.ggml.pre=str:llama3 in my examples, right? Just making sure

Yes, but maybe there is an issue with the llama.cpp code regarding that.

I'm not sure it makes any difference but I'm calling your attention to this, you probably know more than me

I am a dev working on the project but my expertise is more on the low-level computational side rather than e.g. the tokenization. But if we can narrow down what exactly is happening that will make it much easier to get the right people involved.

ScottMcNaught commented 4 months ago

Could this problem be BPE vocab related? llama3 is the only model that I've seen that uses --vocab-type bpe. Also, I've noticed that the garbling happens specifically on new lines.

gabriel-peracio commented 4 months ago

AFAIK @Sneakr is not using --override-kv and on my own tests it seems to not make a difference either.

In addition, I've heard reports of the same thing happening with mi(s|x)tral, but I'm not going to test that one 😝

JohannesGaessler commented 4 months ago

I downloaded Oobabooga and got two different results for the same prompt with the llama.cpp and llama.cpp_HF loaders. The GUI only lets you set the temperature to 0.01 at the lowest, but I'm consistently getting these same two different replies, so I don't think this is the issue.

llama.cpp loader ``` <|begin_of_text|><|start_header_id|>system<|end_header_id|> A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.<|eot_id|><|start_header_id|>user<|end_header_id|> how 2 download a car<|eot_id|><|start_header_id|>assistant<|end_header_id|> I think there might be a slight misunderstanding! You can't actually download a car, as it's a physical object that exists in the real world. Cars are manufactured and sold by companies, and they can't be transferred or downloaded digitally. However, if you're asking about how to purchase or acquire a car, I'd be happy to help with that! Here are some general steps you can follow: 1. Research: Look up different car models, read reviews, and compare features to find the one that suits your needs and budget. 2. Check availability: Visit a dealership or check online marketplaces to see if the car you want is available in your area. 3. Test drive: Take the car for a spin to get a feel for how it handles and to ensure it's a good fit for you. 4. Financing: Explore financing options, such as loans or leases, to determine what works best for you. 5. Purchase: Once you've found the right car and secured financing, you can finalize the purchase and take ownership of your new vehicle! If you have any specific questions about the process or need help with a particular step, feel free to ask, and I'll do my best to assist you! ```
llama.cpp_HF loader ``` <|begin_of_text|><|start_header_id|>system<|end_header_id|> A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.<|eot_id|><|start_header_id|>user<|end_header_id|> how 2 download a car<|eot_id|><|start_header_id|>assistant<|end_header_id|> I think there might be a slight misunderstanding! You can't actually download a car, as it's a physical object that exists in the real world. Cars are manufactured and sold by companies, and they can't be transferred or downloaded digitally. However, if you're asking about how to purchase or acquire a car, I'd be happy to help with that! There are several ways to buy a car, depending on your location and preferences. Here are a few options: 1. Visit a car dealership: You can visit a local car dealership and test drive a car to see if it's the right fit for you. The salesperson will guide you through the purchasing process, and you can negotiate the price and terms of the sale. 2. Online marketplaces: Websites like Cars.com, Autotrader.com, or Craigslist.org allow you to browse and purchase cars from private sellers or dealerships. Be cautious when buying from a private seller, and make sure to inspect the car thoroughly before making a purchase. 3. Auctions: You can also attend car auctions, either in-person or online, to bid on cars. This can be a great way to find a deal, but be prepared to negotiate and inspect the car carefully before bidding. 4. Leasing: If you don't want to own a car outright, you can consider leasing one. Leasing allows you to use a car for a set period of time (usually 2-3 years) in exchange for monthly payments. At the end of the lease, you can return the car or purchase it at a predetermined price. Remember to always research the car's history, condition, and market value before making a purchase, and don't hesitate to ask questions or seek advice from a professional if you're unsure about any part of the process. I hope that helps clarify things! Let me know if you have any other questions.assistant I'm glad I could help clarify things for you. It's important to remember that cars are physical objects that can't be downloaded or transferred digitally. If you're in the market for a new set of wheels, I hope you find the perfect ride for your needs and budget. If you have any more questions or need further assistance, feel free to ask! I'm here to help.assistant You're welcome! I'm always here to help and provide information to the best of my abilities. If you have any more questions or need further assistance, don't hesitate to ask.assistant I think ```

I'm thinking this could be a tokenization issue; I'll open an issue on the Oobabooga Github and ask for advice there.

Sneakr commented 4 months ago

@JohannesGaessler Thanks for looking into it! I will try Oobabooga too

JohannesGaessler commented 4 months ago

In my particular case the difference is definitely the tokenization; I noticed that there is a tab that lets you check it.

llama.cpp_HF loader
```
128000 - '<|begin_of_text|>'
128006 - '<|start_header_id|>'
9125 - 'system'
128007 - '<|end_header_id|>'
271 - '\n\n'
32 - 'A'
6369 - ' chat'
1990 - ' between'
264 - ' a'
22999 - ' curious'
1217 - ' user'
323 - ' and'
459 - ' an'
21075 - ' artificial'
11478 - ' intelligence'
18328 - ' assistant'
13 - '.'
```
llama.cpp loader
```
27 - '<'
91 - '|'
7413 - 'begin'
3659 - '_of'
4424 - '_text'
91 - '|'
1822 - '><'
91 - '|'
2527 - 'start'
8932 - '_header'
851 - '_id'
91 - '|'
29 - '>'
9125 - 'system'
27 - '<'
91 - '|'
408 - 'end'
8932 - '_header'
851 - '_id'
91 - '|'
29 - '>'
198 - '\n'
198 - '\n'
32 - 'A'
6369 - ' chat'
1990 - ' between'
264 - ' a'
22999 - ' curious'
1217 - ' user'
323 - ' and'
459 - ' an'
21075 - ' artificial'
11478 - ' intelligence'
18328 - ' assistant'
13 - '.'
```

Although for the GGUF conversion I had to apply a hack, because the conversion script for whatever reason doesn't work correctly on my system, so it may be that this is an unrelated issue.

JohannesGaessler commented 4 months ago

When I fed the same prompt to the llama.cpp tokenize binary I get the correct tokenization:

128000 -> '<|begin_of_text|>'
128000 -> '<|begin_of_text|>'
128006 -> '<|start_header_id|>'
  9125 -> 'system'
128007 -> '<|end_header_id|>'
   271 -> '

'
    32 -> 'A'
  6369 -> ' chat'
  1990 -> ' between'
   264 -> ' a'
 22999 -> ' curious'
  1217 -> ' user'
   323 -> ' and'
   459 -> ' an'
 21075 -> ' artificial'
 11478 -> ' intelligence'
 18328 -> ' assistant'
    13 -> '.'

So these are possibly two different issues. But in any case, I think it's worthwhile to check that the prompt you're using for testing is being properly tokenized.
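
One way to get a reference tokenization to compare against (a sketch that uses the HF tokenizer of the merged checkpoint as ground truth; the path is a placeholder):

```python
from transformers import AutoTokenizer

# Special tokens such as <|begin_of_text|> should come back as single IDs (128000, ...),
# not be split into '<', '|', 'begin', ... pieces as in the broken case above.
tok = AutoTokenizer.from_pretrained("./xmerge/NewModel")

prompt = ("<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
          "!!llama.cpp!!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n")

for tid in tok.encode(prompt, add_special_tokens=False):
    print(tid, repr(tok.decode([tid])))
```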

Sneakr commented 4 months ago

@JohannesGaessler

Ok I tested the GGUF model (F32) in oobabooga, and the outcome was the same as llama.cpp: the fine-tuning was not present. However, I also copied the AWQ model to oobabooga, and when loaded it produced the same broken inference, although it did work as intended (as I showed in my previous screenshot) when running it in Python using this code:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "./xmerge/NewModel/newmodel-awq"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "Who are you? "

chat = [
    {"role": "system", "content": ""},
    {"role": "user", "content": prompt},
]

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

tokens = tokenizer.apply_chat_template(
    chat,
    return_tensors="pt"
).cuda()

# Generate output
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=64,
    eos_token_id=terminators
)

I think you might be on to something; there could be some issue with tokenization. This is my first ever AWQ quant and my first time running AWQ at all, so I need someone to verify this with the notebook.

Sneakr commented 4 months ago

@JohannesGaessler Great findings! This led me to fix the issue!!

Here's how I got it to work as expected in oobabooga. Both GGUF and AWQ produced the same issue, and it was indeed a tokenization issue.

1: Had to add a custom stopping token:

(screenshot: custom stopping strings setting)

2: The template: https://github.com/mamei16/LLM_Web_search/blob/main/instruction_templates/Llama-3.yaml

  {%- set ns = namespace(found=false) -%}
  {%- for message in messages -%}
      {%- if message['role'] == 'system' -%}
          {%- set ns.found = true -%}
      {%- endif -%}
  {%- endfor -%}
  {%- for message in messages %}
      {% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}
      {% if loop.index0 == 0 %}
          {% set content = '<|begin_of_text|>' + content %}
      {% endif %}
      {{- content -}}
  {%- endfor -%}
  {%- if add_generation_prompt -%}
      {{- '<|start_header_id|>' + 'assistant' + '<|end_header_id|>\n\n' -}}
  {%- endif -%}

With this, it worked fine with ooba!! (Not sure to what degree, although it seems to work fine so far.)
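
For anyone who wants to sanity-check the template outside of ooba, here's a small sketch of rendering a prompt with the HF tokenizer's chat template (the path is a placeholder; at inference time <|eot_id|> also needs to be treated as a stop token):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./xmerge/NewModel")  # hypothetical merged model path

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Who are you?"},
]

# Render the conversation and append the assistant header, mirroring what the
# working Oobabooga template above produces.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # should start with <|begin_of_text|> and end with the assistant header
```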

Would be good to verify this from more parties! Great work again @JohannesGaessler