ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Finetune produces unusable LoRA for Mistral model #3852

Closed maxxk closed 7 months ago

maxxk commented 1 year ago

Expected Behavior

I expected finetune to produce a usable LoRA adapter for all supported models.

Current Behavior

For Mistral models (I tried both Mistral and Zephyr; Q8_0, Q5_K_M, Q5_0), the model outputs gibberish with the LoRA applied after a single finetune iteration.

On the same PC, finetuning produces a usable LoRA adapter for TinyLlama (I tried Q8_0, Q5_K_M, Q5_0).

First few tokens for "Building a website can be done in 10 simple steps:" prompt:

Base Mistral model:

Building a website can be done in 10 simple steps:
1. Come up with an idea for your site.
2. Do some research on the web to see what’s out there.

Mistral with LoRA (single finetune iteration on shakespeare.txt from example):

Building a website can be done in 10 simple steps: (3 . in.
 A,
! (
 P! A,  PAM,IT A) MER W W 0

Environment and Context

Core i7 4770 CPU

$ lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
    CPU family:          6
    Model:               60
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            3
    CPU(s) scaling MHz:  100%
    CPU max MHz:         3900.0000
    CPU min MHz:         800.0000
    BogoMIPS:            6784.88
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs
                          bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt
                         aes xsave avx f16c rdrand lahf_lm abm cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsav
                         eopt dtherm ida arat pln pts
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    8 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Unknown: No mitigations
  Retbleed:              Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Vulnerable: No microcode
  Tsx async abort:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable

$ uname -a

Linux maxxk-pc 6.1.29 #1-NixOS SMP PREEMPT_DYNAMIC Wed May 17 09:54:00 UTC 2023 x86_64 GNU/Linux

Failure Information (for bugs)

For Mistral models (I tried both Mistral and Zephyr; Q8_0, Q5_K_M, Q5_0), the model outputs gibberish with the LoRA applied after a single finetune iteration.

Steps to Reproduce

I used pre-converted models from TheBloke:

This issue can be reproduced using shakespeare.txt from the finetune example, but I got the same results with a different dataset.

Finetuning command:

../llama.cpp/bin/finetune \
  --model-base mistral-7b-v0.1.Q8_0.gguf \
  --train-data shakespeare.txt  \
  --lora-out lora-Q8_0.gguf \
  --save-every 1 \
  --threads 4 \
  --ctx 64 \
  --batch 1 \
  --grad-acc 1 \
  --lora-r 64 \
  --lora-alpha 64 \
  --adam-iter 1 \
  --use-checkpointing \
  --use-flash \
  --escape \
  --seed 1

For Zephyr (which also produces an invalid LoRA) and TinyLlama (which produces a valid one) I changed only the --model-base parameter. Between experiments I removed all finetune checkpoints and LoRAs.
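For context, the adapter trained by finetune presumably follows the standard LoRA formulation, where the adapted weight is W' = W + (alpha/r)·B·A, so with --lora-alpha equal to --lora-r the low-rank delta is applied at full strength. A minimal numpy sketch (shapes and init values are illustrative assumptions, not llama.cpp internals):

```python
import numpy as np

rng = np.random.default_rng(1)

d_out, d_in = 32, 64   # illustrative shapes, far smaller than Mistral's
r, alpha = 64, 64      # matches --lora-r 64 --lora-alpha 64 above

W = rng.standard_normal((d_out, d_in))       # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01    # low-rank factors (toy values)
B = rng.standard_normal((d_out, r)) * 0.01

scaling = alpha / r                          # = 1.0 here
W_adapted = W + scaling * (B @ A)            # weight used at inference

print(W_adapted.shape, scaling)
```

With alpha = r the scaling factor is 1, so whatever noise remains in B·A after a single iteration lands on the base weights undamped.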

Testing without LoRA:

../llama.cpp/bin/main -m ./mistral-7b-v0.1.Q8_0.gguf -p "Building a website can be done in 10 simple steps:"

Testing with LoRA:

 ../llama.cpp/bin/main -m ./mistral-7b-v0.1.Q8_0.gguf -p "Building a website can be done in 10 simple steps:" --lora ./lora-Q8_0.gguf

P.S. As a final part of this bug report, I would like to thank all contributors for this amazing piece of software. It is a pleasure to use, and it gives even those of us without top-end GPUs the ability to experiment with LLMs.

maxxk commented 1 year ago

I played with finetuning parameters for Zephyr model and got the following result.

  1. A converted LoRA model (rank 32, alpha 64) works fine. But that LoRA applies the adapter only to the "q_proj", "k_proj", "v_proj", "o_proj" layers, and it has significantly fewer coefficients than the full LoRA from the finetune example.

  2. Garbage output starts at rank 24; at rank 23 the model still makes perfect sense:
     - rank 4 - ok
     - rank 8 - ok
     - rank 16 - ok
     - rank 20 - ok
     - rank 22 - ok
     - rank 23 - ok
     - rank 24 - garbage (some words/phrases almost make sense, but are not English)
     - rank 32 - garbage (random characters)

  3. Rank 24 with scale 0.5 or less also works fine:
     - rank 24, scale 0.125 - ok
     - rank 24, scale 0.25 - ok
     - rank 24, scale 0.5 - ok
     - rank 24, scale 0.55 - almost ok (how to install Joomla locally)
     - rank 24, scale 0.6 - almost ok (Flask tutorial)
     - rank 24, scale 0.7 - low quality (off-topic, but about creating a website)
     - rank 24, scale 0.75 - garbage (off-topic with some typos)
     - rank 24, scale 0.9 - garbage (nonsensical English)
     - rank 24, scale 1 - garbage (nonsensical English)
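One plausible reading of the rank threshold: for a barely-trained adapter whose factors are still close to random, the size of the delta B·A grows with the rank, so at a fixed scale a higher rank means a larger random perturbation of the base weights. A toy numpy illustration (i.i.d. factors stand in for a one-iteration adapter; this is not llama.cpp's actual initialization):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64  # illustrative dimension, much smaller than Mistral's

def delta_norm(r: int) -> float:
    # i.i.d. small factors as a stand-in for a barely-trained adapter
    B = rng.standard_normal((d, r)) * 0.01
    A = rng.standard_normal((r, d)) * 0.01
    return float(np.linalg.norm(B @ A))

norms = {r: delta_norm(r) for r in (4, 8, 16, 24, 32)}
for r, n in norms.items():
    print(f"rank {r:2d}: ||B@A|| ~ {n:.5f}")
```

Under these toy assumptions the Frobenius norm of the delta grows roughly like the square root of the rank, which would be consistent with low ranks staying coherent and higher ranks tipping the model into gibberish.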

Maybe it is just not enough training for a random LoRA to make any sense?

I tried rank 24 with 133 iterations (loss 11 -> 5.8), and even scales 0.5 and 0.3 now produce garbage, even with a prompt from shakespeare.txt (no way to know whether it was in the training sample for the LoRA). So it doesn't look like more training makes things better.

dduenker commented 1 year ago

I experienced the same, but tried it with rank 4 and more iterations, so it took some time until I hit it.

Edit: I can confirm that a higher rank breaks it after one iteration. AMD CPU here.

lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen Threadripper 1950X 16-Core Processor
    CPU family:          23
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           1
    Stepping:            1
    Frequency boost:     enabled
    CPU(s) scaling MHz:  67%
    CPU max MHz:         3400.0000
    CPU min MHz:         2200.0000
    BogoMIPS:            6789.13
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_op
                         t pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor sss
                         e3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misaligns
                         se 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall
                         fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv
                         svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_r
                         ecov succor smca sev
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   512 KiB (16 instances)
  L1i:                   1 MiB (16 instances)
  L2:                    8 MiB (16 instances)
  L3:                    32 MiB (4 instances)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-31
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Mitigation; untrained return thunk; SMT vulnerable
  Spec rstack overflow:  Mitigation; safe RET
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

uname -a

Linux pc 6.5.9-arch2-1 #1 SMP PREEMPT_DYNAMIC Thu, 26 Oct 2023 00:52:20 +0000 x86_64 GNU/Linux

maxxk commented 1 year ago

I hoped that #3974 would fix this issue, but rank 32 on b1497 still gives garbage output after one iteration.

AndrewGodfrey commented 1 year ago

Are you sure this is Mistral/Zephyr-specific? Is this "single finetune iteration" test something you've done on Llama models without issue?

I have a general fix, #4082, and I'm wondering if it fixes your issue, especially since you're looking at a single iteration. My testing failed to show much improvement, but I was doing 30 iterations. I did notice that 'loss' was lower for the first few finetune iterations after my fix.

AndrewGodfrey commented 1 year ago

Update: I've experimented with a single finetune iteration, and:
a) I can reproduce the bug on openllama-3b-v2.
b) My fix appears to fix it.

Details: both runs use an f16 base plus a LoRA trained for 1 iteration on shakespeare.txt.

Without the fix:

```
> Describe in bullet points, how to design a web page.
- [ ] HTML5 is the most popular markup language for creating websites and mobile apps today because it's easy to learn and use. It also has many features that make it easier than ever before to create beautiful, responsive sites with rich media content like videos, audio, images, and more.
### Instruction:
```

With the fix:

```
> Describe in bullet points, how to design a web page.
- The first thing you need is the domain name and hosting service provider (hosting). You can get this from your ISP or any other company that provides internet services for businesses like yours. Once you have these two things in place then it's time to start designing a website! This will be done using HTML, CSS, JavaScript etc...
- The next step is creating the content of each page on your site (text, images and videos). You can do this by using a text editor like Notepad or WordPad. Once you have all of your content ready then it's time to upload it onto your server! This will be done using FTP software such as FileZilla or WinSCP etc...
- The final step is to make sure that everything works properly on your site before publishing it online (testing). You can do this by using a web browser such as Internet Explorer, Firefox etc... Once you're happy with the results then publish it! This will be done using an FTP client like FileZilla or WinSCP.
- If you have any questions about how to design a website please feel free to ask in the comments below :)
```
dduenker commented 1 year ago

@AndrewGodfrey I just tried a single finetune iteration without your fix and one with it. The result I got with your branch did indeed avoid the nonsense. I will train some more iterations over the next few hours to see what I can get. :)

maxxk commented 1 year ago

It's strange, but I can reproduce the issue only on Mistral, both with the master and #4082 branches. So maybe the Mistral architecture is the problem (it is slightly different from Llama, IIRC).

I tried TinyLlama before submitting the issue, and openllama-3b-v2 (on current master and on the PR branch); for both of them a single-iteration-trained LoRA resulted in almost-fine output. @AndrewGodfrey, actually, in your example the output without the fix isn't that bad either. Also, the finetune program refused to work on f16 for me, so for my test I quantized f16 to Q8_0.

For Mistral/Zephyr, on current master and on the PR branch, the result for a single-step LoRA is completely incoherent. This is an example from a LoRA trained using finetune from the #4082 branch on a Zephyr model:

```
Building a website can be done in 10 simple steps:) “” (insert)” �) “?!” A lot” ( ) ✔” ) : ) )? : ” ( ) !”
```
AndrewGodfrey commented 1 year ago

Yes, I suppose the effect could be worse on some models than others. You are also using an unusually high value for --lora-r, 64 (at least compared to the default of 4, and the values explored in the LoRA paper).
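For a sense of scale: a LoRA pair for one d_out × d_in matrix trains r·(d_in + d_out) coefficients, so r = 64 trains 16× as many parameters per adapted matrix as the default r = 4. A quick back-of-the-envelope helper (`lora_params` is a hypothetical name; d = 4096 is an assumed Mistral-7B-like hidden size, not read from the GGUF):

```python
def lora_params(r: int, d_in: int, d_out: int) -> int:
    # one adapter pair: A has shape (r, d_in), B has shape (d_out, r)
    return r * (d_in + d_out)

d = 4096  # assumed hidden size
for r in (4, 24, 64):
    print(f"r={r:2d}: {lora_params(r, d, d):,} coefficients per square matrix")
```

The count is linear in r, so doubling the rank doubles the trainable coefficients for each adapted matrix.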

AndrewGodfrey commented 12 months ago

Oops, I misread your earlier report. So my fix didn't help with Mistral.

Something I just realized may be true (but I haven't tried it yet) is that the "train" example can be used for fine-tuning; the example named "finetune" is specifically for LoRA finetuning. So I wonder whether this reproduces with "train" or is specific to the LoRA case. Again, this is my understanding from reading train.cpp recently, but I haven't tried it myself yet.

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.