Closed maxxk closed 7 months ago
I played with finetuning parameters for Zephyr model and got the following result.
A converted LoRA model (rank 32, alpha 64) works fine. But that LoRA applies the adapter only to the "q_proj", "k_proj", "v_proj", "o_proj" layers, and it has significantly fewer coefficients than the full LoRA from the finetune example.
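To put a rough number on "significantly fewer coefficients", here is a back-of-the-envelope sketch. It assumes the usual Mistral-7B shapes (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers), none of which are stated in this thread, and it assumes the broader adapter also covers the three MLP projections. The actual tensor list used by the finetune example may differ, so treat the second figure as illustrative only.

```python
# Assumed Mistral-7B shapes (not stated in the thread): hidden size 4096,
# key/value projection width 1024, MLP width 14336, 32 transformer layers.
HIDDEN, KV_DIM, FFN_DIM, N_LAYERS = 4096, 1024, 14336, 32

# (d_in, d_out) of each attention projection within one layer.
ATTENTION = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}
# Hypothetical extra targets that a broader adapter might cover.
MLP = {
    "gate_proj": (HIDDEN, FFN_DIM),
    "up_proj": (HIDDEN, FFN_DIM),
    "down_proj": (FFN_DIM, HIDDEN),
}


def lora_params(targets, rank, n_layers=N_LAYERS):
    """A rank-r adapter on a d_in x d_out matrix stores r * (d_in + d_out) values."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in targets.values())
    return per_layer * n_layers


attention_only = lora_params(ATTENTION, rank=32)
attention_and_mlp = lora_params({**ATTENTION, **MLP}, rank=32)
print(attention_only, attention_and_mlp)
```

Under these assumptions the attention-only adapter holds roughly a third of the coefficients of the attention-plus-MLP one at the same rank.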
Garbage output starts at rank 24. At rank 23 the model still makes perfect sense.
rank 4 - ok
rank 8 - ok
rank 16 - ok
rank 20 - ok
rank 22 - ok
rank 23 - ok
rank 24 - garbage (some words/phrases almost make sense, but are not English)
rank 32 - garbage (random characters)
Rank 24 with scale 0.5 or less also works fine:
rank 24 with scale 0.125 - ok
rank 24 with scale 0.25 - ok
rank 24 with scale 0.5 - ok
rank 24 with scale 0.55 - almost ok (how to install Joomla locally)
rank 24 with scale 0.6 - almost ok (flask tutorial)
rank 24 with scale 0.7 - low quality (offtopic but about creating a website)
rank 24 with scale 0.75 - garbage (offtopic with some typos)
rank 24 with scale 0.9 - garbage (nonsensical English)
rank 24 with scale 1 - garbage (nonsensical English)
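For context on why the scale matters so much, here is a minimal sketch of the conventional LoRA merge, W' = W + scale * (alpha / rank) * B·A. This is the textbook formulation; whether llama.cpp applies the user-supplied scale in exactly this way is an assumption here, and the function names are made up for illustration.

```python
def effective_scale(alpha, rank, user_scale=1.0):
    """Multiplier applied to the low-rank update B @ A (conventional LoRA)."""
    return user_scale * alpha / rank


def apply_lora(W, A, B, alpha, rank, user_scale=1.0):
    """Return W + s * (B @ A) using plain lists.

    W is d_out x d_in, B is d_out x rank, A is rank x d_in.
    """
    s = effective_scale(alpha, rank, user_scale)
    d_out, d_in = len(W), len(W[0])
    return [
        [
            W[i][j] + s * sum(B[i][k] * A[k][j] for k in range(rank))
            for j in range(d_in)
        ]
        for i in range(d_out)
    ]


# Tiny illustrative example: a rank-1 update on a 2x2 identity matrix.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
merged = apply_lora(W, A, B, alpha=2, rank=1, user_scale=0.5)
print(merged)
```

Since the update enters linearly, halving the scale halves the perturbation to every adapted weight. That fits the observation above that a bad adapter becomes tolerable at scale 0.5 and below, but it does not explain why the adapter is bad in the first place.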
Maybe it is just not enough training for random LoRA to make any sense?
I tried rank 24 with 133 iterations (loss 11 -> 5.8), and even scale 0.5 and 0.3 now produce garbage, even with a prompt from shakespeare.txt (there is no way to know whether it was in the training sample for the LoRA). So it doesn't look like more training makes things better.
I experienced the same, but I tried it with rank 4 and more iterations, so it took some time until I hit it.
Edit: I can confirm that a higher rank breaks it after one iteration. AMD CPU here.
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
CPU family: 23
Model: 1
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 1
Frequency boost: enabled
CPU(s) scaling MHz: 67%
CPU max MHz: 3400.0000
CPU min MHz: 2200.0000
BogoMIPS: 6789.13
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 512 KiB (16 instances)
L1i: 1 MiB (16 instances)
L2: 8 MiB (16 instances)
L3: 32 MiB (4 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Mitigation; untrained return thunk; SMT vulnerable
Spec rstack overflow: Mitigation; safe RET
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
uname -a
Linux pc 6.5.9-arch2-1 #1 SMP PREEMPT_DYNAMIC Thu, 26 Oct 2023 00:52:20 +0000 x86_64 GNU/Linux
I hoped that #3974 would fix this issue, but rank 32 on b1497 still gives garbage output after one iteration.
Are you sure this is Mistral/Zephyr specific? Is this "single finetune iteration" test something you've done on LLaMA models without issue?
I have a general fix, #4082, and I'm wondering if it fixes your issue, especially since you're looking at a single iteration. My testing failed to show much improvement but I was doing 30 iterations. I did notice that 'loss' was lower for the first few finetune iterations, after my fix.
Update: I've experimented with a single finetune iteration, and: a) I can repro the bug on openllama-3b-v2; b) my fix seems to fix it.
Details: These are both f16 base + lora 1 iteration trained on shakespeare.txt
@AndrewGodfrey I just tried a single finetune iteration without your fix and one with your fix. Your branch did indeed prevent the nonsense. I will train for more iterations over the next few hours to see what I can get. :)
It's strange, but I can reproduce the issue only on Mistral, on both the master and #4082 branches. So maybe the Mistral architecture is the problem (it is a bit different from LLaMA, IIRC).
I tried tinyllama before submitting the issue, and openllama-3b-v2 (on current master and on the PR branch); for both of them a single-iteration-trained LoRA resulted in almost fine output. @AndrewGodfrey actually, in your example the output without the fix is not that bad either. Also, the finetune program refused to work on f16 for me, so for my test I quantized f16 to q8_0.
For Mistral/Zephyr, on both current master and the PR branch, the result for a single-step LoRA is completely incoherent. This is an example of a LoRA trained using finetune from the #4082 branch on the Zephyr model:
Yes, I suppose the effect could be worse on some models than others. You are also using an unusually high value for lora-r, of 64 (at least, when compared to the default of 4, and the values explored in the LoRA paper).
Oops I misread your earlier report. So my fix didn’t help with Mistral.
Something I just realized may be true (but I haven’t tried it yet) is that the “train” example can be used for fine-tuning. The example named “finetune” is specifically for LoRA finetuning. So I wonder if this repros with “train” or is specific to the LoRA case. Again, this is my understanding from reading train.cpp recently but I haven’t tried it myself yet.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Expected Behavior
I expected finetune to produce a usable LoRA adapter for all supported models.
Current Behavior
For Mistral models (I tried both Mistral and Zephyr, in Q8_0, Q5_K_M and Q5_0), the model outputs gibberish with the LoRA after a single finetune iteration.
On the same PC, finetuning produces a usable LoRA adapter for TinyLlama (I tried Q8_0, Q5_K_M and Q5_0).
First few tokens for "Building a website can be done in 10 simple steps:" prompt:
Base Mistral model:
Mistral with LoRA (single finetune iteration on shakespeare.txt from example):
Environment and Context
Core i7 4770 CPU
$ lscpu
$ uname -a
Failure Information (for bugs)
For Mistral models (I tried both Mistral and Zephyr, in Q8_0, Q5_K_M and Q5_0), the model outputs gibberish with the LoRA after a single finetune iteration.
Steps to Reproduce
I used pre-converted models from TheBloke:
This issue can be reproduced using shakespeare.txt from the finetune example, but I got the same results with a different dataset.
Finetuning command:
For Zephyr (which also produces an invalid LoRA) and TinyLlama (which produces a valid LoRA) I changed only the model-base parameter. Between experiments I removed all finetune checkpoints and LoRAs.
Testing without LoRA:
Testing with LoRA:
P.S. As a final part of this bug report I would like to thank all contributors for this amazing piece of software. It is a pleasure to use, and it gives an ability to experiment with LLMs even for those of us without top GPUs.