ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Regressions on IQ3_XXS over time #5856

Open GlasslessPizza opened 6 months ago

GlasslessPizza commented 6 months ago

If I quantize this gguf with this imatrix using this command:

quantize.exe --allow-requantize --imatrix mixtral-8x7b-instruct-v0.1.imatrix mixtral-8x7b-instruct-v0.1.Q8_0.gguf mixtral-8x7b-instruct-v0.1.IQ3_XXS.gguf IQ3_XXS

and I calculate perplexity with this command:

perplexity.exe -f wiki.test.raw --chunks 1000 --seed 42 --threads 8 --log-disable --no-mmap --mlock --ctx-size 512 --n-gpu-layers 999 --model mixtral-8x7b-instruct-v0.1.IQ3_XXS.gguf

I get three markedly different PPL values from three different versions of quantize.exe, everything else being equal:

b2037 31-1-2024 : 4.7009 +/- 0.02569
b???? 25-2-2024 : 4.7249 +/- 0.02576
b2329 03-3-2024 : 4.8491 +/- 0.02636

I suspect that there have been multiple cumulative regression events on the IQ3_XXS quantization implementation between b2037 and b2329.

All of this is with cu12.2.0 on Windows 10.
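
For reference, here's a minimal batch sketch of how the comparison can be scripted across several builds; the builds\<tag> folder layout and the per-build output names are just illustrative, the two commands themselves are the same pair as above:

@echo off
:: run the identical quantize + perplexity pair once per extracted release build
for %%B in (b2037 b2329) do (
  builds\%%B\quantize.exe --allow-requantize --imatrix mixtral-8x7b-instruct-v0.1.imatrix mixtral-8x7b-instruct-v0.1.Q8_0.gguf mixtral-8x7b-instruct-v0.1.%%B.IQ3_XXS.gguf IQ3_XXS
  builds\%%B\perplexity.exe -f wiki.test.raw --chunks 1000 --seed 42 --threads 8 --log-disable --no-mmap --mlock --ctx-size 512 --n-gpu-layers 999 --model mixtral-8x7b-instruct-v0.1.%%B.IQ3_XXS.gguf > ppl-%%B.log 2>&1
)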

schmorp commented 6 months ago

Out of curiosity, did the resulting gguf sizes also change?

GlasslessPizza commented 6 months ago

Out of curiosity, did the resulting gguf sizes also change?

Not enough to justify the difference:

b2037 31-1-2024 : 18,308,777,920 bytes
b???? 25-2-2024 : 18,307,082,176 bytes
b2329 03-3-2024 : 18,240,407,488 bytes

Artefact2 commented 6 months ago

Can you try just before #5829?

GlasslessPizza commented 6 months ago

Can you try just before #5829?

Sure, that would be b2314:

b2037 | 31-jan-2024 | 4.7009 +/- 0.02569 | 18,308,777,920 bytes
b???? | 25-feb-2024 | 4.7249 +/- 0.02576 | 18,307,082,176 bytes
b2314 | 02-mar-2024 | 4.8530 +/- 0.02642 | 18,240,407,488 bytes
b2329 | 03-mar-2024 | 4.8491 +/- 0.02636 | 18,240,407,488 bytes

Almost indistinguishable from b2329.

bladeswill commented 6 months ago

When I try to quantize "TinyLlama/TinyLlama-1.1B-Chat-v1.0" with:

quantize --imatrix TinyLlama-1.1B-Chat-v1.0\ggml-model-f16.gguf TinyLlama-1.1B-IQ3_XXS.gguf IQ3_XXS

on b2281 - b2356, it reports an error:

[ 1/ 201] output.weight - [ 2048, 32000, 1, 1], type = f16, quantizing to q5_K .. size = 125.00 MiB -> 42.97 MiB
[ 2/ 201] token_embd.weight - [ 2048, 32000, 1, 1], type = f16, quantizing to iq3_s ..
=================================================================
iq3xs_init_impl(grid_size = 512)
iq3xs_init_impl: 24733 neighbours in total
size = 125.00 MiB -> 26.86 MiB
[ 3/ 201] blk.0.attn_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MB
[ 4/ 201] blk.0.ffn_down.weight - [ 5632, 2048, 1, 1], type = f16, quantizing to q4_K .. size = 22.00 MiB -> 6.19 MiB
[ 5/ 201] blk.0.ffn_gate.weight - [ 2048, 5632, 1, 1], type = f16, quantizing to iq3_xxs ..
=================================================================
iq3xs_init_impl(grid_size = 256)
iq3xs_init_impl: 18985 neighbours in total
size = 22.00 MiB -> 4.21 MiB
[ 6/ 201] blk.0.ffn_up.weight - [ 2048, 5632, 1, 1], type = f16, quantizing to iq3_xxs .. size = 22.00 MiB -> 4.21 MiB
[ 7/ 201] blk.0.ffn_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MB
[ 8/ 201] blk.0.attn_k.weight - [ 2048, 256, 1, 1], type = f16,

============================================================
Missing importance matrix for tensor blk.0.attn_k.weight in a very low-bit quantization
The result will be garbage, so bailing out

llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.0.attn_k.weight in a very low-bit quantization
main: failed to quantize model from 'My gguf path'

but on b2223 (from llama-b2223-bin-win-cublas-cu12.2.0-x64.zip) it works, so something must have changed between those versions. Those changes also prevent all of the i-series ggufs (such as IQ3_XXS) I make with the newer versions from working properly on LM Studio v0.2.16, whereas the i-series ggufs made with the b2223 quantize work fine on 0.2.16.

ggerganov commented 6 months ago

llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.0.attn_k.weight in a very low-bit quantization

Likely your imatrix is messed up - generate a new one:

./imatrix -m models/tinyllama-1b/ggml-model-f16.gguf -f some-data.txt -ngl 99
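
(Follow-up sketch, not part of the advice above: once imatrix finishes, pass the regenerated matrix back to quantize. The default output name imatrix.dat and the -o override are assumptions here; adjust to whatever your run actually wrote.)

./quantize --imatrix imatrix.dat models/tinyllama-1b/ggml-model-f16.gguf TinyLlama-1.1B-IQ3_XXS.gguf IQ3_XXS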

bladeswill commented 6 months ago

llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.0.attn_k.weight in a very low-bit quantization

Likely your imatrix is messed up - generate a new one:

./imatrix -m models/tinyllama-1b/ggml-model-f16.gguf -f some-data.txt -ngl 99

Thank you, you are right. I switched to a different txt file and used "b2281 - b2356" again to create a new imatrix.dat, and then successfully created TinyLlama-1.1B-IQ3_XXS.gguf.

But the problem that i-series gguf models (such as IQ3_XXS) produced by the new versions don't work properly on LM Studio v0.2.16 still exists. Maybe I should wait for an LM Studio update.

GlasslessPizza commented 6 months ago

I noticed that there are no fluctuations with other quantization types (such as Q3_K_M, Q4_K_S or Q4_0), but there are some variations on smaller non-Mixtral models, so I tested a large number of llama.cpp releases since b2015 (where IQ3_XXS was introduced) on llama-2-7b.Q8_0.gguf:

b2015 | 6.3080 +/- 0.03552 | 2687315648 bytes
...
b2252 | 6.3065 +/- 0.03551 | 2687315648 bytes
b2253 | 6.2925 +/- 0.03533 | 2687315648 bytes
...
b2274 | 6.2927 +/- 0.03533 | 2687315648 bytes
b2275 | 6.3261 +/- 0.03564 | 2585390784 bytes
...
b2314 | 6.3262 +/- 0.03565 | 2585390784 bytes
b2316 | 6.3139 +/- 0.03570 | 2585390784 bytes
...
b2364 | 6.3141 +/- 0.03571 | 2585390784 bytes

The results for llama-2-7b.Q8_0 are sane (the only regression, at b2275, coincides with a decrease in model size), so unfortunately I'll have to test specifically on Mixtral, which is going to take a while.
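
For the record, a rough batch sketch of how the per-release testing can be set up by pulling the prebuilt Windows binaries for each tag; the asset name below follows the cu12.2.0 pattern mentioned earlier in the thread and is only an assumption for other tags:

@echo off
:: download and unpack one prebuilt Windows build per release tag into builds\<tag>
for %%B in (b2252 b2253 b2274 b2275) do (
  mkdir builds\%%B
  curl -L -o llama-%%B.zip https://github.com/ggerganov/llama.cpp/releases/download/%%B/llama-%%B-bin-win-cublas-cu12.2.0-x64.zip
  tar -xf llama-%%B.zip -C builds\%%B
)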

GlasslessPizza commented 6 months ago

I managed to finish the Mixtral test after a week of effort.

Here are the results using the same imatrix for all versions (the one from my OP). The only parameter change from my OP is the number of perplexity chunks, which I reduced to 200 since I noticed you don't need that many chunks to see the variations:

b2015 | 4.8074 +/- 0.04731 | 18308777920 bytes
...
b2136 | 4.8086 +/- 0.04731 | 18308777920 bytes
b2137 | 4.8263 +/- 0.04754 | 18307082176 bytes
...
b2252 | 4.8263 +/- 0.04754 | 18307082176 bytes
b2253 | 4.8394 +/- 0.04753 | 18307082176 bytes
...
b2274 | 4.8393 +/- 0.04753 | 18307082176 bytes
b2275 | 4.9652 +/- 0.04869 | 18238711744 bytes
...
b2286 | 4.9653 +/- 0.04869 | 18238711744 bytes
b2287 | 4.9454 +/- 0.04841 | 18240407488 bytes
...
b2436 | 4.9467 +/- 0.04839 | 18240407488 bytes

Here are the results when recreating the imatrix from scratch for every version:

b2015 | 4.7965 +/- 0.04708 | 18308777920 bytes
...
b2136 | 4.7984 +/- 0.04712 | 18308777920 bytes
b2137 | 4.8194 +/- 0.04738 | 18307082176 bytes
...
b2252 | 4.8195 +/- 0.04738 | 18307082176 bytes
b2253 | 4.8256 +/- 0.04726 | 18307082176 bytes
...
b2274 | 4.8254 +/- 0.04725 | 18307082176 bytes
b2275 | 4.9476 +/- 0.04840 | 18238711744 bytes
...
b2286 | 4.9472 +/- 0.04840 | 18238711744 bytes
b2287 | 4.9296 +/- 0.04814 | 18240407488 bytes
...
b2314 | 4.9296 +/- 0.04814 | 18240407488 bytes
b2316 | 4.9333 +/- 0.04822 | 18240407488 bytes
...
b2436 | 4.9336 +/- 0.04822 | 18240407488 bytes

The ellipses indicate omissions for brevity (no changes in that range).

Conclusion: unlike llama-2-7b, all of the variations found for Mixtral are anomalous (they can't be explained by a correspondingly large change in model size). The variations happened at b2137, b2253, b2275 and b2287, plus an extra one at b2316 for the recalculated imatrix.

Quantizing Mixtral to IQ3_XXS with llama.cpp b2015 instead of a recent version produces a gguf that performs 0.137 ppl better.

GlasslessPizza commented 5 months ago

I tried b2699 hoping the regressions were fixed along the way:

ikawrakow imatrix:

b2436 | 4.9467 +/- 0.04839 | 18240407488 bytes
b2699 | 4.9473 +/- 0.04839 | 18240407488 bytes

Recalculated imatrix:

b2436 | 4.9336 +/- 0.04822 | 18240407488 bytes
b2699 | 5.3510 +/- 0.05293 | 18240407488 bytes

Alas, quite the contrary: it's the biggest regression yet. We are now a whopping +0.5436 ppl worse than b2015 with the recreated imatrix. It just keeps getting worse.

github-actions[bot] commented 3 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

GlasslessPizza commented 2 months ago

The problem is still very much present. I tried b3334 today:

ikawrakow imatrix:

b3334 | 4.9494 +/- 0.04843 | 18240407648 bytes

Recalculated imatrix:

b3334 | 5.3557 +/- 0.05299 | 18240407840 bytes

Worse yet again.

Also, I noticed a lot of "llama_model_quantize_internal: did not find weights for" log lines. I suspect that at some point since b2436, imatrix generation stopped working completely.
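
(A quick way to gauge how many tensors are affected, sketched with stock Windows tooling and the message text from those log lines; the filenames are the ones from my OP, and the binary may be called llama-quantize.exe in newer releases:)

quantize.exe --allow-requantize --imatrix mixtral-8x7b-instruct-v0.1.imatrix mixtral-8x7b-instruct-v0.1.Q8_0.gguf mixtral-8x7b-instruct-v0.1.IQ3_XXS.gguf IQ3_XXS 2>&1 | findstr /c:"did not find weights"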

GlasslessPizza commented 1 month ago

I guess I'll bump this issue every two weeks to prevent the bot from autoclosing it; this is my life now. I tried b3484 today:

ikawrakow imatrix:

b3484 | 4.9538 +/- 0.04846 | 18240407648 bytes

Recalculated imatrix:

b3484 | 5.3537 +/- 0.05293 | 18240407840 bytes

The fixed imatrix is worse yet again, and the imatrix recreated ex novo is still broken.

slaren commented 1 month ago

I have added the bug tag that will prevent the bot from closing the issue. Pointing at the specific PRs that introduced a regression would dramatically improve the chances of this being fixed. If there are multiple regressions, it might be better to create a different issue for each one, with instructions to reproduce the issue with the smallest model possible.
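
For reference, a minimal sketch of how the build numbers listed above can be mapped to concrete commits (and the PRs usually named in their subject lines), assuming the bXXXX release tags are present in the repository:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git log --oneline b2136..b2137
git log --oneline b2274..b2275

Each range lists the commit(s) that landed between the last good and the first regressed build.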

GlasslessPizza commented 1 month ago

I have added the bug tag that will prevent the bot from closing the issue. Pointing at the specific PRs that introduced a regression would dramatically improve the chances of this being fixed. If there are multiple regressions, it might be better to create a different issue for each one, with instructions to reproduce the issue with the smallest model possible.

Thanks! A few posts above I tested every release of llama.cpp since b2015 and pointed out the specific releases that introduced regressions. I've since stopped testing every release, as the sheer number of llama.cpp releases outpaced the free time I have. I may open a separate issue for the broken imatrix creation, though.

In other news, I tried b3599 today:

ikawrakow imatrix:

b3599 | 4.9534 +/- 0.04846 | 18240407648 bytes

Recalculated imatrix:

b3599 | 5.3574 +/- 0.05298 | 18240407840 bytes

No change to note since b3484.

GlasslessPizza commented 1 week ago

In order to fix the imatrix creation, I had to recreate the base Q8_0 gguf from the original repo using a new llama.cpp version:

b3680 | 4.9383 +/- 0.04829 | 18242464544 bytes

Alas, this only puts the ppl at the level of b2436, if not slightly worse.