GlasslessPizza opened 6 months ago
Out of curiosity, did the resulting gguf sizes also change?
Not enough to justify the difference:
b2037 | 31-jan-2024 | 18,308,777,920 bytes
b???? | 25-feb-2024 | 18,307,082,176 bytes
b2329 | 03-mar-2024 | 18,240,407,488 bytes
Can you try just before #5829?
Sure, that would be b2314:
b2037 | 31-jan-2024 | 4.7009 +/- 0.02569 | 18,308,777,920 bytes
b???? | 25-feb-2024 | 4.7249 +/- 0.02576 | 18,307,082,176 bytes
b2314 | 02-mar-2024 | 4.8530 +/- 0.02642 | 18,240,407,488 bytes
b2329 | 03-mar-2024 | 4.8491 +/- 0.02636 | 18,240,407,488 bytes
Almost indistinguishable from b2329.
When I try to quantize "TinyLlama/TinyLlama-1.1B-Chat-v1.0" with
quantize --imatrix TinyLlama-1.1B-Chat-v1.0\ggml-model-f16.gguf TinyLlama-1.1B-IQ3_XXS.gguf IQ3_XXS
on b2281 - b2356, it reports an error:
[   1/ 201] output.weight - [ 2048, 32000, 1, 1], type = f16, quantizing to q5_K .. size = 125.00 MiB -> 42.97 MiB
[   2/ 201] token_embd.weight - [ 2048, 32000, 1, 1], type = f16, quantizing to iq3_s ..
=================================================================
iq3xs_init_impl(grid_size = 512)
iq3xs_init_impl: 24733 neighbours in total
size = 125.00 MiB -> 26.86 MiB
[   3/ 201] blk.0.attn_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MB
[   4/ 201] blk.0.ffn_down.weight - [ 5632, 2048, 1, 1], type = f16, quantizing to q4_K .. size = 22.00 MiB -> 6.19 MiB
[   5/ 201] blk.0.ffn_gate.weight - [ 2048, 5632, 1, 1], type = f16, quantizing to iq3_xxs ..
=================================================================
iq3xs_init_impl(grid_size = 256)
iq3xs_init_impl: 18985 neighbours in total
size = 22.00 MiB -> 4.21 MiB
[   6/ 201] blk.0.ffn_up.weight - [ 2048, 5632, 1, 1], type = f16, quantizing to iq3_xxs .. size = 22.00 MiB -> 4.21 MiB
[   7/ 201] blk.0.ffn_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MB
[   8/ 201] blk.0.attn_k.weight - [ 2048, 256, 1, 1], type = f16,
llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.0.attn_k.weight in a very low-bit quantization
main: failed to quantize model from 'My gguf path'
but on b2223 (from llama-b2223-bin-win-cublas-cu12.2.0-x64.zip) it works, so I think something changed between these versions. These changes also prevent all i-series ggufs (such as IQ3_XXS) made with the newer versions from working properly on LM Studio v0.2.16, while i-series ggufs made with b2223's quantize work fine on 0.2.16.
llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.0.attn_k.weight in a very low-bit quantization
Likely your imatrix is messed up - generate a new one:
./imatrix -m models/tinyllama-1b/ggml-model-f16.gguf -f some-data.txt -ngl 99
Thank you, you are right. I used a different txt and b2281 - b2356 to create a new imatrix.dat, and then successfully created TinyLlama-1.1B-IQ3_XXS.gguf.
But the problem remains that none of the i-series gguf models (such as IQ3_XXS) produced by the new versions work properly on LM Studio v0.2.16. Maybe I should wait for an LM Studio update.
I noticed that there are no fluctuations on other quantization types (such as Q3_K_M, Q4_K_S or Q4_0), but there are some variations on smaller non-mixtral models, so I tested many releases of llama.cpp since b2015 (where IQ3_XXS was introduced) on llama-2-7b.Q8_0.gguf:
b2015 | 6.3080 +/- 0.03552 | 2687315648 bytes
...
b2252 | 6.3065 +/- 0.03551 | 2687315648 bytes
b2253 | 6.2925 +/- 0.03533 | 2687315648 bytes
...
b2274 | 6.2927 +/- 0.03533 | 2687315648 bytes
b2275 | 6.3261 +/- 0.03564 | 2585390784 bytes
...
b2314 | 6.3262 +/- 0.03565 | 2585390784 bytes
b2316 | 6.3139 +/- 0.03570 | 2585390784 bytes
...
b2364 | 6.3141 +/- 0.03571 | 2585390784 bytes
The result for llama-2-7b.Q8_0 is sane (the only regression, at b2275, coincides with a decrease in model size), so unfortunately I'd have to test specifically on mixtral, which is going to take a while.
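For anyone repeating this kind of hunt: because each regression shows up as a step change in ppl while everything else is held constant, a binary search over build numbers can pin each step to a single release in O(log n) perplexity runs instead of testing every build. A rough sketch, where the `ppl_for_release` callback is hypothetical (in practice it would download that release, quantize, and run the perplexity tool):

```python
def bisect_regression(releases, ppl_for_release, threshold):
    """Binary-search for the first release whose ppl exceeds `threshold`,
    assuming releases[0] is below it and releases[-1] is above it."""
    lo, hi = 0, len(releases) - 1
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if ppl_for_release(releases[mid]) > threshold:
            hi = mid  # regression is at or before mid
        else:
            lo = mid  # still good, regression is after mid
    return releases[hi]

# Demo with mocked data shaped like the llama-2-7b table: ppl steps up at b2275.
mock = {b: (6.29 if b < 2275 else 6.33) for b in range(2252, 2291)}
first_bad = bisect_regression(sorted(mock), mock.get, threshold=6.31)
```

This assumes a single step inside the searched range, so with multiple regressions you'd repeat the search on each remaining sub-range.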
I managed to finish the mixtral test after a week of effort.
Here's the result using the same imatrix for every version (same as in my OP). The only parameter change from my OP is the number of perplexity chunks, which I reduced to 200 after noticing that you don't need that many chunks to see the variations:
b2015 | 4.8074 +/- 0.04731 | 18308777920 bytes
...
b2136 | 4.8086 +/- 0.04731 | 18308777920 bytes
b2137 | 4.8263 +/- 0.04754 | 18307082176 bytes
...
b2252 | 4.8263 +/- 0.04754 | 18307082176 bytes
b2253 | 4.8394 +/- 0.04753 | 18307082176 bytes
...
b2274 | 4.8393 +/- 0.04753 | 18307082176 bytes
b2275 | 4.9652 +/- 0.04869 | 18238711744 bytes
...
b2286 | 4.9653 +/- 0.04869 | 18238711744 bytes
b2287 | 4.9454 +/- 0.04841 | 18240407488 bytes
...
b2436 | 4.9467 +/- 0.04839 | 18240407488 bytes
Here's the result when recreating the imatrix ex novo for every version:
b2015 | 4.7965 +/- 0.04708 | 18308777920 bytes
...
b2136 | 4.7984 +/- 0.04712 | 18308777920 bytes
b2137 | 4.8194 +/- 0.04738 | 18307082176 bytes
...
b2252 | 4.8195 +/- 0.04738 | 18307082176 bytes
b2253 | 4.8256 +/- 0.04726 | 18307082176 bytes
...
b2274 | 4.8254 +/- 0.04725 | 18307082176 bytes
b2275 | 4.9476 +/- 0.04840 | 18238711744 bytes
...
b2286 | 4.9472 +/- 0.04840 | 18238711744 bytes
b2287 | 4.9296 +/- 0.04814 | 18240407488 bytes
...
b2314 | 4.9296 +/- 0.04814 | 18240407488 bytes
b2316 | 4.9333 +/- 0.04822 | 18240407488 bytes
...
b2436 | 4.9336 +/- 0.04822 | 18240407488 bytes
The ellipsis signifies omission for brevity (= no changes in that range).
Conclusion: unlike llama-2-7b, all the variations found for mixtral are anomalous (they can't be explained by a suitably large change in model size). The variations happened at b2137, b2253, b2275 and b2287, plus an extra one at b2316 for the recalculated imatrix.
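As a sanity check on these step sizes, one can compare each ppl delta against the combined standard errors reported next to it. This is a crude rule of thumb rather than a rigorous test (consecutive runs score the same evaluation text, so their errors are correlated and this is conservative), but it shows the mixtral step at b2275 stands well above the noise compared to the llama-2-7b step at the same boundary:

```python
import math

def z_score(ppl_a, err_a, ppl_b, err_b):
    """Ppl delta expressed in units of the combined standard error.
    Treats the two runs' errors as independent (conservative here)."""
    return abs(ppl_b - ppl_a) / math.hypot(err_a, err_b)

# mixtral b2274 -> b2275 step, recreated imatrix (values from the table above).
z_mixtral = z_score(4.8254, 0.04725, 4.9476, 0.04840)

# llama-2-7b across the same boundary (from the earlier table).
z_llama = z_score(6.2927, 0.03533, 6.3261, 0.03564)
```

With these numbers `z_mixtral` is roughly 1.8 while `z_llama` is under 0.7, i.e. the mixtral jump is much larger relative to its measurement noise.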
Quantizing mixtral to IQ3_XXS using llama.cpp version b2015 instead of recent versions results in a gguf that performs 0.137 ppl better.
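That 0.137 figure is just the recreated-imatrix delta between the first and last rows of the table above; the fixed-imatrix delta comes out similar (4.9467 - 4.8074 ≈ 0.139):

```python
# Recreated-imatrix ppl from the tables above: b2015 vs b2436.
ppl_b2015, ppl_b2436 = 4.7965, 4.9336
delta = ppl_b2436 - ppl_b2015  # about 0.137
```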
I tried b2699 hoping the regressions were fixed along the way:
ikawrakow imatrix:
b2436 | 4.9467 +/- 0.04839 | 18240407488 bytes
b2699 | 4.9473 +/- 0.04839 | 18240407488 bytes
Recalculated imatrix:
b2436 | 4.9336 +/- 0.04822 | 18240407488 bytes
b2699 | 5.3510 +/- 0.05293 | 18240407488 bytes
Alas, quite the contrary: it's the biggest regression yet. We are now a whopping +0.5436 ppl behind b2015 for the recreated imatrix. It just keeps getting worse.
This issue was closed because it has been inactive for 14 days since being marked as stale.
The problem is pretty much still present. I tried on b3334 today:
ikawrakow imatrix:
b3334 | 4.9494 +/- 0.04843 | 18240407648 bytes
Recalculated imatrix:
b3334 | 5.3557 +/- 0.05299 | 18240407840 bytes
Worse yet again.
Also, I noticed a lot of "llama_model_quantize_internal: did not find weights for" log lines. I suspect that at some point since b2436, imatrix generation stopped working completely.
I guess I'll bump this issue every two weeks to prevent the bot from autoclosing it; this is my life now. Tried b3484 today:
ikawrakow imatrix:
b3484 | 4.9538 +/- 0.04846 | 18240407648 bytes
Recalculated imatrix:
b3484 | 5.3537 +/- 0.05293 | 18240407840 bytes
Fixed imatrix worse yet again, recreated ex-novo imatrix still broken.
I have added the bug tag that will prevent the bot from closing the issue. Pointing at the specific PRs that introduced a regression would improve the chances of this being fixed dramatically. If there are multiple regressions, it might be better to create a different issue for each one, with instructions to reproduce the issue with the smallest model possible.
Thanks! A few posts above I tested every release of llama.cpp since b2015 and pointed out the specific releases that introduced regressions. I've since stopped testing every release, as the sheer quantity of llama.cpp releases outpaced the free time I have. I may open an issue for the broken imatrix creation, though.
In other news, I tried b3599 today:
ikawrakow imatrix:
b3599 | 4.9534 +/- 0.04846 | 18240407648 bytes
Recalculated imatrix:
b3599 | 5.3574 +/- 0.05298 | 18240407840 bytes
No change to note since b3484.
In order to fix imatrix creation, I had to recreate the base q8 from the original repo using a new llama.cpp version:
b3680 | 4.9383 +/- 0.04829 | 18242464544 bytes
Alas, this only puts the ppl at the level of b2436, if not slightly worse.
If I quantize this gguf with this imatrix using this command:
and I calculate perplexity with this command:
I get three very different PPL values on three different versions of quantize.exe, everything else being equal:
I suspect that there have been multiple cumulative regression events on the IQ3_XXS quantization implementation between b2037 and b2329.
cu12.2.0 on Windows 10.