ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: QWEN2 quantization GGML_ASSERT #7805

Closed · bartowski1182 closed this issue 1 week ago

bartowski1182 commented 3 months ago

What happened?

When attempting to quantize Qwen2 7B Instruct to IQ2_XS, I get the following assert:

GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0

Anything I can provide to help debug? Uploading the f32 file and imatrix now for recreation.

Attempting IQ2_S now, will update if it fails in the same way. Update: it fails in the same way on the same block.

Name and Version

Version b3086, Ubuntu 22.04

What operating system are you seeing the problem on?

Linux

Relevant log output

[ 327/ 339]              blk.27.attn_norm.weight - [ 3584,     1,     1,     1], type =    f32, size =    0.014 MB
[ 328/ 339]               blk.27.ffn_down.weight - [18944,  3584,     1,     1], type =    f32, converting to iq2_xs .. GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0
GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0
GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0
GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0
GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0
GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0
GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0
GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0
GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0
GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0
GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0
GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0
GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0
GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0
GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0
GGML_ASSERT: ggml-quants.c:12083: grid_index >= 0

(PS: is this high severity, or medium/low?)

grapevine-AI commented 3 months ago

Hello. I found that if you make a Q6_K quant first and then re-quantize from it, you can make i-quants with an imatrix. Q6_K is extremely high quality, so I think this method probably produces faultless quants. In fact, my i-quants seem to work correctly. Would you please help verify?

mann1x commented 3 months ago

Would you please help verify?

Do you have your quants uploaded somewhere?

grapevine-AI commented 3 months ago

My example is here. However, please note that I used both English and Japanese text, because I am not sure what data English speakers typically use.

bartowski1182 commented 3 months ago

Ignore this, it's because I have a P100; bad report.

Hmm, this is odd...

Trying to run a Qwen2 quant (https://huggingface.co/bartowski/Tess-v2.5-Qwen2-72B-GGUF/blob/main/Tess-v2.5-Qwen2-72B-Q2_K.gguf) with GPU offloading yields a new assert:

GGML_ASSERT: ggml-cuda/dmmv.cu:665: false

Based on the assert, my guess is that it's because ffn_down.weight was quantized to IQ4_NL? But obviously I'm not positive. Not offloading works fine; redownloading Q4_K_M to test, since I see that didn't use IQ4_NL.

@slaren any idea why this wasn't an issue in the past? Is there something special in Qwen2 that makes it want to use those quant types? Also, it's strange because I thought IQ quants were fully supported on CUDA, so I don't get why those asserts exist.

Edit: as suspected, no issues with Q4_K_M.

bartowski1182 commented 3 months ago

Also gonna pull @JohannesGaessler back in, since he seems to know this area very well.

JohannesGaessler commented 3 months ago

GGML_ASSERT: ggml-cuda/dmmv.cu:665: false

If you are on a P100, or Maxwell or older, that is expected. For all quants other than the legacy quants and k-quants there is only an MMVQ implementation (which needs Pascal != P100, or newer) but no DMMV implementation. If you are using a GPU that should be supported, that is a bug.
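(For illustration, a conceptual C++ sketch of the support rule described above; this is not the actual CUDA backend code, and the function and parameter names are made up:)

// Legacy and k-quants have both DMMV and MMVQ kernels; all other quant types
// only have MMVQ, which relies on the GPU's fast per-byte integer dot product.
bool mat_vec_kernel_available(bool is_legacy_or_k_quant, bool gpu_has_byte_dot_product) {
    if (is_legacy_or_k_quant) {
        return true;                     // DMMV covers GPUs without the fast dot product
    }
    return gpu_has_byte_dot_product;     // everything else needs the MMVQ path
}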

bartowski1182 commented 3 months ago

Ah dammit, thank you. That's silly of me; I have a 3090 and a P100 and this pushed it onto that card.

You say P100 or Maxwell/older; does that imply the P40 is fine?

slaren commented 3 months ago

Is there something special in Qwen2 that makes it want to use those quant types?

Most k- and iq-quants have a block size of 256, but the ffn_down tensor in this model has a dimension that is not divisible by 256. IQ4_NL is used as the fallback for IQ4_XS and lower, since it has a block size of 32. For higher quants, Q5_x or Q8_0 is used as the fallback instead, which is compatible with this GPU.
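(As a rough illustration of that fallback rule, a conceptual C++ sketch; the type names and the "IQ4_XS and lower" split are taken from the explanation above, everything else is simplified and not the actual quantize code:)

#include <cstdint>

enum class QType { IQ2_XS, IQ3_XXS, IQ4_XS, IQ4_NL, Q5_0, Q8_0 };

// Super-block (256-wide) k/iq types only apply when the row size divides
// evenly by 256; otherwise a 32-wide block type is substituted.
QType pick_type(QType requested, int64_t row_size) {
    if (row_size % 256 == 0) {
        return requested;                        // no fallback needed
    }
    const bool low_bit = requested == QType::IQ2_XS ||
                         requested == QType::IQ3_XXS ||
                         requested == QType::IQ4_XS;
    return low_bit ? QType::IQ4_NL               // 32-wide non-linear 4-bit type
                   : QType::Q8_0;                // or a Q5_x type for mid-range requests
}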

JohannesGaessler commented 3 months ago

You say P100 or Maxwell/older; does that imply the P40 is fine?

Yes, P100s have compute capability 6.0, while all other Pascal GPUs (including P40s) have compute capability 6.1, which is the minimum CC for the __dp4a instruction (per-byte dot product).
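(For context, this is what __dp4a computes, emulated in plain C++; the real thing is a single GPU instruction, and this host-side version only illustrates the per-byte dot product semantics:)

#include <cstdint>
#include <cstring>

// Emulate the signed __dp4a(a, b, c): treat a and b as four packed int8
// values, multiply element-wise, sum, and add the 32-bit accumulator c.
int32_t dp4a_emulated(int32_t a, int32_t b, int32_t c) {
    int8_t va[4], vb[4];
    std::memcpy(va, &a, 4);
    std::memcpy(vb, &b, 4);
    for (int i = 0; i < 4; ++i) {
        c += int32_t(va[i]) * int32_t(vb[i]);
    }
    return c;
}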

bartowski1182 commented 3 months ago

Thanks both slaren and Johannes, that's good info, I appreciate it :D

Back on the original subject of this issue: is there a reason your proposed change wasn't opened as a PR? I can open one if that would help; I'm just unsure if the proposed fix was deemed not appropriate.

mann1x commented 3 months ago

Would you please help verify?

I'm testing the IQ3_XXS and it seems to work very well. I also tried to create the imatrix from the Q6_K, but it didn't work; always NaN.

slaren commented 3 months ago

is there a reason your proposed change wasn't opened as a PR?

Do you mean the fix to use fp32 precision for the attention? I didn't open a PR because the fix would affect all models with the qwen2 architecture, and I recall reading that this model has other issues. If this model is not very good, it may not be worth applying a patch that will decrease the performance of all models with the qwen2 architecture. But I may be wrong about that; if you think it is worth it, the fix could be merged.
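(For anyone curious, a minimal sketch of what such a patch could look like, assuming the usual ggml graph-building flow; build_kq and the tensor names are placeholders, not the exact llama.cpp code:)

#include "ggml.h"

// Force the K*Q matmul to accumulate in FP32 so that large activations do not
// overflow FP16 on the GPU; this is the kind of one-line change being discussed.
static struct ggml_tensor * build_kq(struct ggml_context * ctx,
                                     struct ggml_tensor * k,
                                     struct ggml_tensor * q) {
    struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
    ggml_mul_mat_set_prec(kq, GGML_PREC_F32); // opt in to FP32 precision
    return kq;
}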

bartowski1182 commented 3 months ago

The thing is, we now have a lot of fine-tunes of Qwen2 72B coming out, and presumably they all have this issue (I haven't re-verified), so I figured it would make sense. But maybe it is worth double-checking; I didn't realize there would be a big performance hit.

grapevine-AI commented 2 months ago

I'm testing the IQ3_XXS and it seems to work very well. I also tried to create the imatrix from the Q6_K, but it didn't work; always NaN.

Thanks for testing. Sorry, I simplified the explanation too much, so I'll try to make it clearer. I have described the details in the README of my HF repo.

If anyone is interested in it, I would greatly appreciate it if you could attempt to reproduce the steps.

mann1x commented 2 months ago

If anyone is interested in it, I would greatly appreciate it if you could attempt to reproduce the steps.

Thanks! Still have the f32, will try again.

mann1x commented 2 months ago

If anyone is interested in it, I would greatly appreciate it if you could attempt to reproduce the steps.

Tried creating the imatrix from Q8_0, but it came out the same as with f32; I get NaN when quantising.

grapevine-AI commented 2 months ago

Tried creating the imatrix from Q8_0, but it came out the same as with f32; I get NaN when quantising.

Thank you!

Oh no. I wonder if this method depends on the imatrix dataset... I'll research other-language text data.

CISC commented 2 months ago

@grapevine-AI Your imatrix is corrupt BTW:

imatrix entry "blk.20.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.18.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.11.ffn_up_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.15.ffn_down_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.13.ffn_down_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.3.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.12.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.12.ffn_gate_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.21.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.9.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.13.ffn_up_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.7.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.0.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.27.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.13.ffn_gate_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.2.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.4.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.12.ffn_up_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.1.ffn_up_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.6.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.14.ffn_down_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.11.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.16.ffn_up_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.17.ffn_up_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.14.ffn_up_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.19.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.16.ffn_gate_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.5.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.11.ffn_gate_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.15.ffn_up_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.17.ffn_gate_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.10.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.14.ffn_gate_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.8.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.15.ffn_gate_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.1.ffn_gate_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.1.ffn_down_exps.weight" contains non-normal value -nan, skipping!
imatrix entry "blk.16.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!
imatrix entry "blk.17.ffn_down_exps.weight" contains non-normal value 0.000000, skipping!

I have made some progress though: by quantizing the BF16 to F16 and flushing anything <±1e-24 to ±0, I can now make fully working quants (with #7825 and #7955 applied) that do not require Flash Attention! However, creating a fully activated imatrix still does not seem feasible...
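(A rough sketch of that flushing step, assuming the weights pass through 32-bit floats on the way from BF16 to F16; the function name and threshold handling are illustrative, not the actual tooling used:)

#include <cmath>

// Flush weights with magnitude below ~1e-24 to (signed) zero before the F16
// conversion; everything else passes through unchanged.
float flush_tiny(float w, float threshold = 1e-24f) {
    return std::fabs(w) < threshold ? std::copysign(0.0f, w) : w;
}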

flastir commented 2 months ago

It appears that the Qwen2-72B model stopped functioning correctly after release b3091.

Result in the release b3091:

User: who are you?
Llama: I am Llama, a friendly and helpful chatbot designed to assist you with any questions or tasks you might have. My goal is to make your day easier by providing accurate information and engaging in meaningful conversations.

Result in the next release b3130:

User: who are you?
Llama: !t 10.
The- .4s híst, and of
A) ()/b0–all.
...

In the latest release I checked, b3291, the model still doesn't work correctly.

The GGUF model was downloaded from here.

I started the server from llama-b3091-bin-win-cuda-cu12.2.0-x64.zip with this command:

server.exe -m Qwen2-72B-Instuct-Q5_K_M-00001-of-00002.gguf -c 2048 -ngl 0 -fa

The GGUF works normally in KoboldCpp v1.69.1 only if "Use FlashAttention" is checked and "Use QuantMatMul (mmq)" is unchecked.

Is there an argument to server.exe that disables MMQ?

JohannesGaessler commented 2 months ago

The GGUF works normally in KoboldCpp v1.69.1 only if "Use FlashAttention" is checked and "Use QuantMatMul (mmq)" is unchecked.

Is there an argument to server.exe that disables MMQ?

If at all possible, please try to reproduce the issue using only llama.cpp code and use git bisect to identify the exact commit that introduced the problem. You can disable MMQ by compiling with GGML_CUDA_FORCE_CUBLAS.

Also, this very much sounds like a different problem than what was discussed previously here. Instead of commenting on an existing issue, please open a new issue. Otherwise there is a large risk that your issue will not get attention from the right people.

flastir commented 2 months ago

Also, this very much sounds like a different problem than what was discussed previously here. Instead of commenting on an existing issue, please open a new issue. Otherwise there is a large risk that your issue will not get attention from the right people.

Thank you! I've found the appropriate issue for my problem: Issue #8025

LostRuins commented 1 month ago

Necro-ing this thread, but just chipping in: IQ4_NL is probably not an ideal fallback, as it breaks compatibility with any backend that does not have I-Quant support (e.g. Vulkan), and leaves people unsure why certain K-quants just don't work when they're supposed to.

I'd rather Q4_0 or Q8_0 be the default fallback for k-quants. This is just my 2 cents.

ggerganov commented 1 month ago

IQ4_NL is probably not an ideal fallback, as it breaks compatibility with any backend that does not have I-Quant support (e.g. Vulkan), and leaves people unsure why certain K-quants just don't work when they're supposed to.

Hm, it shouldn't break compatibility: the current implementation will fall back to CPU computation when the backend does not support a certain type. It would be slow, but it would still work.

github-actions[bot] commented 1 week ago

This issue was closed because it has been inactive for 14 days since being marked as stale.