Closed: bartowski1182 closed this 4 months ago
If it is not too much trouble, can you upload the f32 model that you used? I don't think the imatrix matters here since the token embeddings don't use it.
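To illustrate why the imatrix shouldn't matter for this tensor: the importance data is collected from matrix-multiply activations, and the token embedding tensor is read by row lookup rather than multiplied, so it never has imatrix entries and gets quantized the plain way either way. A minimal C++ sketch of that dispatch idea only, not llama.cpp's actual code; the quantizer functions here are placeholder stubs:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// placeholder quantizers for illustration only (the real ones produce quant blocks)
static std::vector<uint8_t> quantize_plain(const std::vector<float> & data) {
    return std::vector<uint8_t>(data.size()); // stub
}
static std::vector<uint8_t> quantize_weighted(const std::vector<float> & data,
                                              const std::vector<float> & importance) {
    (void) importance;
    return std::vector<uint8_t>(data.size()); // stub
}

static std::vector<uint8_t> quantize_tensor(
        const std::string & name,
        const std::vector<float> & data,
        const std::map<std::string, std::vector<float>> & imatrix) {
    auto it = imatrix.find(name);
    if (it == imatrix.end()) {
        // no activation statistics recorded for this tensor
        // (e.g. token_embd.weight, which is used via row lookup, not matmul)
        return quantize_plain(data);
    }
    return quantize_weighted(data, it->second);
}
```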
@slaren no problem at all, uploaded here:
@slaren it happened again with the granite 34B model and Q2_K with your changes (b2928)
I'm out so I don't have good access to my logs; the f32 will go up in a couple of hours and I'll link you to it. Just figured I'd let you know in advance.
@slaren f32 going up here:
https://huggingface.co/bartowski/granite-34b-code-instruct-GGUF
It failed on others too (Q3_K_S); I'm not sure how many would fail, but they fail in the same way.
I can grab the log if that would help; I've moved on to other things in the meantime.
Any chance that bf16 or f16 wouldn't face this issue?
I don't think so; the tensors are converted to f32 before being quantized regardless.
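An illustration of that point only, not llama.cpp's actual code path: whatever the source type is, the values are widened to f32 before the quantizer sees them, so an f16 vs bf16 source would not change the result. The quantizer entry point below is a stand-in stub:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// bf16 -> f32 is just the 16 bits placed in the top half of a 32-bit float
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = uint32_t(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// stand-in for the real quantizer entry point, which only takes f32 input
static void quantize_row_from_f32(const float * src, void * dst, int64_t n) {
    (void) src; (void) dst; (void) n; // stub
}

static void quantize_bf16_tensor(const uint16_t * src, void * dst, int64_t n) {
    std::vector<float> tmp(n);
    for (int64_t i = 0; i < n; ++i) {
        tmp[i] = bf16_to_f32(src[i]);          // widen to f32 first
    }
    quantize_row_from_f32(tmp.data(), dst, n); // the quantizer never sees bf16
}
```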
I didn't get any errors when quantizing to Q3_K_S. It may depend on the imatrix being used; can you upload that too?
Uploaded
Using b2854
Converted Hermes-2-Theta-Llama-3-8B to F32, then measured imatrix with https://gist.github.com/bartowski1182/b6ac44691e994344625687afe3263b3a
When quantizing, all sizes work fine except for IQ4_NL, which produces this output:
When I refer to "all quants", I mean that all of these work fine:
IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, Q2_K, IQ3_XXS, IQ3_XS, IQ3_S, IQ3_M, Q3_K_S, Q3_K_M, Q3_K_L, IQ4_XS, Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K, Q8_0
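For reference, a minimal sketch of how one of these targets could be reproduced through llama.cpp's C API (the quantize tool normally drives this, and the imatrix file is supplied to that tool rather than through this struct); the file paths here are hypothetical:

```cpp
#include <cstdio>
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.ftype   = LLAMA_FTYPE_MOSTLY_IQ4_NL; // the type that fails here
    params.nthread = 8;

    // returns 0 on success
    const uint32_t rc = llama_model_quantize(
        "Hermes-2-Theta-Llama-3-8B-f32.gguf",    // hypothetical input path
        "Hermes-2-Theta-Llama-3-8B-IQ4_NL.gguf", // hypothetical output path
        &params);

    llama_backend_free();

    if (rc != 0) {
        fprintf(stderr, "quantization failed (rc=%u)\n", rc);
        return 1;
    }
    return 0;
}
```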