ggerganov / llama.cpp

LLM inference in C/C++
MIT License

quantize to F32/F16/Q8_0 can result in a Q6_K output tensor #5818

Closed. cebtenzzre closed this issue 6 months ago.

cebtenzzre commented 6 months ago

Running quantize with a target dtype of F32, F16, or Q8_0 can produce a Q6_K output tensor unless --pure is passed (ref https://github.com/ggerganov/llama.cpp/pull/5631#issuecomment-1965055798). This is surprising: I would expect converting to F32 and then quantizing to F16 to produce results similar to converting directly to F16.

I suggest that the k-quant mixture logic should never decrease the quality of the output tensor relative to the requested target type, only increase it. A sketch of that policy follows.
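A minimal illustration of the suggested rule, not the actual llama.cpp implementation: the enum, the quality ranking, and the function names below are hypothetical stand-ins used only to show "upgrade-only" type selection for the output tensor.

```cpp
#include <cstdio>

// Hypothetical, simplified stand-in for ggml's tensor types; not the real enum.
enum class TensorType { Q4_K, Q5_K, Q6_K, Q8_0, F16, F32 };

// Illustrative quality ranking: a higher value means higher fidelity.
static int quality_rank(TensorType t) {
    switch (t) {
        case TensorType::Q4_K: return 0;
        case TensorType::Q5_K: return 1;
        case TensorType::Q6_K: return 2;
        case TensorType::Q8_0: return 3;
        case TensorType::F16:  return 4;
        case TensorType::F32:  return 5;
    }
    return -1;
}

// Suggested policy: the mixture logic may upgrade the output tensor,
// but must never pick a type of lower quality than the requested target.
static TensorType pick_output_type(TensorType requested, TensorType mixture_choice) {
    return quality_rank(mixture_choice) >= quality_rank(requested)
        ? mixture_choice
        : requested;
}

int main() {
    // Target F16: a Q6_K suggestion from the mixture logic would be rejected,
    // so the output tensor stays at F16 instead of being downgraded.
    TensorType t = pick_output_type(TensorType::F16, TensorType::Q6_K);
    std::printf("chosen type rank: %d (expect the F16 rank, 4)\n",
                quality_rank(t));
    return 0;
}
```

Under this rule, a target of Q2_K could still upgrade the output tensor to Q6_K as before, but a target of F32, F16, or Q8_0 would never be lowered to Q6_K.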

cebtenzzre commented 6 months ago

Fixed by ee35600b9061b1ea0c4ea87fce6844297632b2a8