ggerganov / llama.cpp

LLM inference in C/C++
MIT License

[feature request] conversion to gguf in a more pure form. #8086

Open 0wwafa opened 5 days ago

0wwafa commented 5 days ago

Hello. Usually when quantizing, I first convert a Hugging Face model to an F16 GGUF and then quantize that into my own quantizations. I have noticed that the convert script does not produce a "pure" F16. I think there should be a flag, as in the quantize program, to allow a pure F16 (all tensors) or a pure BF16 conversion.
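For context, here is a rough sketch of that two-step flow driven from Python. The script and binary names (convert-hf-to-gguf.py, llama-quantize), the flags, and the paths are assumptions that may differ between llama.cpp versions:

```python
# Rough sketch of the convert-then-quantize flow described above (not the exact
# commands from this issue); script/binary names and paths are assumptions that
# vary by llama.cpp version and setup.
import subprocess

hf_model_dir = "path/to/hf-model"   # placeholder: local Hugging Face model directory
f16_gguf = "model-f16.gguf"
quant_gguf = "model-Q5_K_M.gguf"

# Step 1: convert the Hugging Face model to an F16 GGUF
subprocess.run(
    ["python", "convert-hf-to-gguf.py", hf_model_dir,
     "--outtype", "f16", "--outfile", f16_gguf],
    check=True,
)

# Step 2: quantize the F16 GGUF to the desired quantization type
subprocess.run(
    ["./llama-quantize", f16_gguf, quant_gguf, "Q5_K_M"],
    check=True,
)
```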

compilade commented 5 days ago

I have noticed that the convert script does not produce a "pure" F16.

Do you mean that some tensors are in F32 in the resulting GGUF model? These are usually 1D tensors, which are very small anyway. (BTW, even llama-quantize --pure ... keeps 1D tensors as F32.)

Some of the ggml operators used on 1D tensors (currently) only work on F32 tensors (e.g. ggml_norm), so a pure F16 GGUF model would not work without modifications in ggml.c.
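For anyone who wants to check which tensors end up in F32, here is a minimal sketch using the gguf Python package from gguf-py (the model path is a placeholder):

```python
# Minimal sketch: list the tensors kept in F32 in a GGUF file,
# using the gguf Python package from gguf-py (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("model-f16.gguf")  # placeholder path

f32_tensors = []
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum; .name is e.g. "F32", "F16", "Q6_K"
    if tensor.tensor_type.name == "F32":
        f32_tensors.append(
            (tensor.name, tuple(int(d) for d in tensor.shape), int(tensor.n_elements))
        )

total_f32 = sum(n for _, _, n in f32_tensors)
print(f"{len(f32_tensors)} tensors kept in F32 ({total_f32} elements total)")
for name, shape, n_elements in f32_tensors:
    print(f"  {name}  shape={shape}  elements={n_elements}")
```

On a typical F16 conversion this lists only the small 1D tensors (e.g. the various norm weights), which is why they barely affect the overall file size.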

Is there a particular reason why you'd like extremely "pure" conversions?

0wwafa commented 4 days ago

Is there a particular reason why you'd like extremely "pure" conversions?

Well, no. I mean, I wanted to compare a "pure" F16 against my own quants (which are a mix of F16 and Q5 or Q6). They seem to be smaller at essentially no cost, with almost no degradation. You can find those quants on my Hugging Face profile page under models: https://huggingface.co/ZeroWw