Hello!
I did some research (using llama.cpp) and found that quantizing the input and embedding tensors to f16 while quantizing the other tensors to q5_k or q6_k gives excellent results, almost indistinguishable from pure f16 at roughly half the size.
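For concreteness, this is roughly the recipe I use on the llama.cpp side (a minimal sketch, assuming a recent build whose llama-quantize tool supports the per-tensor-type overrides; the file names are placeholders):

```python
# Sketch of the llama.cpp quantization step described above, driven from Python.
# Assumptions: llama-quantize is built locally and exposes the
# --token-embedding-type / --output-tensor-type overrides; paths are placeholders.
import subprocess

subprocess.run(
    [
        "./llama-quantize",
        "--token-embedding-type", "f16",   # keep token embeddings at f16
        "--output-tensor-type", "f16",     # keep the output tensor at f16
        "model-f16.gguf",                  # input: full-precision gguf
        "model-q6_k-mixed.gguf",           # output: mixed-precision gguf
        "q6_k",                            # quantization type for the remaining tensors
    ],
    check=True,
)
```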
Is it possible to do the same with bitsandbytes/transformers, i.e. produce a model quantized in this way from a regular model?
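The closest thing I can see is excluding certain modules from quantization via BitsAndBytesConfig. Below is a minimal sketch of what I mean, assuming llm_int8_skip_modules also applies to 4-bit loading and that the module names embed_tokens/lm_head match the target architecture; bitsandbytes has no direct q5_k/q6_k equivalent, so this uses 4-bit NF4 for the remaining linear layers (and as far as I can tell bitsandbytes only converts nn.Linear modules, so the embedding likely stays in f16 regardless):

```python
# A sketch, not a confirmed recipe: keep the embedding and output head in f16
# and quantize the remaining linear layers with bitsandbytes NF4.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # closest bnb analogue to a k-quant
    bnb_4bit_compute_dtype=torch.float16,
    # Assumption: these module names match the model; adjust per architecture.
    llm_int8_skip_modules=["embed_tokens", "lm_head"],
)

model = AutoModelForCausalLM.from_pretrained(
    "your-model-id",                       # placeholder model id
    quantization_config=bnb_config,
    torch_dtype=torch.float16,             # non-quantized modules stay in f16
    device_map="auto",
)
```

If I understand correctly, recent transformers versions can serialize such a 4-bit model with model.save_pretrained(...), but this quantizes on load rather than producing a standalone file the way gguf does. Is this the intended way, or is there a better approach?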
You can find my (gguf) quantizations at https://huggingface.co/ZeroWw for reference.
Thanks.