OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

[feature request] Mixed quantizations. #1730

Open · 0wwafa opened 2 months ago

0wwafa commented 2 months ago

From my own experience with text-generation models, I have found that quantizing the output and embedding tensors to F16 and the other tensors to q6_k (or q5_k) gives smaller files and better results than quantizing everything to q8_0.

In my tests I usually quantize the output and embedding tensors to F16 and the inner tensors to q5_k, q6_k, or q8_0.

I then test the results using llama.cpp, which is quite fast even on CPU only.
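For illustration, here is a minimal NumPy sketch of the recipe (this is not the CTranslate2 API; the tensor-name patterns are hypothetical, and a simple symmetric int8 quantization stands in for the q5_k/q6_k block formats): tensors whose names match the embedding/output patterns stay in F16, and everything else is quantized.

```python
import numpy as np

# Hypothetical name patterns for the tensors kept in full F16 precision.
KEEP_F16 = ("embed", "output")

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: w is approximated by q * scale."""
    scale = float(np.abs(w).max()) / 127.0
    scale = scale if scale > 0.0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def mixed_quantize(state_dict: dict[str, np.ndarray]) -> dict[str, object]:
    """Keep embedding/output tensors in F16, quantize the rest to int8."""
    out: dict[str, object] = {}
    for name, w in state_dict.items():
        if any(pat in name for pat in KEEP_F16):
            out[name] = w.astype(np.float16)  # full-precision path
        else:
            out[name] = quantize_int8(w)      # (int8 weights, scale)
    return out
```

The point is only the split: full precision where quantization error hurts most (the token embeddings and the output projection), aggressive quantization everywhere else.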

Can you please add this feature to CTranslate2 too?

0wwafa commented 2 months ago

You can find my quantizations (in GGUF format) on Hugging Face at https://huggingface.co/ZeroWw if you are interested.