From my own experience with text generation models, I have found that quantizing the output and embed tensors to f16 and the other tensors to q6_k (or q5_k) gives smaller files and better results than quantizing everything to q8_0.
Usually in my tests, I quantize the output and embed tensors to f16 and the inner ones to q5_k, q6_k, or q8_0.
I then test the results using llama.cpp, which is quite fast even on CPU only.
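For reference, this is roughly the kind of command I use with llama.cpp's llama-quantize tool. The per-tensor overrides (`--output-tensor-type`, `--token-embedding-type`) and the file names below are from my own setup, so treat them as an example and check `llama-quantize --help` on your build:

```sh
# Quantize the inner tensors to Q6_K while keeping the output and
# token-embedding tensors at f16 (flag names assumed from a recent
# llama.cpp build; file names are just placeholders).
./llama-quantize \
  --output-tensor-type f16 \
  --token-embedding-type f16 \
  model-f32.gguf model-q6_k-f16.gguf Q6_K
```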
Could you please add this feature to ctensor2 too?