Open ghthor opened 8 months ago
Hey - the naming convention of "q8_0" is primarily due to legacy reasons - it doesn't necessarily mean that only q8_0
quantized models can be loaded.
The approach you took with the registry is actually the recommended method for loading different model checkpoints. Feel free to name your model something like {ModelName}-q4
to distinguish it.
Related discussion: https://github.com/TabbyML/tabby/issues/1398
You should just add this to the documentation and close this, it's not really an issue. I ran FP16, Q6_K, Q5_M all using the same `q8_0.v2.gguf` file name without any issue. IMO though, going lower than Q8 will affect the quality of the model noticeably, i.e. StarCoder2 15B Q8 has noticeably better code quality and accuracy than the Q6, even with context.
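For anyone who wants to reproduce this, a minimal sketch of the manual swap (the cache path below assumes Tabby's default model directory layout and is illustrative only; adjust for your install):

```bash
# Hypothetical paths -- Tabby's model cache layout may differ per install.
cd ~/.tabby/models/TabbyML/StarCoder2-15B/ggml/

# Keep the original Q8 file around, then drop a different quant in its place.
mv q8_0.v2.gguf q8_0.v2.gguf.q8.bak
cp ~/Downloads/starcoder2-15b-Q6_K.gguf q8_0.v2.gguf

# llama.cpp reads the quantization from the GGUF metadata itself,
# so the legacy filename is cosmetic.
```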
Please describe the feature you want
Currently it appears that Tabby internally assumes all models use Q8 quantization, but that doesn't actually seem to be a requirement. I forked the registry and modified a Q8 entry to instead download a Q4_K_M quantization of the DeepSeek 6.7B model, since I needed a smaller RAM footprint to run on my NVIDIA 2080 SUPER.
Tabby still downloads the model to the file name `q8_0.v2.gguf`, but the sha256sum matches the Q4_K_M.gguf that I overloaded in my fork of registry-tabby. Once I performed this "override" via my fork, I was able to load the model without issue, as llama-cpp doesn't require that we use only Q8 models.
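A quick sanity check that the override actually took effect (the paths and model directory layout here are assumptions, not exact values from my setup):

```bash
# Hash the file Tabby downloaded under the legacy name...
sha256sum ~/.tabby/models/TabbyML/DeepseekCoder-6.7B/ggml/q8_0.v2.gguf

# ...and compare against the Q4_K_M artifact referenced by the forked registry.
sha256sum deepseek-coder-6.7b.Q4_K_M.gguf

# Matching digests mean Tabby is serving the Q4_K_M weights
# despite the q8_0 filename.
```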
I think this would probably require an additional field in the registry-tabby JSON structure that would allow Tabby to map the model file to a different filename; in addition, we couldn't hardcode the model filename as has been done here.
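As a rough sketch, a registry entry could carry the target filename explicitly. The `filename` key below is the hypothetical new field, and the URL and checksum are placeholders rather than real values:

```jsonc
{
  "name": "DeepseekCoder-6.7B-Q4",
  "urls": [
    // placeholder URL -- point at whichever quantized GGUF you actually want
    "https://huggingface.co/<org>/<repo>/resolve/main/deepseek-coder-6.7b.Q4_K_M.gguf"
  ],
  "sha256": "<checksum-of-the-q4_k_m-file>",
  // hypothetical new field: tells Tabby what to name the downloaded file
  // instead of hardcoding q8_0.v2.gguf
  "filename": "q4_k_m.gguf"
}
```

Tabby could fall back to the current hardcoded `q8_0.v2.gguf` whenever the field is absent, so existing registry entries would keep working unchanged.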
Implementation details aside, my main point is that llama-cpp supports loading ggml/gguf models other than Q8, and it would be nice if Tabby supported this without the ugly registry hack that I've done.

Additional context
Please reply with a 👍 if you want this feature.