TabbyML / tabby

Self-hosted AI coding assistant
https://tabbyml.com

Enable using different quantization of models #1615

Open ghthor opened 8 months ago

ghthor commented 8 months ago

Please describe the feature you want

Currently it appears that tabby internally assumes all models use Q8 quantization, but that doesn't have to be the case. I forked the registry and modified a Q8 entry to instead download a Q4_K_M build of the DeepseekCoder-6.7B model, since I needed lower RAM usage to run on my NVIDIA 2080 SUPER.

Tabby still downloads the model to a file named q8_0.v2.gguf, but the sha256sum matches the Q4_K_M.gguf that I substituted in my fork of registry-tabby.

✦ ❯ jq .[-1] /home/ghthor/.tabby/models/ghthor/models.json
{
  "name": "DeepseekCoder-6.7B",
  "prompt_template": "<|fim▁begin|>{prefix}<|fim▁hole|>{suffix}<|fim▁end|>",
  "urls": [
    "https://huggingface.co/TheBloke/deepseek-coder-6.7B-base-GGUF/resolve/main/deepseek-coder-6.7b-base.Q4_K_M.gguf"
  ],
  "sha256": "28cef03e1b2d2478dafdb09f1520417cab55efcd3d1cc22bb1950c90bcd8804b"
}

Mon Mar  4 10:45:14 2024 exit 0 🟢 took 2s
registry-tabby on  main 
✦ ❯ find ~/.tabby/models/ghthor/DeepseekCoder-6.7B/ggml/
/home/ghthor/.tabby/models/ghthor/DeepseekCoder-6.7B/ggml/
/home/ghthor/.tabby/models/ghthor/DeepseekCoder-6.7B/ggml/q8_0.v2.gguf

Mon Mar  4 10:45:17 2024 exit 0 🟢 took 2s
registry-tabby on  main 
✦ ❯ sha256sum /home/ghthor/.tabby/models/ghthor/DeepseekCoder-6.7B/ggml/q8_0.v2.gguf
28cef03e1b2d2478dafdb09f1520417cab55efcd3d1cc22bb1950c90bcd8804b  /home/ghthor/.tabby/models/ghthor/DeepseekCoder-6.7B/ggml/q8_0.v2.gguf

Mon Mar  4 10:45:22 2024 exit 0 🟢 took 4s

Once I performed this "override" via my fork of registry-tabby, I was able to load the model without issue, since llama.cpp doesn't require Q8 models.

I think this would probably require adding a field to the registry-tabby JSON structure that lets tabby map a model to a different on-disk filename; the model filename could then no longer be hardcoded the way it is today.
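
For illustration only, such an entry might look like the sketch below. The ggml_filename field is purely hypothetical on my part and not part of the current registry schema; whatever it ends up being called, the idea is that the registry, not tabby, decides the local filename:

{
  "name": "DeepseekCoder-6.7B",
  "prompt_template": "<|fim▁begin|>{prefix}<|fim▁hole|>{suffix}<|fim▁end|>",
  "urls": [
    "https://huggingface.co/TheBloke/deepseek-coder-6.7B-base-GGUF/resolve/main/deepseek-coder-6.7b-base.Q4_K_M.gguf"
  ],
  "sha256": "28cef03e1b2d2478dafdb09f1520417cab55efcd3d1cc22bb1950c90bcd8804b",
  "ggml_filename": "q4_k_m.v2.gguf"
}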

Implementation details aside, my main point is that llama.cpp supports loading GGUF models with quantizations other than Q8, and it would be nice if tabby supported this without the ugly registry hack I've described above.

Additional context

  1. https://github.com/ghthor/registry-tabby/commit/3c36622569b0e2f5f887ff3dd48d1a13f8ab7c04
  2. https://github.com/ghthor/registry-tabby/commit/7762e8152c91bafde29b1daea602efea92b2c0f3

Please reply with a 👍 if you want this feature.

wsxiaoys commented 8 months ago

Hey - the naming convention of "q8_0" is primarily due to legacy reasons - it doesn't necessarily mean that only q8_0 quantized models can be loaded.

The approach you took with the registry is actually the recommended method for loading different model checkpoints. Feel free to name your model something like {ModelName}-q4 to distinguish it.
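
As a minimal sketch (reusing the fields from the entry above), a forked registry's models.json could carry an entry like:

{
  "name": "DeepseekCoder-6.7B-q4",
  "prompt_template": "<|fim▁begin|>{prefix}<|fim▁hole|>{suffix}<|fim▁end|>",
  "urls": [
    "https://huggingface.co/TheBloke/deepseek-coder-6.7B-base-GGUF/resolve/main/deepseek-coder-6.7b-base.Q4_K_M.gguf"
  ],
  "sha256": "28cef03e1b2d2478dafdb09f1520417cab55efcd3d1cc22bb1950c90bcd8804b"
}

Assuming the fork is published under your own GitHub account, it can then be referenced with something like tabby serve --model <your-user>/DeepseekCoder-6.7B-q4. The file will still be stored locally as q8_0.v2.gguf, but as noted, that's only a naming convention.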

Related discussion: https://github.com/TabbyML/tabby/issues/1398

rudiservo commented 8 months ago

You should just add this to the documentation and close this; it's not really an issue. I ran FP16, Q6_K, and Q5_M all using the same q8_0.v2.gguf file name without any problem. IMO, though, going lower than Q8 noticeably affects model quality, e.g. starcoder2 15b at Q8 has noticeably better code quality and accuracy than at Q6, even with context.