a-alhusaini closed this issue 10 months ago
This is a feature that will be added directly to Nx/EXLA. Currently WIP so stay tuned. :)
Hi there! I was wondering what the current status of this was; I found https://elixirforum.com/t/high-scale-performance-of-llms-needed-features/58562 but couldn't find any tracking issues for Bumblebee or NX.
I think quantization support is critical to making Bumblebee a viable option for developers/deployment; it makes it actually tractable to run larger models on consumer hardware.
Thanks in advance!
We are currently updating XLA in Nx, and the new version supports quantization. :) So hopefully sooner rather than later!
Is there an issue tracking that somewhere?
Search for MLIR in the Nx project. Once that is done, we can start thinking about quantization!
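To make the idea concrete, here is a minimal, language-agnostic sketch (in Python/NumPy, not Nx/EXLA API) of symmetric per-tensor int8 weight quantization, the kind of transform being discussed. All names are illustrative.

```python
# Hypothetical sketch of symmetric int8 quantization: w ~= scale * q.
# This is NOT Nx/EXLA code -- just an illustration of the memory savings.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights into int8 with a single per-tensor scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print(w.nbytes // q.nbytes)  # int8 storage is 4x smaller than float32
```

The per-element rounding error is bounded by `scale / 2`, which is why quantized large models stay usable while fitting in a fraction of the memory.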
Adding GPTQ and GGUF/GGML would greatly boost model accessibility in the Elixir ecosystem.
Just to clarify, @josevalim, do you think Nx will be able to run GGUF models, even after the MLIR updates? I don't believe XLA will work with GGUF models out of the box, since that's a quantized model file format for llama.cpp.
Those are two separate problems. Once we support quantization, then we may be able to run GGUF, as long as someone writes a deserializer for it.
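For a sense of what "writes a deserializer" involves, here is a rough sketch of the first step: parsing the fixed GGUF header (little-endian magic, version, tensor count, metadata key/value count, per the GGUF spec). Written in Python for illustration only; the field names are my own.

```python
# Hypothetical sketch: read the fixed GGUF file header.
# Layout per the GGUF spec: 4-byte magic "GGUF", uint32 version,
# uint64 tensor count, uint64 metadata kv count (all little-endian).
import struct

def read_gguf_header(data: bytes) -> dict:
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }

# Fabricated header bytes for demonstration (version 3, 2 tensors, 5 kv pairs):
demo = struct.pack("<4sIQQ", b"GGUF", 3, 2, 5)
print(read_gguf_header(demo))
```

The hard part comes after the header: decoding the metadata key/value section and the quantized tensor blocks (Q4_K, Q8_0, etc.) into tensors the backend can consume.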
Blocking on https://github.com/elixir-nx/nx/issues/1452.
Running larger models on Bumblebee is difficult for people with lower-tier hardware.