elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)
Apache License 2.0

Support for model quantization #249

Closed: a-alhusaini closed this issue 10 months ago

a-alhusaini commented 10 months ago

Running larger models with Bumblebee is difficult for people with lower-tier hardware.

Adding support for GPTQ and GGUF/GGML would greatly boost model accessibility in the Elixir ecosystem.

josevalim commented 10 months ago

This is a feature that will be added directly to Nx/EXLA. Currently WIP so stay tuned. :)

philpax commented 8 months ago

Hi there! I was wondering what the current status of this is; I found https://elixirforum.com/t/high-scale-performance-of-llms-needed-features/58562 but couldn't find any tracking issues for Bumblebee or Nx.

I think quantization support is critical to making Bumblebee a viable option for developers and deployment; it makes running larger models on consumer hardware actually tractable.
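
For anyone new to the idea, here is a rough sketch of symmetric int8 quantization using today's Nx API (illustrative only: real schemes use per-block or per-channel scales, and this is not a proposal for what Nx's quantization API should look like):

```elixir
# Map float weights into [-127, 127] with a single scale factor,
# store them as int8, and dequantize by multiplying the scale back.
weights = Nx.tensor([0.12, -1.5, 0.7, 3.2])

scale = weights |> Nx.abs() |> Nx.reduce_max() |> Nx.divide(127)
quantized = weights |> Nx.divide(scale) |> Nx.round() |> Nx.as_type(:s8)
dequantized = quantized |> Nx.as_type(:f32) |> Nx.multiply(scale)
```

Each stored weight drops from 4 bytes to 1 at the cost of a small reconstruction error in `dequantized`, which is why quantized checkpoints of large models can fit on consumer hardware.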

Thanks in advance!

josevalim commented 8 months ago

We are currently updating XLA in Nx and the new version supports quantization. :) So hopefully sooner rather than later!

benbot commented 7 months ago

Is there an issue tracking that somewhere?

josevalim commented 7 months ago

Search for MLIR in the Nx project. Once that is done, we can start thinking about quantization!

shaqq commented 5 months ago

> Adding support for GPTQ and GGUF/GGML would greatly boost model accessibility in the Elixir ecosystem.

Just to clarify, @josevalim, do you think Nx will be able to run GGUF models, even after the MLIR updates? I don't believe XLA will work with GGUF models out of the box, since GGUF is a quantized model file format for llama.cpp:

https://github.com/ggerganov/ggml/blob/master/docs/gguf.md

josevalim commented 5 months ago

Those are two separate problems. Once we support quantization, we may be able to run GGUF, as long as someone writes a deserializer for it.
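
For reference, the fixed header layout from the GGUF spec linked above maps naturally onto Elixir binary pattern matching. A minimal sketch of the first step such a deserializer might take (module and function names are hypothetical, and this only covers the fixed-size header of GGUF v2+, not the metadata key/value pairs or tensor data that follow):

```elixir
defmodule GGUF.Header do
  @moduledoc "Hypothetical parser for the fixed GGUF file header."

  defstruct [:version, :tensor_count, :metadata_kv_count]

  # Per the GGUF spec: a 4-byte magic "GGUF", then a little-endian
  # uint32 version followed (since v2) by uint64 tensor and
  # metadata key/value counts.
  def parse(<<"GGUF", version::unsigned-little-32,
              tensor_count::unsigned-little-64,
              metadata_kv_count::unsigned-little-64, _rest::binary>>)
      when version >= 2 do
    {:ok,
     %__MODULE__{
       version: version,
       tensor_count: tensor_count,
       metadata_kv_count: metadata_kv_count
     }}
  end

  def parse(_binary), do: {:error, :not_a_gguf_file}
end

# Usage: GGUF.Header.parse(File.read!("model.gguf"))
```

Parsing the metadata (which encodes the quantization type of each tensor) and the tensor blobs themselves is the larger part of the work, but it follows the same pattern-matching approach.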

SichangHe commented 1 week ago

Blocking on https://github.com/elixir-nx/nx/issues/1452.