dvmazur / mixtral-offloading

Run Mixtral-8x7B models in Colab or consumer desktops
MIT License

Mixtral OffLoading/GGUF/ExLlamaV2, which approach to use? #11

Open LeMoussel opened 6 months ago

LeMoussel commented 6 months ago

I'm a bit lost among the different quantization approaches, such as GGUF, ExLlamaV2, and this project. Are they the same thing? Is one approach faster?

- GGUF: TheBloke/Mixtral-8x7B-v0.1-GGUF
- ExLlamaV2: turboderp/Mixtral-8x7B-instruct-exl2

lavawolfiee commented 6 months ago

No, they're not the same thing.

Regarding ExLlamaV2 and llama.cpp (GGUF), I think it depends on your setup. As far as I know, ExLlamaV2 is faster on GPU but doesn't support CPU inference. llama.cpp, on the other hand, can split layers between CPU and GPU, reducing VRAM usage, and supports pure CPU inference (it was initially developed for CPU inference). Both are optimized for fast LLM inference and do their job pretty well. Note that they also use different quantization methods.
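
To make the CPU/GPU split concrete, here is a minimal sketch using the llama-cpp-python bindings. The model filename and the `n_gpu_layers` / `n_ctx` values are placeholders (assumptions, not from this thread); you'd point it at whichever GGUF file you downloaded and pick `n_gpu_layers` to fit your VRAM, letting the remaining layers run on the CPU.

```python
# Sketch: split layers between GPU and CPU with llama-cpp-python.
# The model path below is a placeholder -- use the GGUF file you downloaded
# (e.g. from TheBloke/Mixtral-8x7B-v0.1-GGUF).
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-v0.1.Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=16,  # layers kept in VRAM; the rest stay in CPU RAM
    n_ctx=4096,       # context window
)

out = llm("Explain mixture-of-experts in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```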

As for this project, we focus specifically on optimizing inference for MoE-based models on consumer-grade GPUs. I can't tell you for sure right now when our method is faster or slower than the others; we're currently researching that. It's also important to note that we use HQQ quantization, which gives good quality but currently isn't very fast because it lacks optimized CUDA kernels. Our team is actively working on supporting other quantization methods along with fast kernels, and on further ways to improve inference speed and quality.
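
To give a rough feel for the "offloading for MoE models" part, below is a deliberately simplified sketch of the general idea in plain PyTorch. It is not this repo's code or API: the `ExpertCache` class, `max_on_gpu` parameter, and the toy experts are made up for illustration. The idea is that expert weights live in CPU RAM and only the experts a token actually routes to get copied onto the GPU, with a small LRU cache of recently used experts.

```python
# Simplified illustration of expert offloading for MoE layers (NOT this repo's API):
# keep all experts in CPU RAM, copy an expert to the GPU only when the router picks it,
# and evict the least-recently-used expert when the GPU cache is full.
import copy
from collections import OrderedDict

import torch
import torch.nn as nn


class ExpertCache:
    """Master copies of all experts stay on CPU; only a few live on the GPU at once."""

    def __init__(self, experts: nn.ModuleList, max_on_gpu: int, device: str = "cuda"):
        self.cpu_experts = experts      # master copies in CPU RAM
        self.max_on_gpu = max_on_gpu    # how many experts fit in VRAM at once
        self.device = device
        self.gpu_experts: "OrderedDict[int, nn.Module]" = OrderedDict()  # LRU order

    def get(self, expert_id: int) -> nn.Module:
        if expert_id in self.gpu_experts:
            self.gpu_experts.move_to_end(expert_id)  # mark as most recently used
            return self.gpu_experts[expert_id]
        if len(self.gpu_experts) >= self.max_on_gpu:
            self.gpu_experts.popitem(last=False)     # evict least recently used
        # Copy the CPU master and move the copy onto the GPU.
        gpu_copy = copy.deepcopy(self.cpu_experts[expert_id]).to(self.device)
        self.gpu_experts[expert_id] = gpu_copy
        return gpu_copy


# Toy usage: 8 tiny "experts", only 2 allowed on the GPU at a time.
device = "cuda" if torch.cuda.is_available() else "cpu"
experts = nn.ModuleList([nn.Linear(16, 16) for _ in range(8)])
cache = ExpertCache(experts, max_on_gpu=2, device=device)

x = torch.randn(1, 16).to(device)
for expert_id in [0, 3, 0, 5]:  # pretend these came from the MoE router
    y = cache.get(expert_id)(x)
```

In a real MoE layer this lookup happens for every selected expert on every token, so the transfers have to be cheap; that's part of why we combine offloading with quantization and are working on faster kernels.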

Therefore, I believe our method is useful at least when you don't have much GPU VRAM (e.g., in Google Colab) or when you want to fit a bigger model (with better quality) into it. We will do our best to implement new features and get back to you as fast as possible.