Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Bug: imatrix quant gguf models (e.g. IQ3_XS, IQ2_M) not using NV GPU properly with `llamafile-0.8.14` #603

Closed: wingenlit closed this 2 hours ago

wingenlit commented 4 hours ago

Contact Details

No response

What happened?

When using importance matrix (imatrix) quantized models with llamafile, with the model fully offloaded to the GPU, the GPU appears to be doing very little compute while the CPU is under heavy load.
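
For context, a minimal sketch of the kind of invocation that exhibits the symptom (the model filename is a placeholder; `-ngl 999` offloads all layers to the GPU):

```sh
# Run an imatrix-quantized model fully offloaded to the GPU
# (model filename is a placeholder).
./llamafile -m model.IQ3_XS.gguf -ngl 999 -p "hello"

# In another terminal, watch utilization; the report describes
# low GPU usage and heavy CPU load in this configuration.
nvidia-smi
```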


Version

llamafile v0.8.14

What operating system are you seeing the problem on?

Linux, Windows

Relevant log output

No response

jart commented 2 hours ago

We've needed to disable the IQ quants by default in GGML CUDA for the time being, due to issues surrounding code size and compile times. We're hoping to find a better workaround soon. But in the meantime, you can fix this by passing `--recompile --iq` to llamafile (assuming you've built it at HEAD), which will compile a ggml-cuda module for your system that has support for IQ.
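
For example, using the `--recompile --iq` flags described above (the model filename is a placeholder):

```sh
# Rebuild the ggml-cuda module for this system with IQ quant support,
# then run with all layers offloaded to the GPU.
# Requires llamafile built at HEAD; model filename is a placeholder.
./llamafile --recompile --iq -m model.IQ2_M.gguf -ngl 999
```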