We've had to disable the IQ quants by default in GGML CUDA for the time being, due to issues surrounding code size and compile times. We're hoping to find a better workaround soon. In the meantime, you can fix this by passing --recompile --iq to llamafile (assuming you've built it at HEAD), which will compile a ggml-cuda module for your system that has support for IQ.
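For example, a full invocation might look like the sketch below. The --recompile and --iq flags come from the comment above; the model filename and the -ngl value are placeholders, not taken from this report:

```
# rebuild the ggml-cuda module with IQ quant support, then run fully offloaded
# (assumes a llamafile binary built at HEAD and a local CUDA toolkit;
#  the model path and layer count are hypothetical examples)
./llamafile --recompile --iq -ngl 999 -m model.IQ4_XS.gguf -p "hello"
```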
Contact Details
No response
What happened?
when using importance-matrix (IQ) quantized models on llamafile, with the model fully offloaded to the GPU, it appears that the GPU is not doing much compute, while the CPU is under heavy load.
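One way to observe this symptom (a suggested check, not a command from the report) is to poll GPU utilization while the model generates; an affected build shows near-zero GPU compute alongside high CPU usage:

```
# poll GPU and memory utilization once per second during generation
nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv -l 1
```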
notes:
- v0.8.9 and v0.8.6 are without the same problem
- the problem reproduces on v0.8.13, and appears in v0.8.12 and up

Version

llamafile v0.8.14
What operating system are you seeing the problem on?
Linux, Windows
Relevant log output
No response