Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Bug: imatrix quant gguf models (e.g. IQ3_XS, IQ2_M) not using NV GPU properly with `llamafile-0.8.14` #603

Closed: wingenlit closed this 2 hours ago

wingenlit commented 4 hours ago

Contact Details

No response

What happened?

When using importance matrix (imatrix) quantized models with llamafile, with the model fully offloaded to the GPU, the GPU appears to be doing very little compute while the CPU is under heavy load.
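
For context, a minimal sketch of the kind of invocation that exhibits the symptom (the model filename is a placeholder; `-ngl 999` offloads all layers to the GPU):

```sh
# Run an imatrix-quantized model fully offloaded to the GPU
# (model filename is a placeholder).
./llamafile -m model.IQ3_XS.gguf -ngl 999 -p "hello"

# In another terminal, watch utilization; the report describes
# low GPU usage and heavy CPU load in this configuration.
nvidia-smi
```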


Version

llamafile v0.8.14

What operating system are you seeing the problem on?

Linux, Windows

Relevant log output

No response

jart commented 2 hours ago

We've needed to disable the IQ quants by default in GGML CUDA for the time being, due to issues surrounding code size and compile times. We're hoping to find a better workaround soon. But in the meantime, you can fix this by passing `--recompile --iq` to llamafile (assuming you've built it at HEAD), which will compile a ggml-cuda module for your system that has support for IQ.
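
For example, using the `--recompile --iq` flags described above (the model filename is a placeholder):

```sh
# Rebuild the ggml-cuda module for this system with IQ quant support,
# then run with all layers offloaded to the GPU.
# Requires llamafile built at HEAD; model filename is a placeholder.
./llamafile --recompile --iq -m model.IQ2_M.gguf -ngl 999
```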