LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Support for newer I-Quant formats #722

Closed by InferenceIllusionist 5 months ago

InferenceIllusionist commented 5 months ago

Hi there, I'm trying to run inference on an IQ4_XS quant (see this PR for more info: https://github.com/ggerganov/llama.cpp/pull/5747). Koboldcpp loads the model and immediately crashes. Before the window closes, the error message mentions an unhandled exception in koboldcpp. I also noticed that the amount of CUDA memory being reserved is significantly higher than the size of the model (9.93 GB).

This issue also occurs with other newer quant formats such as IQ2_M. Are there any workarounds I could try? Thanks!

(Screenshot attached: screenshot-iq4_xs)
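As a quick sanity check before loading a file like this, one can parse the fixed GGUF header to confirm the file is well-formed (the quant type itself lives in the metadata key-value section that follows the header). Below is a minimal sketch assuming the standard GGUF layout: a 4-byte `GGUF` magic, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata KV count. The sample bytes are fabricated purely for illustration.

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the start of a file.

    Layout (little-endian, per the GGUF spec):
      4-byte magic 'GGUF', uint32 version,
      uint64 tensor_count, uint64 metadata_kv_count.
    """
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "kv_pairs": n_kv}

# Fabricated example header: version 3, 291 tensors, 24 metadata KV pairs.
sample = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(sample))  # {'version': 3, 'tensors': 291, 'kv_pairs': 24}
```

In practice you would read the first 24 bytes of the `.gguf` file (`open(path, "rb").read(24)`) and pass them in; a valid header does not by itself guarantee the loader supports the file's quant format, which is what the check in v1.59.1 was tripping over.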

LostRuins commented 5 months ago

The newest quant format supported in v1.59.1 is IQ3_S. Anything newer will require waiting for the next release, which should be out before the end of next week.

InferenceIllusionist commented 5 months ago

Understood, thanks for letting me know. Happy to help test once the next release is out and appreciate your work on this.

LostRuins commented 5 months ago

Should be working in the latest version!

InferenceIllusionist commented 5 months ago

Wow that was quick! Both IQ2_S and IQ4_XS are working. No issues at all after testing. Appreciate the fast follow-up and all the other new exciting features in the latest release.