Closed by 0xdevalias 5 months ago
https://github.com/ggerganov/llama.cpp/issues/1256 AFAIK, you can mix all k-quants in the same model with no performance issue, but no one has felt a need to make a preset lower than 3.4 BPW (the Q2_K preset is mostly q3_K).
https://github.com/ggerganov/llama.cpp/pull/1106 The current quants are already competitive, performing on par with or better than GPTQ:
"As far as I can tell, we are now on par with best known GPTQ result for 7B, and better for 13B by about 0.05."
If you are hoping for faster CUDA, JohannesGaessler (https://github.com/JohannesGaessler) says he wants to make improvements, but will be busy until the end of December.
IIRC, only LLaMA and, to a degree, Falcon use a mix of k-quants that has been hand-optimized for low perplexity, so there may still be unexploited optimizations on the table for k-quant mixes.
edit: this info might be out-of-date, so if anyone has an update on that, please let me know :)
With exl2 you can fit a 70B model into 24 GB of VRAM, but for a 70B Q2 even 32 GB is not enough. If k-quants have similar quality to exl2 at the same BPW, then it might be worthwhile to go below Q2.
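A quick back-of-the-envelope check of that claim (weights only, ignoring KV cache and runtime overhead — the real VRAM budget is tighter):

```python
# BPW at which a 70B model's weights alone fit in 24 GiB of VRAM.
# Ignores KV cache, activations, and framework overhead.
params = 70e9
vram_gib = 24

max_bpw = vram_gib * 1024**3 * 8 / params
print(f"{max_bpw:.2f} BPW")  # → 2.95 BPW
```

So exl2 presets around 2.4–2.5 BPW leave headroom for the cache, which is why a sub-Q2 k-quant mix would be needed to match that on the llama.cpp side.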
Just to compare, I am running a 70B on 32 GB RAM + 8 (7) GB VRAM:

```
llm_load_print_meta: model type   = 70B
llm_load_print_meta: model ftype  = mostly Q3_K - Small
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size   = 27.86 GiB (3.47 BPW)
llm_load_print_meta: general.name = LLaMA v2
```
So Q3_K_S → 3.47 BPW, and that's basically the lowest quality I would go; anything below that really shows.
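The 3.47 BPW figure in that log follows directly from the reported size and parameter count:

```python
# Sanity-check the BPW from the llm_load_print_meta output above:
# bits per weight = (model size in bits) / (parameter count).
size_gib = 27.86    # "model size = 27.86 GiB"
params = 68.98e9    # "model params = 68.98 B"

bpw = size_gib * 1024**3 * 8 / params
print(f"{bpw:.2f} BPW")  # → 3.47 BPW
```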
It would be very cool if we could compare the perplexity values between exl2 and llama.cpp at the same BPW.
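For such a comparison to be meaningful, both tools would have to compute perplexity the same way — the standard definition is the exponential of the mean per-token negative log-likelihood over the same eval text at the same context length. A minimal sketch of that formula (the per-token NLL values here are made up for illustration):

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nlls) / len(nlls))

# Hypothetical per-token NLLs; mean is 1.0, so PPL = e
print(perplexity([1.2, 0.8, 1.0]))  # ≈ 2.718
```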
It might be possible to add just dequantization support for some of those other formats. Quantizing can be complicated, dequantizing usually isn't too bad. Also those projects probably already have stuff like CUDA kernels available that could be yoinked if they have a compatible license.
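To illustrate why the dequantization direction is the easy one: llama.cpp's simplest quant, Q8_0, stores blocks of 32 int8 values plus one scale (fp16 on disk), and dequantizing is a single multiply per value. A rough Python sketch — the quantizer here is a simplification for the round-trip demo, not the real implementation:

```python
BLOCK = 32  # values per block, as in llama.cpp's Q8_0

def dequantize_q8_0(scale, qs):
    """Dequantize one Q8_0-style block: x[i] = scale * q[i]."""
    assert len(qs) == BLOCK
    return [scale * q for q in qs]

# Round-trip demo: simplistic absmax quantize, then dequantize.
xs = [0.5 * i for i in range(BLOCK)]
scale = max(abs(x) for x in xs) / 127
qs = [round(x / scale) for x in xs]
approx = dequantize_q8_0(scale, qs)
```

The more elaborate k-quants add super-block scales and minima, but the dequant path stays a handful of multiply-adds — the hard, search-heavy logic all lives on the quantization side.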
Is this something the developers are interested in or willing to add support for? I'm just trying to understand what's out there currently in terms of Mac LLM tech. I know this is no small feat; I'm only trying to find out whether it's in the roadmap/pipeline or something the developers specifically do not want to implement.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Does this mean llama.cpp won’t be adding support for exl2 or GPTQ?
> Does this mean llama.cpp won’t be adding support for exl2 or GPTQ?
See https://github.com/ggerganov/llama.cpp/issues/4704#issuecomment-2033642469
Still seeking EXL2 support!
Feature Description
Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Motivation
It sounds like it's a fast/useful quantisation method:
Possible Implementation
N/A