ggerganov / llama.cpp

LLM inference in C/C++

GPTQ / ExLlamaV2 (EXL2) quantisation #4165

Closed · 0xdevalias closed this issue 5 months ago

0xdevalias commented 10 months ago

Feature Description

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Motivation

It sounds like it's a fast/useful quantisation method:

Possible Implementation

N/A

BarfingLemurs commented 10 months ago

https://github.com/ggerganov/llama.cpp/issues/1256 AFAIK, you can mix all the k-quants in the same model with no performance issue, but no one has felt a need to make a preset lower than 3.4 bpw (Q2_K, which is mostly Q3_K); a rough sketch of how such a mix is chosen follows at the end of this comment.

https://github.com/ggerganov/llama.cpp/pull/1106 The current quantizations already match or beat GPTQ:

"As far as I can tell, we are now on par with best known GPTQ result for 7B, and better for 13B by about 0.05."

If you are hoping for faster CUDA, https://github.com/JohannesGaessler has said he wants to make improvements, but will be busy until the end of December.
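On the k-quant mixing point above, here is a rough sketch of how a mixed low-BPW preset can work: spend more bits on the tensors that hurt perplexity most when squeezed, and quantize the rest aggressively. The tensor names and the mapping are illustrative assumptions, not llama.cpp's actual preset logic.

```cpp
// Hypothetical sketch of a mixed k-quant preset: pick a per-tensor type based
// on the tensor name. The names and choices below are assumptions for
// illustration, not llama.cpp's real mapping.
#include <string>

enum class qtype { Q2_K, Q3_K, Q4_K };  // stand-ins for the ggml k-quant types

qtype pick_type_low_bpw(const std::string & tensor_name) {
    // keep the most quantization-sensitive tensors at higher precision...
    if (tensor_name.find("attn_v.weight") != std::string::npos) return qtype::Q4_K;
    if (tensor_name.find("output.weight") != std::string::npos) return qtype::Q4_K;
    // ...and use a cheaper type everywhere else, which is why a "Q2_K" preset
    // can end up mostly Q3_K and land around 3.4 bpw overall
    return qtype::Q3_K;
}
```

A preset below 3.4 bpw would mostly mean shifting more of these defaults down to Q2_K and measuring how much perplexity suffers.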

Green-Sky commented 10 months ago

IIRC, only LLaMA and, to a degree, Falcon use a mix of k-quants that has been hand-optimized for low perplexity. So there might still be unused optimizations on the table for k-quant mixes.

edit: this info might be out-of-date, so if anyone has an update on that, please let me know :)

8XXD8 commented 10 months ago

With exl2 you can fit a 70B model into 24 GB of VRAM, but for a 70B Q2 quant even 32 GB is not enough. If k-quants have similar quality to exl2 at the same BPW, then it might be worthwhile to go below Q2.

Green-Sky commented 10 months ago

Just to compare, I am running a 70B model on 32 GiB of RAM + 8 (7) GiB of VRAM:

llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q3_K - Small
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 27.86 GiB (3.47 BPW)
llm_load_print_meta: general.name     = LLaMA v2

So Q3_K_S -> 3.47 BPW, and that's basically the lowest I would go; anything below that really shows.

It would be very cool if we could compare the perplexity values between exl2 and llama.cpp at the same BPW.
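As a back-of-the-envelope sanity check on the sizes in this thread (my own arithmetic, not output from any tool): weight memory is roughly params × bits-per-weight / 8, ignoring the KV cache and runtime overhead.

```cpp
// Rough weight-size arithmetic: params * bpw / 8 bytes. The 2.50 bpw figure is
// an assumed EXL2 rate for squeezing a 70B model into a 24 GiB card; the
// 3.47 bpw line reproduces the Q3_K_S size printed above.
#include <cstdio>

int main() {
    const double GiB    = 1024.0 * 1024.0 * 1024.0;
    const double params = 68.98e9;  // LLaMA v2 70B, as reported by llama.cpp

    printf("~2.50 bpw: %.2f GiB\n", params * 2.50 / 8.0 / GiB);  // about 20.1 GiB
    printf(" 3.47 bpw: %.2f GiB\n", params * 3.47 / 8.0 / GiB);  // about 27.9 GiB
    return 0;
}
```

For the perplexity comparison itself, llama.cpp already ships a perplexity example that could be run on the same evaluation text as an EXL2 measurement; the fiddly part is keeping the dataset and context length identical so the numbers are actually comparable.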

KerfuffleV2 commented 10 months ago

It might be possible to add just dequantization support for some of those other formats. Quantizing can be complicated; dequantizing usually isn't too bad. Also, those projects probably already have stuff like CUDA kernels available that could be yoinked if they have a compatible license.
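To make that concrete, here is a minimal dequantization sketch assuming a simplified GPTQ-like 4-bit layout (eight 4-bit values packed per uint32, one float scale and one integer zero-point per group). Real GPTQ and EXL2 tensors differ in packing order and metadata, so treat this as an illustration of why the dequantize side is manageable, not as a drop-in loader.

```cpp
// Hypothetical sketch: dequantize GPTQ-style 4-bit weights.
// Assumed layout: qweight packs eight 4-bit values per uint32_t, with one
// scale and one zero-point per group of group_size weights.
#include <cstdint>
#include <vector>

std::vector<float> dequantize_q4(const std::vector<uint32_t> & qweight,
                                 const std::vector<float>    & scales,
                                 const std::vector<uint8_t>  & zeros,
                                 int n, int group_size) {
    std::vector<float> out(n);
    for (int i = 0; i < n; ++i) {
        const uint32_t packed = qweight[i / 8];
        const int q = (packed >> (4 * (i % 8))) & 0xF;  // extract the 4-bit value
        const int g = i / group_size;                   // which group i belongs to
        out[i] = scales[g] * (q - static_cast<int>(zeros[g]));
    }
    return out;
}
```

A CUDA version is essentially the same loop with one thread per output element, which is why reusing existing kernels (license permitting) is attractive.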

agnosticlines commented 8 months ago

Is this something the developers are interested in / willing to add support for? Just trying to understand what's out there currently in terms of Mac LLM tech. I know this is no small feat; I'm just trying to see whether it's on the roadmap/pipeline or whether it's something the developers specifically do not want to implement.

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

sammcj commented 5 months ago

Does this mean llama.cpp won’t be adding support for exl2 or GPTQ?

ggerganov commented 5 months ago

Does this mean llama.cpp won’t be adding support for exl2 or GPTQ?

See https://github.com/ggerganov/llama.cpp/issues/4704#issuecomment-2033642469

txhno commented 2 months ago

Still seeking EXL2 support!