ggerganov / llama.cpp

LLM inference in C/C++
MIT License

[Request/Enhancement] 1-bit quants #5390

Closed benxh1995 closed 6 months ago

benxh1995 commented 7 months ago


Feature Description

I would like to request an enhancement of the quantization process to allow 1-bit quants. They don't have to be SOTA, just usable enough for users.

Motivation

The motivation for this request is to give users with 8 GB or 16 GB of RAM access to the higher end of models (with 1-bit quants, a ~70B model should approximately fit in 16 GB of RAM).
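
For a rough sense of the numbers behind that claim, here is a back-of-the-envelope sketch (assumed parameter count and bits-per-weight values, ignoring the KV cache and the tensors that are usually kept at higher precision):

```cpp
// Rough estimate of weight storage for a ~70B-parameter model at various
// bits-per-weight. Illustrative only; real GGUF files are somewhat larger
// because some tensors (e.g. output, embeddings) stay at higher precision,
// and the KV cache needs additional memory at runtime.
#include <cstdio>

int main() {
    const double n_params = 70e9;                 // ~70B parameters (assumed)
    const double bpw[]    = {1.0, 1.5, 2.0, 4.0}; // candidate bits per weight
    for (double b : bpw) {
        const double gib = n_params * b / 8.0 / (1024.0 * 1024.0 * 1024.0);
        printf("%.1f bpw -> ~%.1f GiB for the weights alone\n", b, gib);
    }
    return 0;
}
```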

JohannesGaessler commented 7 months ago

I'm not that knowledgeable when it comes to efficient quantization techniques (@ikawrakow is the expert for that), but I don't expect 1-bit quantization to be usable. Do you have any references for papers or code where someone has previously achieved usable 1-bit quantization?

benxh1995 commented 7 months ago

I know of BitNet and QMoE:

https://arxiv.org/abs/2310.16795

https://arxiv.org/abs/2310.11453

And there is this approach which is similar to GPTQ: https://arxiv.org/abs/2310.00034

benxh1995 commented 7 months ago

Highly relevant fresh paper describing binarization (1-bit quantization) SOTA: https://huggingface.co/papers/2402.04291

ikawrakow commented 7 months ago

@benxh1995 Have you ever interacted with a model that has a perplexity of 32? (That is the value for LLaMA-v2-7B from the SOTA paper you are quoting.) A different question: do you think that the 1-bit quantized LLaMA-v2-70B model with a perplexity of 8.4 will be competitive with a 4-bit quantized 7B model?

Don't get me wrong, the results of the paper are remarkable for 1-bit quantization, but that does not make them useful in practice. Btw., the current SOTA for 2-bit quantization has a perplexity of 3.94 for LLaMA-v2-70B; I guess putting that into the paper instead of the hopelessly outdated GPTQ 2-bit result would make the 1-bit result look much less impressive. In this repo you have a functioning 2-bit quantization with a LLaMA-v2-70B perplexity of 4.07. Have you tried it? If not, please do (you can download ready 2-bit quantized models from here). If you did, and you thought it was not adequate, you can be assured that you will like 1-bit models even less.

benxh1995 commented 7 months ago

@ikawrakow Yes sir, I regularly use the Yi iq2_xxs quants, as well as the Mixtral quants. I follow your work quite often; props to you for achieving what is pretty much SOTA. My motivation for this request was anything more that could be squeezed out, even at higher perplexity. Just as iq2_xxs is around 2.03(?) bpw, could an iq1_s be around 1.5-1.7 bpw, and would that be feasible?

I'm sorry for my ignorance. I'm just excited about the technology and squeezing out as much as possible out of constrained memory setups.

ghchris2021 commented 7 months ago

There are these now:

BiLLM: ...In response to this challenge, we present BiLLM, a groundbreaking 1-bit post-training quantization scheme tailored for pretrained LLMs. ...

https://huggingface.co/papers/2402.04291

https://arxiv.org/abs/2402.04291

https://github.com/Aaronhuang-778/BiLLM

AQLM:

... In this paper, we revisit the problem of "extreme" LLM compression--defined as targeting extremely low bit counts, such as 2 to 3 bits per parameter, from the point of view of classic methods in Multi-Codebook Quantization (MCQ)....

https://github.com/Vahe1994/AQLM

https://arxiv.org/abs/2401.06118

https://huggingface.co/BlackSamorez

nelsonhurstdev commented 7 months ago

Interesting. This may allow 120B models to run at decent speeds on consumer GPUs. This quant could be used to target 70B+ models; I feel like anything smaller may be useless.

ghchris2021 commented 7 months ago

> Interesting. This may allow 120B models to run at decent speeds on consumer GPUs. This quant could be used to target 70B+ models; I feel like anything smaller may be useless.

Yes, that's what I've been thinking. For 34B models in the near future, I can envision Q5-Q8 quants with multi-GPU setups being good enough for acceptable quality and performance. But I'd really like to run things in the 70-120B range locally this year or next, with good quality and useful performance, without assuming I'll have more than 24-32 GB of VRAM in a box, and ideally with less than that if feasible. So the 1-4 bits/parameter memory range is very interesting indeed.

I don't know enough about the research and its history to know whether this has been done or is even reasonable, but I find it intuitively odd to train at high bits per parameter and then jump through hoops to quantize down to 1-4 bits per parameter, without really knowing what was lost in the process, versus designing the model to train at 1-4 bit depths from the start and letting the training process "optimally set" each low-resolution weight.

But I can see why researchers with access to vast SOTA GPU training/inference farms could hardly care less about the VRAM problems of end users running inference, when they just want to publish SOTA maximum-quality results.

mechanicmuthu commented 7 months ago

Just guessing out loud: at 1 bpw, do we reach a point where bitwise operators can come into play to speed up the low-level computation?

ikawrakow commented 7 months ago

> Just guessing out loud: at 1 bpw, do we reach a point where bitwise operators can come into play to speed up the low-level computation?

Not for this variant; the quants take values of -1, 0, 1. If we one day arrive at the point where we can separate salient from non-salient weights, one would hope to be able to use binary quants for the non-salient part. This is what BiLLM does. But then again, looking at the massive difference in quantization error between this PR and BiLLM, that may turn out not to be worthwhile.

But given that you are bringing this up, are you dissatisfied with the performance? I get 212 t/s for TG-128 with a 7B model on my GPU (RTX 4080), which is ~60% higher than Q4_0. My best guess is that at this matrix multiplication speed, on the order of 40% of the time goes into thread synchronization and other kernels that are independent of the quantization used.
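
For illustration of the bitwise-operator idea: with pure binary (+1/-1) weights, which is not the ternary scheme described above, a dot product does reduce to XNOR plus popcount. A minimal sketch assuming C++20 and sign-packed activations (hypothetical, not code from this repo):

```cpp
// Binary (+1/-1) dot product via XNOR + popcount.
// a_bits, w_bits hold 64 sign bits each (bit = 1 means +1, bit = 0 means -1).
#include <cstdint>
#include <cstdio>
#include <bit>      // std::popcount (C++20)

static int binary_dot_64(uint64_t a_bits, uint64_t w_bits) {
    const int matches = std::popcount(~(a_bits ^ w_bits)); // XNOR counts agreeing signs
    return 2 * matches - 64;                                // +1 per match, -1 per mismatch
}

int main() {
    // Identical sign patterns give +64, complementary patterns give -64.
    printf("%d %d\n",
           binary_dot_64(0x0123456789abcdefULL, 0x0123456789abcdefULL),
           binary_dot_64(0x0123456789abcdefULL, ~0x0123456789abcdefULL));
    return 0;
}
```

For ternary {-1, 0, 1} weights this trick does not apply directly, since each weight needs roughly log2(3) ≈ 1.58 bits and the zero value breaks the sign-only encoding.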

JohannesGaessler commented 7 months ago

> Just guessing out loud: at 1 bpw, do we reach a point where bitwise operators can come into play to speed up the low-level computation?

I didn't test the performance of the new quantization myself, but generally speaking the gains from more efficient compute at low batch sizes are relatively small. I would expect bitwise operations to make a large difference only if there were custom matrix multiplication kernels for large batch sizes (like mul_mat_q) that are compute bound rather than I/O bound.
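
To put the I/O-bound point in concrete terms: at batch size 1, generating each token streams essentially the whole weight file from memory, so memory bandwidth rather than compute sets the ceiling on tokens per second. A back-of-the-envelope sketch with assumed, not measured, numbers:

```cpp
// Upper bound on single-stream token generation speed when it is limited
// purely by reading the weights once per token. Numbers are illustrative.
#include <cstdio>

int main() {
    const double model_bytes = 70e9 * 2.0 / 8.0; // e.g. 70B weights at 2 bpw (assumed)
    const double bw_bytes_s  = 1000e9;           // ~1 TB/s GPU memory bandwidth (assumed)
    printf("bandwidth-limited ceiling: ~%.0f t/s\n", bw_bytes_s / model_bytes);
    return 0;
}
```

Faster arithmetic on the unpacked values does not move this ceiling; only reading fewer bytes per weight (or batching more tokens per weight read) does.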

github-actions[bot] commented 6 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.