Aaronhuang-778 / BiLLM

(ICML 2024) BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
https://arxiv.org/abs/2402.04291
MIT License

Request: please consider evaluating pareto-optimality of BiLLM #1

justheuristic opened this issue 9 months ago (Open)

justheuristic commented 9 months ago

Hi!

Thank you for the paper! It is inspiring that you can compress weights to about 1 bit and the model still works better than random. A practical sub-2-bit quantization algorithm would be a great boon to researchers and practitioners alike.

I'd appreciate it if you could further study the practical significance of BiLLM by evaluating it in terms of Pareto-optimality [1]. To do so, please compare the perplexity of different methods at the same total model size (GB), but applied to different models.

For instance, one of the results you report in Table 2 has Llama-2 70B @ 1.08 bits/weight scoring a perplexity of 8.41. This corresponds to a total footprint of roughly 8.8 GiB (70e9 * 1.08/8 / 2**30).

In comparison, other ways a practitioner could reach an ~8.8 GiB footprint are (the arithmetic is sketched after the list):

  1. take Llama-2 13B and quantize it to ~5.8 bits (13e9 * 5.8/8 / 2**30)
  2. take Llama-2 7B and quantize it to ~10.7 bits (7e9 * 10.7/8 / 2**30)
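
For reference, a minimal sketch of the footprint arithmetic above (rounded parameter counts of 70e9 / 13e9 / 7e9; embeddings and quantization metadata such as scales are ignored):

```python
# Rough weight-only footprint arithmetic; ignores embeddings and
# quantization metadata (scales, zero-points, grouping indices).
GIB = 2 ** 30

def footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Total weight storage in GiB at the given average bit-width."""
    return n_params * bits_per_weight / 8 / GIB

print(f"{footprint_gib(70e9, 1.08):.2f} GiB")   # ~8.80 GiB, Llama-2 70B via BiLLM
print(f"{footprint_gib(13e9, 5.80):.2f} GiB")   # ~8.78 GiB, Llama-2 13B at ~5.8 bits
print(f"{footprint_gib(7e9, 10.70):.2f} GiB")   # ~8.72 GiB, Llama-2 7B at ~10.7 bits
```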

This comparison would offer a more unified (and stricter) way to compare extreme (1-2 bit) quantization algorithms. Notably, some recent works ([2, 3, 4]) that present 2-bit quantization schemes report that they are not Pareto-optimal at 2 bits (i.e. it is still better to take a smaller model and quantize it to 3 or 4 bits), and only achieve parity at, e.g., 2.5-3 bits [2]. By the way, [2, 3, 4] would also make for interesting baselines to compare BiLLM against.

If it turns out that BiLLM is not Pareto-optimal at 1.08 bits, it would be insightful to learn at what bitwidth it does become Pareto-optimal.

[1] https://arxiv.org/abs/2212.09720
[2] https://arxiv.org/abs/2401.06118
[3] https://cornell-relaxml.github.io/quip-sharp/
[4] https://arxiv.org/abs/2307.13304

irthomasthomas commented 9 months ago

@justheuristic See, for example: llama-2-13b at 4.90 bits is 7.86 GB and scores a perplexity of 4.3 with exllamav2.

https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/

justheuristic commented 9 months ago

Thanks!

It would also be interesting to go in the opposite direction: find a Pareto-optimal spot for BiLLM, even if it sits at a different bitwidth.

If any of the authors have looked (or plan to look) in this direction, I'd appreciate it if you shared what you find.

Aaronhuang-778 commented 9 months ago

Hi, @justheuristic, thank you for your request!

In practical applications, it is important to study how to balance LLM size against bit-width, as [1] and [2] did when exploring quantization between 2 and 4 bits. BiLLM is designed to push the limits of compressing LLMs toward 1-bit quantization, an extreme regime in which it is already challenging for post-training quantization (PTQ) methods to produce reasonable outputs, and it surpasses existing binary PTQ methods in that regime.

As illustrated in Figure 1 of the paper, model capability collapses near 1 bit, producing an extreme marginal effect, so Pareto-optimality across different model sizes and bit-widths was not the primary objective of BiLLM. That said, we agree that a Pareto exploration of binary and mixed-precision non-binary quantization across model scales would be a challenging and meaningful next step :-) .

Binary quantization and general bit-width quantization (2 to 8 bits) differ fundamentally in how weights are represented. Although BiLLM's strategy is specifically tailored to binary weights, we have also tested it at other bit-widths and report some preliminary results below.

| BiLLM-LLaMA-2 | Weight Bits | WikiText-2 (PPL) |
| --- | --- | --- |
| 7B | 2.00 | 7.62 |
| 7B | 2.28 | 6.94 |
| 7B | 4.26 | 5.52 |
| 13B | 2.00 | 6.41 |
| 13B | 2.29 | 5.89 |
| 13B | 4.28 | 4.91 |
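
To connect these rows with the Pareto framing above, here is a rough sketch of the corresponding weight-only footprints (assuming rounded parameter counts of 7e9 and 13e9 and ignoring quantization metadata such as scales):

```python
# Weight-only footprints for the preliminary BiLLM results above
# (rounded parameter counts; scales and other metadata not included).
GIB = 2 ** 30

def footprint_gib(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / GIB

rows = [
    ("7B",  7e9,  2.00, 7.62), ("7B",  7e9,  2.28, 6.94), ("7B",  7e9,  4.26, 5.52),
    ("13B", 13e9, 2.00, 6.41), ("13B", 13e9, 2.29, 5.89), ("13B", 13e9, 4.28, 4.91),
]
for name, n, bits, ppl in rows:
    gib = footprint_gib(n, bits)
    print(f"LLaMA-2 {name} @ {bits:.2f} bits -> {gib:.2f} GiB, WikiText-2 PPL {ppl}")
```

Under these assumptions, for example, 13B at 2.29 bits and 7B at 4.26 bits both land at roughly 3.5 GiB of weights, so those two rows can be compared directly in the Pareto sense.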

BiLLM explores a PTQ strategy for 1-bit quantization; since PTQ involves no training or backpropagation, we did not compare our method against quantization-aware training (QAT) approaches. We also found that the latest quantization methods achieve good performance at 2 to 4 bits, but because they were not designed for 1 bit, they all suffer from precision collapse at that level. We will keep following the latest advanced methods to strengthen our subsequent research.

[1] https://arxiv.org/abs/2212.09720

[2] https://arxiv.org/pdf/2401.06118.pdf