@ggerganov "so for start we can do it just on the model tensors. The intermediate tensors during the evaluation can remain quantized using the existing approach, so that the evaluation is efficient." - do you mean, that Q4_0 quantize not only weights, but activations too now? Don't quite understand meaning of "model tensors"...
Might be worth reading:
A Survey of Quantization Methods for Efficient Neural Network Inference https://arxiv.org/pdf/2103.13630.pdf
Low-bit Quantization of Neural Networks for Efficient Inference deals with 4-bit quantization specifically.
As a smaller step, I can think of these optimizations:
Use max(abs(x_i))/8 instead of max(abs(x_i))/7 as the scaling factor, then flip the sign of all values depending on whether the x_i with the maximum amplitude was positive or negative. This allows us to use the full range of -8..+7.
This paper is also relevant: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, although this one deals with int8.
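Here is a minimal Python sketch of that idea (the NumPy layout and function names are just illustrative, not existing ggml code):

import numpy as np

def quantize_block_q4_signflip(x):
    # Sketch: scale by max(abs(x))/8 and flip the whole block's sign so the
    # extreme value lands exactly on -8, using the full -8..+7 range.
    amax_idx = int(np.argmax(np.abs(x)))
    flip = 1.0 if x[amax_idx] <= 0 else -1.0   # make the extreme value negative
    d = abs(float(x[amax_idx])) / 8.0
    if d == 0.0:
        return d, flip, np.zeros(len(x), dtype=np.int8)
    q = np.clip(np.round(flip * x / d), -8, 7).astype(np.int8)
    return d, flip, q

def dequantize_block_q4_signflip(d, flip, q):
    return flip * q.astype(np.float32) * d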
0.0857 0.1714 0.2571 0.6
This should read 0.0857, 0.1714, 0.3428, 0.6, correct? It seems to me that the C++ code uses round(), not floor(), so the third value will be rounded up.
However, if we choose the scaling factor to be 0.7 instead,
This should read 0.1, correct?
I came up with a script that's able to compute RMS for various quantization methods - maybe it will come handy for experimenting: https://gist.github.com/prusnak/f54f8f33503458ca1aa9883f71897072
I was experimenting with grid search to find a better offset and scaling factor, but it does not seem to produce much better results than simply doing Q4_1. This doesn't justify making the whole process much slower (n_search^2 slower).
Pseudocode:
n_search = 30
data_min = min(data)
data_max = max(data)
search_step = (data_max - data_min) / n_search
for min_value in range(data_min, data_max, search_step):
    for max_value in range(min_value + search_step, data_max, search_step):
        perform Q4_1 but use min_value as offset and (max_value - min_value) as scaling_factor
        measure RMS
        when RMS is better than everything we've seen so far, store the result of this Q4_1 run
return the best Q4_1 run
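A self-contained Python version of that grid search, as a rough sketch (the 0..15 level mapping inside q4_1_rms is my assumption of a Q4_1-style round trip, not the exact ggml code):

import numpy as np

def q4_1_rms(data, offset, scale):
    # Quantize to 16 levels between offset and offset+scale, dequantize, return RMS error.
    if scale <= 0:
        return float(np.sqrt(np.mean((data - offset) ** 2)))
    q = np.clip(np.round((data - offset) / scale * 15), 0, 15)
    deq = q / 15 * scale + offset
    return float(np.sqrt(np.mean((data - deq) ** 2)))

def grid_search_q4_1(data, n_search=30):
    data_min, data_max = float(data.min()), float(data.max())
    step = (data_max - data_min) / n_search
    # start from plain Q4_1 (offset = min, scale = full range)
    best = (q4_1_rms(data, data_min, data_max - data_min), data_min, data_max - data_min)
    for i in range(n_search):
        lo = data_min + i * step
        for j in range(i + 1, n_search + 1):
            hi = data_min + j * step
            rms = q4_1_rms(data, lo, hi - lo)
            if rms < best[0]:
                best = (rms, lo, hi - lo)
    return best  # (rms, offset, scale)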
Maybe someone can come up with a better grid search?
I also found the Lloyd-Max algorithm, but this one creates non-uniform quantization, which is a no-go for our use case, I assume. Is that correct?
I'm playing around with local search for the q4_1 parameters now, with something like the following approximately in place of the inner loop of quantize_row_q4_1:
round_block(pp, x + i*QK, min, d);
float err = sq_error(pp, x + i*QK, min, d), err0 = err;
int step_count = 0;
while (1) {
    ++step_count;
    // const float next_mins[4] = { min*1.001f, min/1.001f, min, min };
    // const float next_ds[4]   = { d, d, d*1.001f, d/1.001f };
    for (int t = 0; t < 16; ++t) { // try 16 candidate (min, d) perturbations
        // const float next_min = next_mins[t];
        // const float next_d   = next_ds[t];
        const float next_min = min * (0.99f + 0.0002f*(rand()%100));
        const float next_d   = d   * (0.99f + 0.0002f*(rand()%100));
        round_block(pp, x + i*QK, next_min, next_d);
        float next_err = sq_error(pp, x + i*QK, next_min, next_d);
        if (next_err < err) {
            min = next_min;
            d   = next_d;
            err = next_err;
            goto quantize_row_q4_1_opt_next;
        }
    }
    break;
quantize_row_q4_1_opt_next:;
}
static float rer = 0.0f;
rer = 0.001*(err/err0) + 0.999*rer;
printf("q: %d steps, err ratio %.3f, running %.3f\n", step_count, err/err0, rer);
round_block(pp, x + i*QK, min, d);
I found that the square error is indeed reduced by this, in a way that's quite sensitive to the bound on the inner candidate loop (16 here; higher means it tries more directions to improve in at each step). At 4, I get an error ratio of about 0.93 (with the random walk being just a little better than the commented-out fixed directions); at 16, this is down to about 0.83, and at 64 it goes all the way down to 0.7. Obviously this is much slower than the preexisting code, so we'll have to wait for a while to see whether the lower quantization error actually translates to lower perplexity or anything.
(Yes, I'm aware I could do better picking a random direction, or even N deterministic ones, than that. I promise I'll make it less silly if it ever makes it into a PR.)
I came up with a script that's able to compute RMS for various quantization methods - maybe it will come handy for experimenting: https://gist.github.com/prusnak/f54f8f33503458ca1aa9883f71897072
@prusnak Do you know which distribution the llama weights have? Why do you use a uniform distribution for tests? I suppose the weights have a normal-ish (or at least non-uniform) distribution, so we could get better results with non-uniform quantization.
Here is my suggestion. Currently with Q4_1 (QK=32) we have this size reduction:
32*4 + 2*32 = 192 bits (compressed)
32*16 = 512 bits (uncompressed)
Size reduction = 512/192 = 2.66
What we could have is QK=128 (or even 256?) and 16 independent fp16 values. These fp16 values form a lookup table. Each weight is quantized to 4 bits, but its value is used as a key into the lookup table (I know a lookup table can be implemented using AVX). The 16 fp16 values need to be adjusted to minimize the RMS error. Given that the weights in a block have a non-uniform distribution, such an approach could be more profitable in terms of RMS error. Size reduction = 128*16/(16*16 + 128*4) = 2.66
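A rough Python sketch of that per-block codebook idea (a few Lloyd/k-means iterations to fit the 16 fp16 entries; the function names and the fitting procedure are my own illustration, not a proposal for the ggml layout):

import numpy as np

def codebook_quantize_block(x, n_levels=16, n_iter=20):
    # Fit a 16-entry codebook to the block and store each weight as a 4-bit index.
    codebook = np.quantile(x, np.linspace(0, 1, n_levels))  # simple initialization
    for _ in range(n_iter):
        idx = np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                codebook[k] = x[idx == k].mean()
    codebook = codebook.astype(np.float16)
    idx = np.argmin(np.abs(x[:, None] - codebook[None, :].astype(np.float32)), axis=1)
    return codebook, idx.astype(np.uint8)  # 16 fp16 values + one 4-bit index per weight

def codebook_dequantize_block(codebook, idx):
    return codebook.astype(np.float32)[idx]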
I also found the Lloyd-Max algorithm, but this one creates non-uniform quantization, which is a no-go for our use case,
Any quantization method that can be evaluated efficiently works
(I know lookup table might be implemented using AVX)
This is interesting - do you know if it can be implemented with ARM NEON too?
I had some similar ideas - instead of the existing linear Q4_0 mapping of the "normally" distributed weights, make a simple transform of the data so you get a more uniform distribution and quantize that instead. The transform has to be efficient to evaluate, so some approximation of a uniform distribution would work.
Currently, the "outer" quantization bins are much less "utilized" compared to the "inner" ones (i.e. around zero). You can clearly see this during the quantize run, where we print the histograms for each tensor. With a uniform transform added, I expect the bins will get more evenly "utilized", which will probably lead to some accuracy benefits.
This is interesting - do you know if it can be implemented with ARM NEON too?
I'm not familiar with NEON, but it does have the vtbl/vtbx instructions, which permute bytes like vpshufb. I've used the latter to do in-register lookup tables.
If the weights have a normal distribution, then I believe this approach is worth trying:
Compute the mean mu and standard deviation sigma for the batch.
Set f to 2.0.
Use offset = mu - f * sigma and scaling_factor = 2 * f * sigma.
When a quantized value falls outside 0 <= val <= 15, we perform val = max(min(val, 15), 0).
That way our quantisation will cover ~95% of values for f = 2.0. For other values of f we have 99% for f = 3.0 and 68% for f = 1.0.
We can try different values of f to see which works best. Or we can even search for the best f for a given batch of data by iterating over a few values of f (between 1.0 and 3.0, for example) and computing/comparing the RMS, but this is probably not necessary and we can come up with a single fixed f that is suitable as a constant for every batch.
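A minimal Python sketch of that scheme (the /15 level mapping and the function names are my assumptions, not existing ggml code):

import numpy as np

def quantize_block_musigma(x, f=2.0):
    # Map [mu - f*sigma, mu + f*sigma] onto the 16 levels 0..15 and clamp outliers.
    mu, sigma = float(x.mean()), float(x.std())
    offset = mu - f * sigma
    scale = 2.0 * f * sigma
    if scale == 0.0:
        return offset, scale, np.zeros(len(x), dtype=np.uint8)
    val = np.round((x - offset) / scale * 15.0)
    val = np.clip(val, 0, 15).astype(np.uint8)  # val = max(min(val, 15), 0)
    return offset, scale, val

def dequantize_block_musigma(offset, scale, val):
    return offset + (val.astype(np.float32) / 15.0) * scale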
Reading the comments above - yeah, if we can efficiently implement a lookup table int8/int4->float16 using AVX/NEON, then it might be really worth trying the non-uniform approach.
I'm stuck with an older codebase, so I modified the old quantize method in utils.cpp. Basically, I'm adjusting amax by ±0.5 over 100 steps and searching for the value that yields the lowest RMS. I also added an rms_delta output parameter to log the improvements.
size_t ggml_quantize_q4_0(float * src, void * dst, int n, int k, int qk, int64_t * hist, float * rms_delta) {
    const int nb = k / qk;
    const size_t bs = (sizeof(float) + sizeof(uint8_t)*qk/2);
    const size_t row_size = nb*bs;

    assert(k % qk == 0);

    const size_t pp_size = qk / 2;
    uint8_t *pp = static_cast<uint8_t*>(alloca(pp_size));

    char * pdst = (char *) dst;

    float block[qk];
    float rms_improvement = 0;
    int rms_samples = 0;

    for (int j = 0; j < n; j += k) {
        uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
        uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + sizeof(float));

        for (int i = 0; i < nb; i++) {
            float amax = 0.0f; // absolute max
            for (int l = 0; l < qk; l++) {
                const float v = src[j + i*qk + l];
                amax = std::max(amax, fabsf(v));
                block[l] = v;
            }

            // search for best quantization factor
            float bestD = 0;
            float bestRms = 1e20;
            float defaultRms = 1e20;
            for (int attempt = 0; attempt < 100; attempt++) {
                float tweak = (attempt / 100.0) - 0.5;
                const float d = (amax + tweak) / ((1 << 3) - 1);
                const float id = d ? 1.0f/d : 0.0f;
                float rms = 0;
                for (int l = 0; l < qk; l++) {
                    const float v = block[l];
                    uint8_t vi = std::max((int)0, std::min(((int)round(v*id)) + 8, (int)15));
                    const float v_deq = (vi - 8) * d;
                    const float difference = (v - v_deq);
                    rms += difference * difference;
                }
                if (rms < bestRms) {
                    bestRms = rms;
                    bestD = d;
                }
                if (attempt == 50) { // when tweak is 0
                    defaultRms = rms;
                }
            }
            rms_improvement += sqrt(defaultRms) - sqrt(bestRms);
            rms_samples++;

            const float bestId = bestD ? 1.0f/bestD : 0.0f;

            *(float *) pd = bestD;
            pd += bs;

            for (int l = 0; l < qk; l += 2) {
                const float v0 = block[l + 0]*bestId;
                const float v1 = block[l + 1]*bestId;

                const uint8_t vi0 = std::max((int)0, std::min(((int)round(v0)) + 8, (int)15));
                const uint8_t vi1 = std::max((int)0, std::min(((int)round(v1)) + 8, (int)15));

                hist[vi0]++;
                hist[vi1]++;

                pp[l/2] = vi0 | (vi1 << 4);
            }

            memcpy(pb, pp, pp_size);
            pb += bs;
        }
    }

    *rms_delta = rms_improvement / rms_samples;

    return (n/k)*row_size;
}
This works, although the process is about 100 times slower. The output shows very marginal RMS improvements, around 0.0014 on average:
layers.31.attention.wo.weight - [ 4096, 4096], type = f16 size = 64.00 MB -> 10.00 MB, rms -= 0.0012 | hist: 0.024 0.020 0.025 0.038 0.056 0.078 0.098 0.113 0.118 0.113 0.098 0.078 0.056 0.038 0.025 0.022
layers.31.feed_forward.w1.weight - [ 4096, 11008], type = f16 size = 172.00 MB -> 26.88 MB, rms -= 0.0014 | hist: 0.026 0.020 0.025 0.038 0.056 0.077 0.098 0.112 0.118 0.112 0.098 0.077 0.056 0.038 0.025 0.022
layers.31.feed_forward.w2.weight - [11008, 4096], type = f16 size = 172.00 MB -> 26.88 MB, rms -= 0.0013 | hist: 0.026 0.019 0.024 0.036 0.054 0.076 0.099 0.117 0.124 0.117 0.099 0.076 0.054 0.036 0.024 0.021
layers.31.feed_forward.w3.weight - [ 4096, 11008], type = f16 size = 172.00 MB -> 26.88 MB, rms -= 0.0014 | hist: 0.026 0.020 0.025 0.038 0.056 0.077 0.098 0.113 0.118 0.113 0.098 0.077 0.056 0.038 0.025 0.022
The output of the 7B model is nearly identical. At least it's not broken, but I don't know if it's an improvement. This is just an experiment, so you can decide whether it's worth pursuing.
Reading the comments above - yeah, if we can efficiently implement a lookup table int8/int4->float16 using AVX/NEON, then it might be really worth trying the non-uniform approach.
For the case of int4 on AVX, you can (ab)use VPSHUFB for this fairly easily.
Let me come up with an instruction sequence for AVX2...
Maybe of interest to us: https://github.com/TimDettmers/bitsandbytes
I had some similar ideas - instead of the existing linear Q4_0 mapping of the "normally" distributed weights, make a simple transform of the data so you get a more uniform distribution and quantize that instead. The transform has to be efficient to evaluate, so some approximation of a uniform distribution would work.
Currently, the "outer" quantization bins are much less "utilized" compared to the "inner" ones (i.e. around zero). You can clearly see this during the quantize run, where we print the histograms for each tensor. With a uniform transform added, I expect the bins will get more evenly "utilized", which will probably lead to some accuracy benefits.
The CDF of a normal distribution is given by:
TF=0.5*(1 + erf((x - mean)/(sqrt(2)*sig)))
So if we compute the intermediate weights x' = TF(x) - 0.5, these will be uniformly distributed in [-0.5, 0.5].
I just ran the following patch to verify this:
diff --git a/ggml.c b/ggml.c
index c9a4e86..cc62e49 100644
--- a/ggml.c
+++ b/ggml.c
@@ -449,6 +449,8 @@ static inline __m128i packNibbles( __m256i bytes )
// blocks of QK elements
// represented with a single float (delta) and QK/2 8-bit ints (i.e QK 4-bit signed integer factors)
+#define TF(x, sig) (0.5*(1.0f + erf((x/sig)/sqrtf(2.0f))) - 0.5f)
+
// reference implementation for deterministic creation of model files
static void quantize_row_q4_0_reference(const float * restrict x, void * restrict y, int k) {
assert(k % QK == 0);
@@ -461,11 +463,17 @@ static void quantize_row_q4_0_reference(const float * restrict x, void * restric
uint8_t pp[QK/2];
+ double sig = 0.0;
+ for (int i = 0; i < k; i++) {
+ sig += x[i]*x[i];
+ }
+ sig = sqrt(sig/k);
+
for (int i = 0; i < nb; i++) {
float amax = 0.0f; // absolute max
for (int l = 0; l < QK; l++) {
- const float v = x[i*QK + l];
+ const float v = TF(x[i*QK + l], sig);
amax = MAX(amax, fabsf(v));
}
@@ -476,8 +484,8 @@ static void quantize_row_q4_0_reference(const float * restrict x, void * restric
pd += bs;
for (int l = 0; l < QK; l += 2) {
- const float v0 = x[i*QK + l + 0]*id;
- const float v1 = x[i*QK + l + 1]*id;
+ const float v0 = TF(x[i*QK + l + 0], sig)*id;
+ const float v1 = TF(x[i*QK + l + 1], sig)*id;
const uint8_t vi0 = ((int8_t) (round(v0))) + 8;
const uint8_t vi1 = ((int8_t) (round(v1))) + 8;
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
tok_embeddings.weight - [ 4096, 32000], type = f16 quantizing .. size = 500.00 MB -> 78.12 MB | hist: 0.000 0.050 0.069 0.069 0.069 0.069 0.069 0.069 0.069 0.069 0.069 0.069 0.069 0.069 0.069 0.050
norm.weight - [ 4096, 1], type = f32 size = 0.016 MB
output.weight - [ 4096, 32000], type = f16 quantizing .. size = 500.00 MB -> 78.12 MB | hist: 0.000 0.050 0.068 0.069 0.069 0.069 0.070 0.070 0.070 0.070 0.070 0.070 0.069 0.069 0.068 0.049
layers.0.attention.wq.weight - [ 4096, 4096], type = f16 quantizing .. size = 64.00 MB -> 10.00 MB | hist: 0.000 0.042 0.057 0.064 0.069 0.074 0.076 0.078 0.079 0.078 0.076 0.073 0.069 0.064 0.057 0.042
layers.0.attention.wk.weight - [ 4096, 4096], type = f16 quantizing .. size = 64.00 MB -> 10.00 MB | hist: 0.000 0.044 0.061 0.066 0.070 0.072 0.074 0.075 0.075 0.075 0.074 0.072 0.070 0.066 0.061 0.044
layers.0.attention.wv.weight - [ 4096, 4096], type = f16 quantizing .. size = 64.00 MB -> 10.00 MB | hist: 0.000 0.050 0.067 0.067 0.068 0.069 0.070 0.072 0.073 0.072 0.070 0.069 0.068 0.067 0.067 0.050
layers.0.attention.wo.weight - [ 4096, 4096], type = f16 quantizing .. size = 64.00 MB -> 10.00 MB | hist: 0.000 0.052 0.056 0.058 0.064 0.071 0.077 0.081 0.082 0.081 0.077 0.071 0.064 0.058 0.056 0.052
layers.0.feed_forward.w1.weight - [ 4096, 11008], type = f16 quantizing .. size = 172.00 MB -> 26.88 MB | hist: 0.000 0.050 0.069 0.069 0.069 0.069 0.069 0.070 0.070 0.069 0.069 0.069 0.069 0.069 0.069 0.050
layers.0.feed_forward.w2.weight - [11008, 4096], type = f16 quantizing .. size = 172.00 MB -> 26.88 MB | hist: 0.000 0.050 0.069 0.069 0.069 0.069 0.070 0.070 0.070 0.070 0.069 0.069 0.069 0.069 0.069 0.050
layers.0.feed_forward.w3.weight - [ 4096, 11008], type = f16 quantizing .. size = 172.00 MB -> 26.88 MB | hist: 0.000 0.050 0.069 0.069 0.069 0.069 0.070 0.070 0.070 0.070 0.069 0.069 0.069 0.069 0.069 0.050
layers.0.attention_norm.weight - [ 4096, 1], type = f32 size = 0.016 MB
layers.0.ffn_norm.weight - [ 4096, 1], type = f32 size = 0.016 MB
layers.1.attention.wq.weight - [ 4096, 4096], type = f16 quantizing .. size = 64.00 MB -> 10.00 MB | hist: 0.000 0.049 0.068 0.069 0.069 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.069 0.069 0.068 0.049
layers.1.attention.wk.weight - [ 4096, 4096], type = f16 quantizing .. size = 64.00 MB -> 10.00 MB | hist: 0.000 0.049 0.067 0.069 0.069 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.069 0.069 0.068 0.049
The bins are now much more evenly utilized.
In this case, we no longer need to store amax, but we have to store sig instead.
Also, I assumed that the mean is 0, which is probably a valid assumption.
The bins are now much more evenly utilized.
I am wondering how we could update the algorithm so that the first bin is also utilized; currently it is unused.
Also, I assumed that the mean is 0, which is probably a valid assumption.
Maybe we can try computing the mean too and store both mean and sigma?
Most probably storing the mean would allow us to use the first bin easily.
Basic idea for a int4->f16 lookup table on AVX2:
Low nibbles:
VPSHUFB
uses the high bit to clear bytes to zero, which we don't want.vpshufb
s for low/high half of each f16, using the input nibbles as a lookup into a constant vector.vpunpcklbw
and a vpunpckhbw
to unshuffle the low/high half of each output f16.High nibbles:
Result:
vpunpcklwd
and vpunpckhwd
to unshuffle f16s back into the original order.You can get away without the final unshuffle if you preshuffle the input int4s instead.
I'm familiar with ARM, but not so much NEON. (The perils of embedded programming.) That being said, it looks very similar at a first glance, with the following wrinkles:
It looks like you can swap vpunpck* to ZIP1/ZIP2, and vpshufb to TBL (same masking necessary, as out-of-range values turn to 0).
A quick attempt is at https://godbolt.org/z/G74oYP8nK. It's not perfect - shuffles tend to be slow (throughput of 1/2), so I suggest storing the int4 table interleaved, and I haven't actually tested this, just stared at the output assembly for a bit - but it may be good enough.
The bins are now much more evenly utilized.
I am wondering how we could update the algorithm so that the first bin is also utilized; currently it is unused.
Also, I assumed that the mean is 0, which is probably a valid assumption.
Maybe we can try computing the mean too and store both mean and sigma? Most probably storing the mean would allow us to use the first bin easily.
The reason why the first bin is not utilized is due to the following. Ignore floating-point rounding for a moment:
amax = max(abs(vals))
d = amax / 7
id = 1/d = 7 / amax
...and then something will be assigned to bin 0 if round(val*id) == -8. In other words, if
round(val*id) == -8
=> (val*id) < -7.5
=> (val*7 / amax) < -7.5
=> val < amax * (-7.5 / 7)
=> -val * (7 / 7.5) > amax
=> amax < -val * (7 / 7.5)
...but here we have a contradiction: amax >= abs(val), and so amax >= -val, yet the chain requires amax < -val * (7 / 7.5) < -val. And so bin 0 is never used.
This is because we're dealing with a signed bin, not an unsigned bin, and so the correct number is 15/2, not 7. See below.
For an optimal placement, assuming that the inputs have been transformed into a uniformish distribution with a maximum value of 1, to a first approximation you want 16 equal-width buckets covering the full range.
TL;DR: try this:
const float d = amax * (2.0f / 15.0f);
const float id2 = amax ? 8.0f/amax : 0.0f; // not 1/d!
[...]
const float v0 = TF(x[i*QK + l + 0], sig)*id2; // [-amax..amax] -> [-8..=8]
const float v1 = TF(x[i*QK + l + 1], sig)*id2; // [-amax..amax] -> [-8..=8]
// Edge case handling: if v0 == 8 (because the input value == amax exactly), then we'd end up with +16 as a result.
// Deal with it by rounding this case down to 7.
// Ditto, due to rounding abs(v0) can end up slightly larger than 8. Preemptively fix up if so.
// Any value in [7, 8) works.
const float BELOW8 = 7.99999952316f; // the largest float below 8.0f (7.0f would also work)
const float v02 = min(max(v0, -8.0f), BELOW8); // [-8..=8] -> [-8..8)
const float v12 = min(max(v1, -8.0f), BELOW8); // [-8..=8] -> [-8..8)
const uint8_t vi0 = ((int8_t) (floor(v02))) + 8; // [-8..8) -> [0..15]
const uint8_t vi1 = ((int8_t) (floor(v12))) + 8; // [-8..8) -> [0..15]
Note that d != 1/id2. This is deliberate.
(This does end up "always" shrinking the maximum or minimum value by amax/15, which isn't ideal. I don't know which is worse - shrinking the average stddev somewhat, or always shrinking the min/max value.)
Just in case this detail went unnoticed: the code I shared above that adjusts amax naturally utilizes the first bin as a side effect:
hist: 0.024 0.020 0.025 0.038 0.056 0.078 0.098 0.113 0.118 0.113 0.098 0.078 0.056 0.038 0.025 0.022
hist: 0.026 0.020 0.025 0.038 0.056 0.077 0.098 0.112 0.118 0.112 0.098 0.077 0.056 0.038 0.025 0.022
hist: 0.026 0.019 0.024 0.036 0.054 0.076 0.099 0.117 0.124 0.117 0.099 0.076 0.054 0.036 0.024 0.021
hist: 0.026 0.020 0.025 0.038 0.056 0.077 0.098 0.113 0.118 0.113 0.098 0.077 0.056 0.038 0.025 0.022
The histogram is still a bit skewed to the right, but it's much more symmetric.
I tried to run the previously mentioned Q4_1 quantization method with some number of local relaxation steps to reduce the square error (down to 83% of the naive computation's error on average), but the result did not appear to improve on perplexity for Wikitext, being within 0.01 of naive Q4_1's after 30 steps (which I argued here to be sufficient for a preliminary estimate):
[1]4.5934,[2]5.1141,[3]6.0137,[4]6.5889,[5]6.6549,[6]6.6349,[7]6.8486,[8]6.9613,[9]7.3027,[10]7.5478,[11]7.7765,[12]7.8229,[13]7.7368,[14]7.8268,[15]8.0934,[16]7.6583,[17]7.5161,[18]7.4644,[19]7.0927,[20]7.0639,[21]6.9697,[22]6.7890,[23]6.7553,[24]6.6584,[25]6.6611,[26]6.4884,[27]6.2972,[28]6.1850,[29]6.0983,[30]5.9288,[31]5.8874,[32]5.9094,[33]5.8515
Along with this, I had some lower-quality data suggesting that just throwing out 1 min and max outlier when this improved square error actually made perplexity worse (by about 0.05). My current hypothesis is that perhaps it matters more to accurately represent weights that are further away from 0, as those wind up influencing the final dot product more. I want to try only throwing away the value closest to 0 next.
Perhaps Posit arithmetic could be valuable? http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf
Also, I assumed that the mean is 0, which is probably a valid assumption.
I used the following patch to compute histograms of the model:
diff --git a/convert-pth-to-ggml.py b/convert-pth-to-ggml.py
index ccf2c57..a17a3c2 100644
--- a/convert-pth-to-ggml.py
+++ b/convert-pth-to-ggml.py
@@ -22,6 +22,9 @@ import struct
import numpy as np
import torch
+from matplotlib import pyplot as plt
+idx = 0
+
from sentencepiece import SentencePieceProcessor
def parse_args():
@@ -124,6 +127,13 @@ def process_and_write_variables(fout, model, ftype):
fout.write(sname)
# data output to file
+ hist, bins = np.histogram(data, bins=100)
+ plt.stairs(hist, bins)
+ global idx
+ plt.savefig(f"hist_{idx:08}.png")
+ plt.clf()
+ idx += 1
+
data.tofile(fout)
def main():
From quickly inspecting the results it seems that most of the layers indeed have a normal distribution around mean 0.0, but there are also around 20% of layers whose mean is != 0.0.
Attaching the zipped histograms: hist.zip
Perhaps Posit arithmetic could be valuable? http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf
Posits offer more dynamic range, at the expense of less accuracy for large numbers. If the largest weights matter the most, and they fit in an f16 (which we know they do), a posit is exactly the opposite of what we want.
From quickly inspecting the results it seems that most of the layers indeed have normal distribution around mean 0.0, but there are also around 20% of layers which have mean != 0.0.
Hm. What do those look like on a log plot? At a quick glance, those look heavier-tailed than a normal distribution.
Also, am I reading those correctly in that the max absolute weight is only ~2.5 or so? If so, a u16 scaling may actually be more accurate than a f16 scaling:
Instead of i4 * f16, try i4 * u16 / 2^17. This works for values between -4 and +3.75.
Say you're encoding a value near -2.5 as your max absolute. i4 is -8 in either scheme. With f16, your final step size is 2^-10. For u16, your step size is 2^-14.
The quantization process should attempt to optimally preserve the magnitudes of the weights first and foremost, which is easily accomplished by considering the binary representation of the original IEEE 754 half-precision floating-point (binary16) weights and using a significance coding method, which allows for combined quantization and sparsification of the weights to an arbitrary (byte-level limited only) bit-per-weight with sub-bit accuracy.
The main problem is that such an approach isn't (to the best of my knowledge) vectorizable nor fast, so I don't think it will interest you.
A binary16 float has the following format: 1 sign bit, 5 exponent bits in offset-binary, 10 explicitly stored significand precision bits.
A preliminary run through the current block of weights to be encoded would examine the 5 exponent bits only and record the highest value seen. We'd then encode, in 2 bits, the position of the first active bit in this max exponent value, counting down from the MSB. Example:
If the max exponent is 01101, we store the value "1" (01b). This allows us to know that we will need to do (4-1)=3 significance coding rounds (Note: there is no need to do significance coding for the LSB, obviously. A corner case is if the max exponent is 0000x, where we'd spend bits on a useless significance coding round, but this should not be a problem in practice).
To perform a significance coding round, we need to choose a positional encoding strategy. A simply binary partitioning scheme should work well in practice (depending on the block size).
We start with 3 lists, the significant list S (empty), the insignificant list I (contains the indexes for all the weights in this block), along with the refinement list R (empty):
S={}, I={0, 1, 2, 3,.., 31}, R={}
We now recursively divide the I list looking for weights that are significant at this level, i.e., whose 2nd MSB of the exponent bits is set, and encode this partitioning (either with DFS or BFS). Example using BFS:
Assume that the weights 3 and 11 are the only ones that are significant at this level.
We begin by dividing the I list in 2 sub lists, as I0,0={0..15} and I0,1={16..31}.
Since there are significant weights in I0,0, we code a "1" for it, and a "0" for I0,1.
We now divide I0,0 into I1,0={0..7} and I1,1={8..15}, and since both contain significant weights, we code 2 "1"'s.
If we keep proceeding like this, the final output for this significance coding round will be "1011101001010101", and the lists will look like this:
S={3, 11}, I={0, 1, 2, 4,.., 10, 12,.., 31}, R={}
We now encode the sign bits for the weights in the S list, and since the R list is still empty, we don't need to do a refinement round.
The weights in the S list are now moved to the R list, and we restart the process, this time the threshold for significance being the 3rd MSB of the exponent bits being set.
Seeing as how the R list is no longer empty, at the end of this significance coding round, we'd perform a refinement round, where we'd emit the value of the next bit in the binary representation of the weights (in this case, the 3rd MSB of the exponent) contained in it.
So after a few rounds, for the highest magnitude weights, we'd already have sent the full exponent and would then start sending the bits for the significand.
We'd proceed in this fashion until we exhaust the bit budget allowed for encoding this block of weights (say 128 bits if we want to encode 32 weights-per-block and use an average of 4 bits-per-weight. Or 80 bits would get us 2.5 bits-per-weight).
So it's possible to get fractional average bits-per-weight, and we get variable weight precision, where the most significant weights get much better precision, and insignificant ones get culled (in essence, achieving quantization and sparsification all in one go).
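For concreteness, here is a small Python sketch (my own illustration, not code from this thread) of one BFS binary-partitioning significance pass; it reproduces the "1011101001010101" bitstring from the example above:

def significance_pass_bfs(block_indices, is_significant):
    # The root list is split in two; each sub-range emits "1" if it still contains
    # a significant index, otherwise "0", and "1" ranges keep splitting (BFS order).
    bits, found = [], []
    mid = len(block_indices) // 2
    level = [block_indices[:mid], block_indices[mid:]]
    while level:
        next_level = []
        for group in level:
            flag = any(is_significant(i) for i in group)
            bits.append('1' if flag else '0')
            if flag:
                if len(group) == 1:
                    found.append(group[0])
                else:
                    m = len(group) // 2
                    next_level += [group[:m], group[m:]]
        level = next_level
    return ''.join(bits), found

# significance_pass_bfs(list(range(32)), lambda i: i in (3, 11))
# -> ('1011101001010101', [3, 11])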
This is probably more interesting as a highly compressed model storage format for transmission purposes, where the weight block dequantization would only be done once and the in-memory calculations would then use the recovered fp16 weights.
Hm. What do those look like on a log plot? At a quick glance, those look heavier-tailed than a normal distribution.
Attached are histograms where y axis is on log scale: hist-log.zip
Also, am I reading those correctly in that the max absolute weight is only ~2.5 or so?
Correct! The minimum value in the whole LLaMA-7B model is -2.492, the maximum is 2.625.
@MarcioPais very interesting. That may indeed be vectorizable if you process multiple quantization blocks at once.
What do you do if your bit budget is exhausted in the middle of a round?
encode this partitioning (either with DFS or BFS).
Another way to view this is you're just constructing a bitset of which of the entries in I are above the current threshold. Same effect, but somewhat easier to code.
So after a few rounds, for the highest magnitude weights, we'd already have sent the full exponent and would then start sending the bits for the significand.
This adds complexity, but in an ideal world you'd do this with round-to-nearest on the exponent, not round-towards-zero. If you have e.g. 3.999 (a.k.a. 2^1 with a mantissa of 0x3FF), truncated at the end of the exponent, you'd recover a value of 2, whereas 4 is better.
This is probably more interesting as a highly compressed model storage format for transmission purposes, where the weight block dequantization would only be done once and the in-memory calculations would then use the recovered fp16 weights.
Something I've been noting is that larger CPU models appear to be fairly heavily memory-bandwidth limited. It's possible that we can get this working 'well enough' to save time overall...
@prusnak - thank you!
Interesting. The layers are all over the place.
Some of it looks normalish:
(A normal distribution looks parabolic in a log plot.)
...and this one looks pretty much like a logistic distribution:
(A logistic distribution looks like a ^ in a log plot.)
...and this looks far closer to a Cauchy distribution or somesuch:
Correct! Minimum value in the whole LLaMa-7B model is -2.492, maximum is 2.625.
We're wasting bits then.
@nonnull-ca
What do you do if your bit budget is exhausted in the middle of a round?
For the significance coding and refinement rounds, you only need to check for end-of-buffer (EOB) at the end of each round; it's only when reading the sign bits that you need to check for EOB individually, and if it is hit, the corresponding weights are also discarded as insignificant.
I usually handle this by having my bit-IO routines ignore the requests (when writing) or by returning zeros (when reading) after EOB.
This adds complexity, but in an ideal world you'd do this with round-to-nearest on the exponent, not round-towards-zero.
Yes, but that is simple and only affects the performance of the quantization stage, where we'd probably want to do preprocessing anyway.
Further experiments would obviously be needed, but it's easy to think of possible optimizations:
Unary coding of the exponent. This would imply more significance coding rounds, but for weights deemed significant, their exponent would be fully specified, i.e., the refinement steps could immediately start outputting the significand bits. It would also simplify implementing a hard cutoff for exponents that we deem fully insignificant anyway, say lower than -7, for instance.
RLE relative positional encoding with variable bit-length. Here we'd directly output the distance to the next significant weight in the I list using as few bits as necessary¹. Examples:
1) As before, 3 and 11 are the indexes of the significant weights. Output "00011" (3). There are still 28 indexes left in I, {4..31}, so bit-length remains at 5 bits. Output "00110" (6). There are still 20 indexes left in I, {12..31}, so bit-length remains at 5 bits. No more significant weights, output "11111" (terminator).
2) Only the weight at index 19 is significant. Output "10011" (19). There are still 12 indexes left in I, {20..31}, so bit-length is reduced to 4 bits. No more significant weights, output "1111" (terminator).
¹ Truncated binary encoding would be more efficient, at the expense of higher complexity.
Scaling as a preprocessing step, with extra side information to allow reversing it at the dequantization stage.
Variable exponent coding strategy and/or scaling per-block, variable positional encoding strategy per-block or per-round, etc. This implies spending more time on the quantization stage, but ensures even better results by exploring the space of possible encoding tools available.
Some observations about the current output:
The way this block-based quantization scheme works is by picking a scale for groups of, say, 32 weights. A natural question then is "what is the distribution of the scale values?" Well, let's assume that we're picking weights at random from a gaussian with mean 0 and stddev of σ. Your scale is going to be, roughly, 1/8th of the max value in a block[^1]. This is the largest order statistic, which is... complicated in general, but we can fairly easily Monte-carlo it.
For a stddev of 1 and QK=32, I get the following expected distribution of scales:
Note log/log scale. This is pretty close to a log-normal distribution with a (log) mean of -1.1668 and stddev of 0.1928, which I plotted in red.
I grabbed all the scale values (across all layers) of 65B and plotted them as a lin/log:
You can see that most scales are near ~0.0045 or so, but with a tail, to put it mildly. This makes sense - we're summing a bunch of different distributions, some of which have heavy tails.
[^1]: The actual figure is closer to max(v/7 if v>0 else -v/8 for v in vals), although not quite.
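A quick way to reproduce that Monte Carlo estimate (my own sketch; the exact numbers will vary a bit from run to run, but should land near the log-normal fit quoted above):

import numpy as np

# Monte Carlo the distribution of per-block scales for gaussian weights
# (QK=32, stddev 1), using the scale rule from the footnote:
# max(v/7 if v > 0 else -v/8 for v in block).
rng = np.random.default_rng(0)
QK, n_blocks = 32, 200_000
blocks = rng.normal(0.0, 1.0, size=(n_blocks, QK))
scales = np.maximum(blocks / 7.0, -blocks / 8.0).max(axis=1)

log_scales = np.log(scales)
print(f"log mean {log_scales.mean():.4f}, log std {log_scales.std():.4f}")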
Alright, here's an entirely different approach:
Other things to check:
Something that would be useful for "someone" to do:
Run LLaMA on, say, the wikipedia text that we're using for perplexity, but including the gradient calculation. Keep track of mean((dPerplexity/dWeight)^2) for each weight across the text, and plot that versus the weight value for each layer.
Why? I don't know if relative or absolute error matters more for weights. (Or even something more complex). There's arguments to be made for both.
Maybe someone can come up with a better grid search?
I've implemented a basic optimal[^1] search locally. Essentially just a simple αβ search over i4s, with a little rearranging to avoid divisions and a few heuristics that seem to help speed. It's still very slow, but appears to have a ~14% improvement to RMS error over the naive approach[^2] when compressing a random sample of 1600 values from a gaussian (mean=0, stddev=0.1):
(This is RMS error of optimal search / RMS error of naive approach.)
Unfortunately, I'm only getting ~2000 batches per second (~64k weights / second) with it, so quantizing the entire thing would take... 11 days or so?
This is actually survivable if the thing can be multithreaded[^3] - and indeed this is essentially trivially parallelizable - but someone would need to rewrite the framework to be able to do so. Right now it streams everything sequentially.
(Looking at cachegrind, I also suspect a factor of 2 or so can be shaved off by moving from a recursive to an iterative approach, as there's a lot of redundant work and function call overhead currently, but meh. That transformation is annoying to do when you have a recursive call in the middle of a loop.)
[^1]: Rounding errors on the scale coefficient notwithstanding.
[^2]: Well really, over the slightly-less-naive approach, which flips the sign of everything if abs(min) > abs(max).
[^3]: On 32 cores, this would be 8.5 hours or so.
Is there a simple guide or explanation about the differences between q4_0 and q4_1? I understand the latter is slower, but is it actually better and why/by how much?
Edit: Never mind, I found it. https://github.com/ggerganov/ggml/pull/27
Where do we stand with GPTQ? I see there is now a convert script convert-gptq-to-ggml.py by @comex, but isn't it more effective to just use GPTQ instead of Q4_1 or even Q4_0? Has anyone done some benchmarks?
Did you have chance to look at GPTQ, @ggerganov?
@prusnak
My latest impression on the topic (I think based on @blackhole89's and @comex's comments) is that GPTQ is equivalent to Q4_1 data storage in the sense that it has floating-point offset and scale factors + 4-bit quants. The difference between GPTQ and Q4_1 is the way that one computes the numbers, but the storage of the numbers is the same. Therefore, one can convert GPTQ models into Q4_1 models and ggml will effectively gain support for GPTQ this way.
I haven't had the time to look into the details. But it would be very interesting to see perplexity numbers for GPTQ using ggml.
Not sure where I should be noting this, but I'm finding that q4_1 is much worse than q4_0 for a one-shot summarization task.
Here is the prompt I've been experimenting with (spoiler alerts for the 1965 novel Stoner; not using for any particular reason except that it's on my coffee table and not overly obscure):
Stoner, by John Williams.
# Brief Summary
The
Here's the q4_0 generation. There are errors (e.g., it's not written in the first person) but it basically gets the gist. I can't run the 30B model with fp16 locally, but based on what I've seen for 13B, q4_0 doesn't perceptibly degrade things from half-precision (at least for this task).
$ ./main -m ./models/llama/llama-30b-hf/consolidated/ggml-model-q4_0.bin --color -f ./prompts/summary.txt --temp 0.1 --top_p 0.75 --top_k 40 --repeat_penalty 1.1 -b 1 -t 11
# omitting preamble for brevity
The novel is a first-person narrative of the life of William Stoner, an English professor who teaches at the same Midwestern university from which he graduated. Stoner marries the daughter of a farmer, but finds her aloof and cold. The marriage produces a daughter, Katherine, but Stoner remains distant to both his wife and child.
Stoner's career is undistinguished but not unsuccessful; he teaches the required survey courses year after year, but his colleagues recognize his intellectual dedication by making him head of the English Department toward the end of
Here's q4_1. There is just no comparison.
The book is about a boy named Johnny who lives in the city of New York. He has a dog named Lulu and he goes to school at Public School 101.
## Synopsis
Johnny's father is a policeman, and his mother is a teacher. Johnny's grandfather is a fireman, and his grandmother is a nurse. Johnny's uncle is an electrician, and his aunt is a secretary.
Again, these are for LLaMA 30B, but I'm getting the same results for 13B and Alpaca (w/ lora). Original weights were fp16.
Does it make a difference if you go from f16 vs f32 to q4_0 vs q4_1? There are 4 possible choices!
With the current block size of 32, q4_0 is in practice a 5-bit quantization, and q4_1 a 6-bit quantization. With that in mind we could consider a q5_0 encoding similar to q4_0 that would be the same size as (or smaller than) q4_1. I see some promising signs that 5 bits + magnitude with linear scaling, similar to q4_0, could outperform q4_1, but I would like to explore non-linear distributions as well.
One interesting thing about q4_1 is that it will sometimes flip the sign of the weight... Intuitively that might be a worse error than similar-magnitude errors that keep the sign. But in practice q4_1 beats q4_0 in perplexity, so maybe not?
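For reference, the effective bits-per-weight behind those numbers (a quick back-of-the-envelope check; the q5_0 layouts are hypothetical):

# Effective bits per weight for a block of QK weights, assuming fp32 per-block
# parameters as in the existing q4_0/q4_1 layouts.
QK = 32
q4_0 = (QK * 4 + 32) / QK        # 4-bit quants + 1 float scale        -> 5.0
q4_1 = (QK * 4 + 2 * 32) / QK    # 4-bit quants + float scale and min  -> 6.0
q5_0 = (QK * 5 + 32) / QK        # hypothetical 5-bit variant          -> 6.0
q5_0_f16 = (QK * 5 + 16) / QK    # the same with an fp16 scale         -> 5.5
print(q4_0, q4_1, q5_0, q5_0_f16)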
With the current block size of 32, q4_0 is in practice a 5-bit quantization, and q4_1 a 6-bit quantization. With that in mind we could consider a q5_0 encoding similar to q4_0 that would be the same size as (or smaller than) q4_1. I see some promising signs that 5 bits + magnitude with linear scaling, similar to q4_0, could outperform q4_1, but I would like to explore non-linear distributions as well.
One interesting thing about q4_1 is that it will sometimes flip the sign of the weight... Intuitively that might be a worse error than similar-magnitude errors that keep the sign. But in practice q4_1 beats q4_0 in perplexity, so maybe not?
Additionally, the current recommended setting for GPTQ amounts to 4.16-bit quantization. You can simply count the bits with the following code.
in_dim = 8192
intermediate_dim = 22016
bit = 4
groupsize = 128
# setting

def get_bit(indim, outdim, bit, groupsize):
    q_weight = (indim // 32 * bit) * (outdim) * (32)
    q_zeros = (indim // groupsize) * (outdim // 32 * bit) * (32)
    scales = (indim // groupsize) * (outdim) * (16)
    g_idx = (indim) * (32)
    weight = (indim) * (outdim) * (16)
    total_bit = ((q_weight + q_zeros + scales + g_idx) / weight) * 16
    return total_bit

total_bit = 0
total_bit += get_bit(in_dim, in_dim, bit, groupsize) * 4            # q, k, v, o
total_bit += get_bit(in_dim, intermediate_dim, bit, groupsize) * 2  # gate, up
total_bit += get_bit(intermediate_dim, in_dim, bit, groupsize) * 1  # down
total_bit /= 7
print(total_bit)
I have a silly implementation at https://github.com/unbounded/llama.cpp/blob/q4-q-harder/ggml.c#L562-L687 that I believe calculates the actual RMSE-optimal scaling factor for each block, maybe to some limit of numerical precision errors, or whatever bugs I've missed.
It does this by lining up the "interesting" scaling values where quantization changes and finding the local optimal score analytically for each of ~512 resulting configurations.
Not suggesting this for actual use as it is very slow, and we can get 99.9% of the way there by smarter approaches like #835. But consider it a contribution to the problem stated in the OP :)
Seems to be some actual papers on the subject like https://openaccess.thecvf.com/content/CVPR2021/papers/Idelbayev_Optimal_Quantization_Using_Scaled_Codebook_CVPR_2021_paper.pdf which makes some similar observations but uses a different approach, maybe better suited to bigger blocks/bit sizes.
@unbounded: I have been playing around with your implementation; as it is, it's probably too slow, but it could maybe be made faster.
I think if you ignore the sign of input values, i.e. convert them all to negative numbers, you could then halve the shape table:
const float shape[8] = {-7, -6, -5, -4, -3, -2, -1, 0};
// or
const float shape[9] = {-8, -7, -6, -5, -4, -3, -2, -1, 0};
But neither is quite as good, as you're either not using -8 or you have the case where you need to clamp a +8 to a +7 when you fix the qi's to match the actual sign of x[i].
Also, if you were to sort x[i] before creating the event list, you could maybe take some work out of the sorting required afterwards, but I haven't looked into that.
After twiddling some more with #835 and @unbounded's code, I'm starting to doubt this is the right way: optimizing the scaling factor for RMSE, then checking perplexity to see if we haven't regressed.
We may be falling victim to the streetlight effect - optimizing for quantization RMSE just because it's fast and easy to measure, when really the better approach might be to search for the global scaling factor that achieves the lowest perplexity. This may not be 7 as currently, or 8 as in #729. It could be a larger value; clipping the maximum may be a worthwhile trade-off. (Being clever with the sign of the scaling value, as in #729, is probably a good idea anyway.)
Hey guys, what is the official name for the compression method used for Q4?
Page 7 of this paper has a relatively efficient iterative procedure based on discrete calculus to find the scale factor with the minimum RMS quantisation error. They use it for an FPGA-friendly two-step log quantiser, but the math should work for a scaled uniform quantiser too.
https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136710657.pdf
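I haven't reproduced the paper's exact procedure, but the general shape of such an iterative scale refinement is roughly this (an alternating-minimization sketch of my own, not the paper's method): fix the scale and pick the best integer levels, then fix the levels and solve the closed-form least-squares scale, and repeat:

import numpy as np

def refine_scale(x, d0, n_iter=10):
    # With the integer levels q fixed, the least-squares scale is sum(x*q) / sum(q*q).
    d = d0
    for _ in range(n_iter):
        q = np.clip(np.round(x / d), -8, 7)
        denom = float(np.sum(q * q))
        if denom == 0.0:
            break
        d = float(np.sum(x * q)) / denom
    return d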
I've finally had the time to implement some of the ideas I mentioned previously, and though it may be of purely academic interest, I'd like to share some results.
I implemented 3 types of significance coding strategies for unary exponent coding:
Until a pre-determined percentage of weights are deemed significant, 4 options (delta, bitset and binary partitioning with 2 different thresholds) are tried and the best one is chosen. Afterwards, the bitset encoding strategy is always used.
Testing on the 7B model, for the 65 small layers with 4096 weights that don't really follow the same weight distribution, an optional preprocessing step that performs variable rounding was also implemented. Results below:
Bits-per-weight | Method | RMSE | MaxAbsErr | Coverage |
---|---|---|---|---|
6 | q4_1 | 0.001782 | 0.127563 | 100% |
6 | q5_1 (ikawrakow) | 0.000761 | 0.052734 | 100% |
6 | this, no preprocessing | 0.000554 | 0.062500 | 95.96% |
6 | this, preprocessed | 0.000573 | 0.031250 | 95.96% |
5 | q4_0 | 0.002218 | 0.142578 | 100% |
5 | q4_0 (ikawrakow) | 0.001592 | 0.174804 | 100% |
5 | this, no preprocessing | 0.001111 | 0.125000 | 91.96% |
5 | this, preprocessed | 0.001126 | 0.062500 | 91.96% |
4 | q3_0 (sw) | 0.003526 | 0.378662 | 100% |
4 | this, no preprocessing | 0.002248 | 0.250000 | 83.95% |
4 | this, preprocessed | 0.002262 | 0.125000 | 83.95% |
3 | q2_0 (sw) | 0.007391 | 0.873046 | 100% |
3 | this, no preprocessing | 0.004663 | 0.415039 | 68.17% |
3 | this, preprocessed | 0.004677 | 0.249023 | 68.17% |
2.5 | this, no preprocessing | 0.006663 | 0.499756 | 54.43% |
2.5 | this, preprocessed | 0.006684 | 0.437256 | 54.43% |
2 | this, no preprocessing | 0.009402 | 1.849609 | 42.87% |
2 | this, preprocessed | 0.009431 | 1.748047 | 42.87% |
As we go down in average bits-per-weight, we see that even though as expected RMSE scales almost perfectly, the maximum absolute error explodes rather quickly, influenced by those small layers. If we choose to skip quantizing them, the results are much better:
Bits-per-weight | Method | RMSE | MaxAbsErr |
---|---|---|---|
2.5 | this, no preprocessing | 0.006648 | 0.246338 |
2 | this, no preprocessing | 0.009373 | 0.362305 |
Now, obviously, the most interesting aspect of this approach is not using it in CBR mode, but instead to use a VBR mode where the encoder stops whenever a certain metric is achieved. Possible useful metrics to try are RMSE, maximum and mean absolute errors, all divided by the range for each layer. Assuming a metric that translates well to perplexity degradation is used, that would allow us to get the smallest possible model size that still retains the quality we want.
The obvious elephant in the room here is that this encoding would have to be decoded and stored in memory in a custom sparse-capable format, so in practice it would probably only be useful with a very sparse and heavily quantized 65B model.
For future reference, doing a lossless encoding of the 7B model with this method requires an average of 13.76 bits-per-weight, so it is better than using the usual general-purpose compression algorithms, and only slightly behind a simple context-mixing compression algorithm.
Now that the significance coding method (SCM) provides a good baseline for what should be achievable, I ran a few experiments to see how close to it I could get with a simple q4_1-like encoding.
The range of values per block can be encoded in 10 bits (5 bits for the exponent and 5 bits for the mantissa) and the minimum value per block can be encoded in 12 bits (1 sign bit, 5 exponent bits, 6 mantissa bits).
That leaves us 2 bits, that I'm using to index into a lookup table of quantizer step mappings, so that we can pick the one that either provides the smallest squared error or the smallest absolute error. The first entry in the LUT is the original linear mapping, and then I'm using 3 different logit-like mappings:
{0, 1/15, 2/15, 3/15, 4/15, 5/15, 6/15, 7/15, 8/15, 9/15, 10/15, 11/15, 12/15, 13/15, 14/15, 1},
{0, 0.078923, 0.153184, 0.223077, 0.289074, 0.351819, 0.412110, 0.470876, 0.529124, 0.587890, 0.648181, 0.710926, 0.776923, 0.846816, 0.921077, 1},
{0, 0.083363, 0.160686, 0.232127, 0.298135, 0.359472, 0.417205, 0.472662, 0.527338, 0.582795, 0.640528, 0.701865, 0.767873, 0.839314, 0.916637, 1},
{0, 0.090349, 0.172875, 0.247260, 0.313666, 0.372851, 0.426236, 0.475849, 0.524151, 0.573764, 0.627149, 0.686334, 0.752740, 0.827125, 0.909651, 1}
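A rough Python illustration of how a block could be quantized against those mappings (my own sketch of the idea, not the author's implementation; the min/range normalization and the squared-error selection are assumptions):

import numpy as np

# The four step mappings from above, as fractions of the block's range.
MAPPINGS = np.array([
    [i / 15 for i in range(16)],
    [0, 0.078923, 0.153184, 0.223077, 0.289074, 0.351819, 0.412110, 0.470876,
     0.529124, 0.587890, 0.648181, 0.710926, 0.776923, 0.846816, 0.921077, 1],
    [0, 0.083363, 0.160686, 0.232127, 0.298135, 0.359472, 0.417205, 0.472662,
     0.527338, 0.582795, 0.640528, 0.701865, 0.767873, 0.839314, 0.916637, 1],
    [0, 0.090349, 0.172875, 0.247260, 0.313666, 0.372851, 0.426236, 0.475849,
     0.524151, 0.573764, 0.627149, 0.686334, 0.752740, 0.827125, 0.909651, 1],
])

def quantize_block_nlm(x):
    # Normalize the block to [0, 1], try each mapping, keep the one with the
    # lowest squared error; store (min, range, 2-bit mapping index, 4-bit codes).
    lo, rng = float(x.min()), float(x.max() - x.min())
    t = (x - lo) / rng if rng > 0 else np.zeros_like(x)
    best = None
    for m, levels in enumerate(MAPPINGS):
        idx = np.argmin(np.abs(t[:, None] - levels[None, :]), axis=1)
        err = float(np.sum((t - levels[idx]) ** 2))
        if best is None or err < best[0]:
            best = (err, m, idx.astype(np.uint8))
    return lo, rng, best[1], best[2]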
At the default block size (QK) of 32, this works out to 4.75 bits-per-weight (bpw). Now for some quick results, when optimizing the mapping choice for RMSE:
Bits-per-weight | Method | RMSE | MaxAbsErr |
---|---|---|---|
5 | q4_0 | 0.002218 | 0.142578 |
5 | q4_0 (ikawrakow) | 0.001592 | 0.174804 |
4.75 | q4_2 | 0.001577 | 0.118164 |
4.75 | significance coding | 0.001354 | 0.062500 |
4.375 | q4_2 (QK=64) | 0.001827 | 0.127831 |
4.375 | significance coding | 0.001784 | 0.062500 |
4.1875 | q4_2 (QK=128) | 0.002036 | 0.139648 |
4.1875 | significance coding | 0.002008 | 0.125000 |
At 4.75 bpw, the RMSE is ever so slightly better than the improved q4_0 method with 2 fp16 at 5 bpw, and significantly outperforms it in terms of MAE. However, it is still far from the result of SCM, until we start increasing the block size. At QK = 128, it still performs better than the original Q4_0 method, and is much closer to SCM.
Crucially, when compared to SCM, dequantization is much faster, and the LUT size can be cut in half by making use of the symmetry in the mappings.
@MarcioPais
Thank you for the detailed analysis. A few notes:
Testing on the 7B model, for the 65 small layers with 4096 weights that don't really follow the same weight distribution
The 1D normalization layers (i.e. layers.X.ffn_norm.weight, layers.X.attention_norm.weight, norm.weight) can easily remain in F32 format. There is no point in quantizing those.
The first entry in the LUT is the original linear mapping, and then I'm using 3 different logit-like mappings
How often do we end up choosing either of the 4 mappings?
I will need some time to get into all the details provided here (and also from other users), but I like the various ideas that are being generated. Just an update, the short-term plan is to try to implement #995 efficiently. After that, we can try to apply some RMSE optimizing strategy to additionally bring the perplexity down.
@ggerganov
How often do we end up choosing either of the 4 mappings?
QK | linear | map1 | map2 | map3 |
---|---|---|---|---|
32 | 24.05% | 22.75% | 19.83% | 33.37% |
64 | 15.03% | 22.09% | 21.61% | 41.27% |
128 | 6.01% | 18.29% | 22.27% | 53.43% |
As the block size increases, we're getting a better approximation of the distribution, and hence the most "skewed" mapping (map3) is increasingly favored, and the "backup" linear mapping is almost never used.
However, it's important to note that very skewed mappings may not always be better, especially if MAE is a main factor for perplexity. Here's a run at QK=128 with the most "conservative" non-linear mapping (map1) replaced with an even more skewed mapping than map3:
QK | linear | map4 | map2 | map3 | RMSE | MAE |
---|---|---|---|---|---|---|
128 | 4.69% | 26.37% | 28.33% | 40.61% | 0.002017 | 0.158203 |
Here, even if we get a very small improvement to RMSE (0.002036 => 0.002017), the MAE increases by a non-negligible amount (0.139648 => 0.158203).
It should also be considered that until proper perplexity measurements taken in controlled, reproducible runs are available, comparing on RMSE and/or MAE alone might not reflect the characteristics of all the different quantization strategies proposed. For instance, SCM doesn't necessarily encode a representation for every weight, so for all of those, their contribution to RMSE is probably bigger than in the other simpler methods that use a fixed bpw. But it's well known in the literature that for very large models, sparsification to a significant degree is usually possible and at low levels can even improve the results, so the difference in perplexity for different methods at similar RMSE may be higher than expected.
I ran some perplexity tests on SCM and the non-linear mapping quantization method (NLM), here are some results:
Bits-per-weight | Method | Perplexity |
---|---|---|
16 | FP16 (Reference) | 5.9565 |
6+ | q4_3 RMSE-optimized (#1106) + FP16 output tensor | 6.0085 |
6 | q4_1 | 6.0936 |
6 | q4_3 | 6.0617 |
6 | q4_3 RMSE-optimized (#1106) | 6.0344 |
6 | SCM | 5.9681 |
Bits-per-weight | Method | Perplexity |
---|---|---|
5 | q4_0 | 6.2103 |
5 | q4_2 | 6.1698 |
5 | SCM | 6.1347 |
4.75 | NLM | 6.1014 |
Bits-per-weight | Method | Perplexity | Perplexity at chunk # |
---|---|---|---|
4 | q3_0 (#1004) | ? | [1]4.8166, [2]5.2200, [3]6.1143 |
4 | SCM | 6.3334 | [1]4.4865, [2]5.0649, [3]5.9553 |
Bits-per-weight | Method | Perplexity |
---|---|---|
3 | q2_0 (#1004) | 12.6438 |
3 | SCM | 7.8300 |
Currently, in Q4_0 quantization we choose the scaling factor for each group of 32 weights as max(abs(x_i))/7. It is easy to see that this is suboptimal.
Consider quantization of the following 4 numbers:
0.1 0.2 0.3 0.6
Currently, we would determine a scaling factor of 0.6 / 7 ~= 0.0857 and the dequantized numbers will be:
0.0857 0.1714 0.3428 0.6
So the RMS between the dequantized and original values will be non-zero:
sqrt((0.1 - 0.0857)^2 + (0.2 - 0.1714)^2 + (0.3 - 0.3428)^2 + (0.6 - 0.6)^2) > 0.0
However, if we choose the scaling factor to be 0.1 instead, then it is easy to see that the original numbers will be quantized perfectly.
So the scaling factor is better chosen as the one that minimises some error (e.g. RMS, or whatever is more meaningful and easy to compute). Doing that, we will certainly achieve better accuracy compared to the existing approach. The question is - how much better?
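A quick numerical check of this example (my own sketch, using a simple symmetric round trip):

import numpy as np

def q4_0_rms(x, d):
    # Round-trip x through 4-bit symmetric quantization with scale d and return the RMS error.
    q = np.clip(np.round(x / d), -8, 7)
    return float(np.sqrt(np.mean((x - q * d) ** 2)))

x = np.array([0.1, 0.2, 0.3, 0.6])
print(q4_0_rms(x, 0.6 / 7))  # max/7 scale: non-zero error
print(q4_0_rms(x, 0.1))      # better scale for this block: 0.0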
The goal of this task is to implement the quantization described above and evaluate the perplexity using the new approach. In simple terms, the approach boils down to making a linear regression of the data with a fixed zero point. This new quantization might be a bit heavier to compute compared to Q4_0, so for a start we can do it just on the model tensors. The intermediate tensors during the evaluation can remain quantized using the existing approach, so that the evaluation is efficient. If the results look promising, we can put effort into optimising the new approach and replacing Q4_0 completely with it.
Whoever demonstrates the results of this quantization will get the chance to give it a name and publish a paper (just kidding 😆)
A similar strategy for determining the scale factor and offset factor can be applied to Q4_1.