Hmm is this the same issue as #13519 er sorry I meant #13350.
OK I managed to run knnPerfTest.py from luceneutil, using mpnet vectors (768 dims), and I think I am also seeing horrific performance for int8 but OK for int4 and int7:
| quantizedBits | recall | latency | nDoc | fanout | maxConn | beamWidth | visited | index ms | selectivity | filterType |
|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 0.983 | 2.55 | 250000 | 20 | 64 | 250 | 9294 | 66542 | 1.00 | post-filter |
| 4 | 0.645 | 2.13 | 250000 | 20 | 64 | 250 | 13010 | 79605 | 1.00 | post-filter |
| 7 | 0.943 | 1.87 | 250000 | 20 | 64 | 250 | 11775 | 78806 | 1.00 | post-filter |
| 8 | 0.002 | 2.81 | 250000 | 20 | 64 | 250 | 23879 | 79177 | 1.00 | post-filter |
NOTE: this is my first time successfully running knnPerfTest.py so it's entirely possible I messed something up! But given that I'm seeing decent recall for unquantized (32 bit) and 7 bit, I think the 8 bit result is believable and horrible.
Thank you @naveentatikonda for your investigation here! I am busy with some other things, but I want to circle back here and figure out how we can make int8 better.
Thank you @naveentatikonda for your investigation here!
+1!
I also ran luceneutil's knnPerfTest.py with the cohere-768-IP dataset (linked on #13350) and int8 also looks buggy:
| quantizedBits | recall | latency | nDoc | fanout | maxConn | beamWidth | visited | index ms | selectivity | filterType |
|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 0.957 | 12.81 | 1000000 | 20 | 64 | 250 | 28918 | 647759 | 1.00 | post-filter |
| 4 | 0.517 | 7.24 | 1000000 | 20 | 64 | 250 | 23293 | 1114730 | 1.00 | post-filter |
| 7 | 0.876 | 6.42 | 1000000 | 20 | 64 | 250 | 20291 | 1034489 | 1.00 | post-filter |
| 8 | 0.006 | 12.99 | 1000000 | 20 | 64 | 250 | 43487 | 1214556 | 1.00 | post-filter |
@naveentatikonda it looks like you have some changes to Lucene's sources that might fix the buggy quantization? Could you maybe open a PR, or post a git diff output here or so? Quantized math is hard!
Hmm is this the same issue as ~#13519~ er sorry I meant #13350.
@mikemccand First of all, sorry for the confusion: this issue (#13350) was actually created for int7 SQ using Inner Product. There was some misunderstanding; looking at this PR (#12582) I thought int8 was released in Lucene 9.10, but later realized it was 7 bits. That issue is resolved after using the default confidence interval instead of 0.9. Will update and close that issue.
But this issue exists with 8 bits for all the space types (tested on L2, inner product, and cosine) and with all the confidence intervals (tested with 0.9, 1.0, default, 0).
@naveentatikonda do you have some ideas on how to fix the 8-bit quantization? At least for L2 I see you suggested a change. Can you raise a small git diff to show how this can be fixed? You can add the fix for L2 first, and then the IP fix can be added iteratively.
I also ran luceneutil's knnPerfTest.py with the cohere-768-IP dataset (linked on #13350) and int8 also looks buggy: (results quoted above)
@naveentatikonda it looks like you have some changes to Lucene's sources that might fix the buggy quantization? Could you maybe open a PR, or post a git diff output here or so? Quantized math is hard!
Thanks for the confirmation. I still don't have a solution which works out of the box for all of 4, 7, or 8 bits and for all space types; still working with my data science team (@milindshyani). The pseudo code I shared above works only for L2 with 8 bits; it was added just to validate the root cause. Will update if we find anything interesting that would solve the issue.
@mikemccand As you mentioned, I created a draft PR (#13526) which will at least fix L2 for 8 bits without breaking the existing functionality. We can work on this together and come up with a better solution which will fix all the space types. Thanks!
@benwtrent Can you please help me understand the following:
1. In terms of quantization, are we doing any extra processing for 4 and 7 bits when compared to 8 bits?
2. For 7 bits, how are we reducing memory usage compared to 8 bits? Are we doing any extra compression somewhere? Am I missing something?
3. For 4 bits, must we set the compress flag to True to reduce the memory usage by about 50% (theoretically) compared to 8 bits?
4. Sorry, might be a dumb question. Does compress only work for 4 bits?
I was also curious about this compress option -- it is javadoc'd here and if I'm reading it correctly, yes, I think it only has an effect with int4 (and maybe in the future if we get to int2, int1, int0.5, etc., it will as well).
I think this applies to the quantized vectors, which are (offheap) hot during searching. So if you pass compress=true, those quantized vectors will take 50% of the (offheap) hot RAM as compressed=false, in exchange for some added CPU cost. But note that the HNSW graph is also hot during searching ... not sure how much RAM (I think also off-heap?) it will need vs the vectors, so you won't see fully 50% less hot RAM overall.
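For a rough sense of scale (my arithmetic, ignoring per-vector corrective offsets and the graph itself): a 768-dim float32 vector takes 3072 bytes; quantized to int8 or int7 it takes 768 bytes; and int4 with compress=true packs two values per byte, so 384 bytes, i.e. half the quantized hot RAM of the int8 case.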
My concern for 8 bit quantization is the algebraic expansion of dot-product and the corrective terms.
For scalar quantization, the score corrections for dotProduct are derivable via some simple algebra, but I am not immediately aware of a way to handle the sign switch. I didn't bother digging deeper there as int7 provides basically the exact same recall. I am eager to see if 8bit can be applied while keeping the score corrections.
In case you need it, here is valuable background:
https://www.elastic.co/search-labs/blog/scalar-quantization-101
For some background on the small additional correction provided for int4 (or any scalar quantization where confidence_interval is set to 0):
https://www.elastic.co/search-labs/blog/vector-db-optimized-scalar-quantization
Let me see if I can answer all the other questions (sorry if I missed any; this is the 2nd thread related to scalar quantization and I might be conflating different things).
In terms of quantization, are we doing any extra processing for 4 and 7 bits when compared to 8 bits?

I believe not.
Typically not. But int4 honestly needs dynamic confidence intervals to work; you cannot statically set the confidence interval if you want good recall without a ton of oversampling. Setting the confidence_interval to 0 is an indication that you want the quantiles to be dynamically calculated (not statically calculated via some confidence interval).
For 7 bits, how are we reducing memory usage compared to 8 bits? Are we doing any extra compression somewhere? Am I missing something?
No, we are not. There are nice SIMD performance properties for dot-product, but similar nice properties can be applied if the signed byte limits are between -127 and 127.
For 4 bits, must we set the compress flag to True to reduce the memory usage by about 50% (theoretically) compared to 8 bits?
Correct.
Might be a dumb question. Does compress only work for 4 bits?
The only dumb question is an unasked one.
Correct. Only for 4 bits. I seriously considered always compressing (thus there being no parameter), but the performance hit was too significant. I was able to close it significantly over time through targeted Panama Vector APIs. I hope that we can move away from having an "uncompressed" version once we get off-heap scoring for quantized vectors. I have a draft PR for this, but I am running into weird perf issues and haven't yet been able to dig more deeply.
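To make the compression concrete, here is a minimal sketch of packing two 4-bit values into one byte (illustrative only; Lucene's actual on-disk layout may pair values differently):

```java
// Pack two int4 values (each in 0..15) into one byte, then unpack them.
static byte pack(int hi, int lo) {
  return (byte) ((hi << 4) | (lo & 0x0F));
}

static int unpackHi(byte packed) {
  return (packed >> 4) & 0x0F;
}

static int unpackLo(byte packed) {
  return packed & 0x0F;
}
```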
I think this applies to the quantized vectors, which are (offheap) hot during searching.
Absolutely correct.
not sure how much RAM (I think also off-heap?) it will need vs the vectors
Way way less. The main cost is the vectors themselves. The graph is way smaller (we do delta variable encoding for the neighbors). The graph size (per layer) is:
- an int per vector in that layer
- an int and its delta & variably encoded neighbors

This obviously changes based on the number of connections configured. Consider the WORST case (where the delta & variable encoding does NOTHING): the base layer would have 32 connections, so that is 33 * 4 bytes per vector for the base layer of the graph, if it's fully connected. This is way less than the vectors, as vector dimensions are usually many hundreds (384 is the smallest performant model I have found, e5small).
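Concretely: 33 * 4 = 132 bytes per vector for that worst-case base layer, versus 768 bytes for a 768-dim int8-quantized vector (3072 bytes in float32), so the graph is a small fraction of the hot working set.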
@benwtrent Thanks for answering all my questions. Thinking from the customer POV, is there any advantage of using 7 bits over 8 bits, especially any improvement in performance? One thing on the positive side, as you mentioned, is that there is no sign-switch issue while using dot product. But on the downside there will be an impact on recall, because with 7 bits we are quantizing data into a smaller range compared to 8 bits.
@naveentatikonda there is indeed a small recall reduction.
But I would need to dig further on how to do signed 8 bit while maintaining the expected scoring.
Do you know how to do the algebraic expansion for signed 8 bit?
I do not want customers to get dramatically different scores between their raw vectors and their scalar quantized vectors.
Do you know how to do the algebraic expansion for signed 8 bit?
Sorry, I don't know. @milindshyani do you know how to do it?
The point of the algebra is to ensure we end up not with just similar rankings, but also similar scores for quantized vectors.
For Euclidean, many terms cancel out and we just end up with the euclidean difference of the quantized values.
For dot product, we have additional values that relate to the quantiles utilized and the dimensional values of the vectors themselves.
I would imagine that the formulae are not qualitatively different, since signed vs unsigned are related by a linear transformation, i.e., x_new = 2*x_old - 127 -- (1).
If x_old has a range between 0 and 127, then x_new has a range between -127 and 127. We can still use the correction from the (very well written) blog,

Correction(x_i, x_j) = \alpha^2 x_i x_j + \alpha x_i min + \alpha x_j min + min^2 --- (2)

but substitute x_old = (x_new + 127)/2 in (2) to obtain the correction in terms of x_new (i.e. signed numbers). What do you guys think?
The only way to find out is to test it. I don't see how your suggestion would work without trying it out.
It's better to think about what it would be in the unsigned byte case. A scale from 0 to 255 is what int8 quantization naturally results in. Then to make it signed, you have a change to -127 to 127.

I don't immediately see how to adjust the correction to handle the switch from the 0 to 255 range to the -127 to 127 range.
@naveentatikonda any solution should support max-inner-product; any testing with normalized vectors & dot-product might result in false assurance. If you are using luceneutil, I would recommend using some older cohere embeddings (with their model from before CohereV3), and using -metric mip.
My bad, the correction I proposed was handling the map from 0 to 127 onto -127 to 127, which does not make sense (unless you are on the real line, which is what tripped me).
I need to know more about signed representation (Two's vs Ones' complement), but can't we map 0 - 255 to -127 to 128 using shifts? If we do so, the dot product of floats as approximated by signed integers is

i_1 i_2 \alpha^2 + i_1 \alpha m + i_2 \alpha m + i_1 \alpha^2 c + i_2 \alpha^2 c + 2 m \alpha c + c^2 \alpha^2 + m^2

where \alpha = (max - min)/255, m = min, c = 128, and i_1 and i_2 are the signed ints. The above was obtained by writing signed_int = (float - min)/(max - min) * 255 - c and then expanding float_1 * float_2.
Note that for c = 0, the above answer should reduce to the unsigned int case.
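Written out from those definitions (my rendering of the same algebra; f = \alpha (i + c) + m is the de-quantized float):

```latex
f_1 f_2 = \bigl(\alpha (i_1 + c) + m\bigr)\bigl(\alpha (i_2 + c) + m\bigr)
        = \alpha^2 i_1 i_2
          + \alpha^2 c\,(i_1 + i_2)
          + \alpha m\,(i_1 + i_2)
          + \alpha^2 c^2 + 2\alpha c m + m^2
```

Setting c = 0 kills every c term and recovers the unsigned expansion, as noted above.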
My concern for 8 bit quantization is the algebraic expansion of dot-product and the corrective terms.
For scalar quantization, the score corrections for dotProduct are derivable via some simple algebra, but I am not immediately aware of a way to handle the sign switch. I didn't bother digging deeper there as int7 provides basically the exact same recall. I am eager to see if 8bit can be applied while keeping the score corrections.
In case you need it, here is valuable background:
https://www.elastic.co/search-labs/blog/scalar-quantization-101
For some background on the small additional correction provided for int4 (or any scalar quantization where confidence_interval is set to 0): https://www.elastic.co/search-labs/blog/vector-db-optimized-scalar-quantization
Thanks @benwtrent -- this is very helpful. I think I understand better :)
I see that we are forced to conflate quantization and distance metric because we want to compute the distance metric directly in quantized space using efficient SIMD instructions.
I.e. one might ideally consider quantization mathematically as being its own problem regardless of distance metrics: you map float32 down to int8 or int4 etc., and it's bidirectional. You can re-map a quantized vector back to float32 (introducing quantization noise of course). But that would be too inefficient (de-quantizing then computing distance metrics in float32 space) ... so we must conflate.
Does the Lucene implementation have per-dimension calculation of quantiles/min/max? Or is it really a global min/max computed across all dimensions and all vectors' values in those dimensions? Scalar-quantization-101 seems to imply it's the latter. Do ML vectors "typically" have similar distributions of values within each dimension?
Could someone summarize what is actually wrong with Lucene here? From my luceneutil tests it looks like int8 with DOT_PRODUCT is buggy / does not work (.006 recall on cohere-768-IP where float32 gets 0.957).

E.g. int8 quantization only works if your vectors are unsigned, or int8 quantization only works with VectorSimilarityFunction.EUCLIDEAN or so?
I know we are discussing how to make the conflated quantization and distance metric combinations math work out (i.e. how to fix the issue), but I'm just trying to get the big picture of what the issue even is. How would we explain the buggy behavior to users?
@mikemccand
Does the Lucene implementation have per-dimension calculation of quantiles/min/max? Or is it really a global min/max computed across all dimensions and all vectors' values in those dimensions?
Indeed, it is the latter, over all dimensions.
Do ML vectors "typically" have similar distributions of values within each dimension?
For some vectors, this is true, for others, not so. A good example of vectors where this is NOT the case is CohereV2. However, the error distribution per dimension is independent and given the central limit theorem, the errors do not compound.
I know we are discussing how to make the conflated quantization and distance metric combinations math work out (i.e. how to fix the issue), but I'm just trying to get the big picture of what the issue even is. How would we explain the buggy behavior to users?
The scalar quantizer currently assumes the destination byte values are unsigned. However, Java cannot have unsigned byte values that are > 127. Consequently, when trying to utilize the full number of bits in int8, the scoring falls apart.
There are actually two ways to fix this bug:
Looking back, when I finished up the ScalarQuantization support for int4, I should have disallowed 8 bits as a parameter to prevent this confusion. I allowed it and I shouldn't have.
The scalar quantizer currently assumes the destination byte values are unsigned. However, Java cannot have unsigned byte values that are > 127. Consequently, when trying to utilize the full number of bits in int8 the scoring falls apart.
OK, thanks. Regardless of the distance metric, int8 is buggy today. (Hmm, are there any special cases where int8 is working? I see blog posts about using int8 in Elasticsearch...). I would think even if your vectors are e.g. always non-negative in all values, int8 quantization will still try (in a buggy way) to use the full 8 bits, and will still be buggy.
Looking back, when I finished up the ScalarQuantization support for int4, I should have disallowed 8 bits as a parameter to prevent this confusion. I allowed it and I shouldn't have.
Well since it doesn't work at all today, I think we can freely change/fix the behavior, e.g. in 9.12 or even a 9.11.2? We could simply remove support for int8? Anyone who is using it today is getting terrible results, so they really should not be using it, and on upgrade, when they see Lucene no longer supports it, they can fix their code to either go to int4 or int7?
We could (later, not rushing) try to explore the fixes you suggested above to actually get it working, but given the discussions so far, the math sounds complicated :)
However, Java cannot have unsigned byte values that are > 127.
Yeah this is always so annoying :) And if you try to negate the special -128 byte value, you get -128 back. And (byte) Math.abs(-128) also returns -128. This makes for an evil interview line of questioning...
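A quick demo of that foot-gun (plain Java; easy to check in jshell):

```java
byte b = -128;
System.out.println((byte) -b);             // -128: negating promotes to int 128, the byte cast wraps back
System.out.println((byte) Math.abs(-128)); // -128: abs returns int 128, the byte cast wraps
System.out.println(Math.abs(-128));        // 128: fine while it stays an int
```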
@MilindShyani I expanded your equation myself again to double check.
Seems similar?
(NOTE: scale is just the inverse of alpha).
It seems to me that:

byte = (float - minQuantile) * scale - c

This implies that the float is actually:

alpha * (byte + c) + minQuantile
So that this is more readable, let's assume our two byte values are x & y, m = minQuantile, a = alpha, and the corrective constant is c.

Expanded out algebra:
a^2 c^2 + a^2 c x + a^2 c y + a^2 x y + 2 a c m + a m x + a m y + m^2
- a^2 c^2 can be pre-calculated. Since both correction offsets will have this, I will simply divide it by 2.
- a^2 c x & a^2 c y can be pre-calculated and added.
- 2 a c m is precalculated, but since both corrections will have it, let's divide by 2 again.
Let's make the assumption that the other values are already calculated.
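As a sketch of how those pieces could combine per vector (illustrative names, not the actual Lucene code; it sums the expansion over all dimensions and ignores the rounding-error compensation discussed below):

```java
// a = alpha, m = minQuantile, c = the signed correction, d = q.length.
// Per-vector offset: a^2*c*S + a*m*S plus half of d*(a^2 c^2 + 2acm + m^2),
// where S is the sum of the vector's quantized components; splitting the
// shared constants in half means the two offsets contribute them exactly once.
static float correctiveOffset(byte[] q, float a, float m, float c) {
  int s = 0;
  for (byte v : q) s += v;
  int d = q.length;
  return a * a * c * s + a * m * s + d * (a * a * c * c + 2 * a * c * m + m * m) / 2f;
}

// Reconstructed float dot product: a^2 * (signed int8 dot) + both offsets.
static float reconstructedDot(byte[] qx, byte[] qy, float offX, float offY, float a) {
  int dot = 0;
  for (int i = 0; i < qx.length; i++) dot += qx[i] * qy[i];
  return a * a * dot + offX + offY;
}
```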
I am getting positive results with this. I will return back once I run some tests.
@naveentatikonda @mikemccand https://github.com/apache/lucene/compare/main...benwtrent:fix-8-bit?expand=1
That is my current WIP
With this change, I get the following recall@100 over 500_000 coherev2 (using max-inner-product as designed for the model):

| recall | maxConn | beamWidth |
|---|---|---|
| 0.872 | 16 | 500 |

Int8 with my change:

| recall | maxConn | beamWidth |
|---|---|---|
| 0.841 | 16 | 500 |
So, recall is better than before for max-inner-product, but still worse than int7. I am wondering if we are running into numeric limitations, because the signed correction could get very large.
@benwtrent The new numbers look good. But I have a few questions based on your code changes:
1. Why are we setting SIGNED_CORRECTION as 127 instead of 128?
2. Shouldn't the min be -128?
3. In this new code we are not adding the rounding error corrective offset ((dx - dxq) * dxq) like you are doing in the regular implementation. Is this not required? Last time when I tried to run a test without this rounding error corrective offset using 7 bits there was a huge drop in recall.

@naveentatikonda this is a WIP :).
Why are we setting SIGNED_CORRECTION as 127 instead of 128 ?
My logic here was scaling by what we will round by. But thinking about it, I think we should scale down by 128.
...the min should be -128
I disagree. We want -127, 127. Otherwise we have to handle the exotic -128 * -128 case when doing vector comparisons, and this disallows some nice SIMD optimizations down the road.
In this new code we are not adding this...
This is a wip...I didn't figure out yet how to incorporate this rounding error as it assumes unsigned operations and has a knock-on effect on the buckets. So, simply adding the naive rounding error would reduce recall significantly. I am wondering how to apply it. Will report back if I can figure it out :).
I pushed a change to my branch.
Recall over the same data is now: 0.877.
I disagree. We want -127, 127. Otherwise we have to handle the exotic -128 * -128 case when doing vector comparisons, and this disallows some nice SIMD optimizations down the road.

@benwtrent Sorry, didn't get it. Can you pls share more details or links related to this.
- Applies a modification of the rounding error compensation. The idea behind how I applied the different compensation now is that we need to handle the linear scaling as this can cause us to snap to different buckets. I am still not 100% sure my transformation is correct there.
@MilindShyani Can you check and comment on this? (dx - dxq) * (dxq - SIGNED_CORRECTION * alpha) is the rounding error compensation added by Ben.
I need to know more about signed representation (Two's vs Ones' complement), but can't we map 0 - 255 to -127 to 128 using shifts? If we do so,
For unsigned to signed value range conversions you can just flip the sign bit; for the reverse case, add the min_signed value to the signed value to convert it back to the unsigned value range.
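If I read the suggestion right, a small demo (illustrative; XOR-ing the sign bit is equivalent to subtracting 128):

```java
int unsigned = 200;                      // quantized value in 0..255
byte signed = (byte) (unsigned ^ 0x80);  // 72 == 200 - 128, now in -128..127
int back = (signed & 0xFF) ^ 0x80;       // 200 again
```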
@jbhateja could you unpack how this would actually work when using dot-product with linear scale corrections?
I would imagine we could switch to an "unsigned byte" comparison and thus handle the bit flipping directly in SIMD, but this would introduce a new vector comparison that needs to be added.
I am not sure how we can just "flip the bit" while maintaining distance comparisons for euclidean & dotProduct without introducing error or slowing things down significantly.
@benwtrent Apologies for the late response! I am traveling (and marveling at 2000-year-old pyramids) right now. The transformation you wrote indeed matches mine. Thinking about the bucketing correction meanwhile, will post shortly. Thanks!
@MilindShyani no worries dude! Enjoy your vacation and don't stress about this, it will still be here when you get back to the office :)
Hello!
So I just did the calculation and I get the following
1) I get an innocuous sign discrepancy, i.e., the term -SIGNED_CORRECTION * alpha should instead be +SIGNED_CORRECTION * alpha. I will double check this.
2) I get an extra term. This is rather serious since it would also affect the unsigned math. My correction looks like
(dx - dxq)*(dxq + \alpha * c + \alpha * min_quantile)
The last term is the extra term, which is present even when c = Signed_correction = 0. What do you think of the extra term @benwtrent ?
@MilindShyani OK, I did some more benchmarking.
I tried switching to + and with your full correction term, and recall significantly dropped to 0.518.

I tried the following permutations and they all gave the same recall of 0.877:

- (dx - dxq) * (dxq - SIGNED_CORRECTION * alpha)
- (dx - dxq) * (dxq - SIGNED_CORRECTION * alpha - alpha * minQuantile)
- (dx - dxq) * (dxq - SIGNED_CORRECTION * alpha + alpha * minQuantile)
This makes me think the additional term of alpha * minQuantile is bringing nothing to the table, even for non-unit vectors (where I would expect quantiles & alpha to be more extreme).
Hi @benwtrent, I am trying to carefully check the math vs code. The first thing I noticed is that my definitions of dxs and dxq don't necessarily match with yours. That might be the origin for the sign error. I will uniformize the notation and get back.
But meanwhile, in the latest code https://github.com/apache/lucene/compare/main...benwtrent:fix-8-bit?expand=1 I find that

float dxs = scale * dxc;

but scale is defined as 127/(max - min). Shouldn't the 127 actually be 255? FYI that's what I used in my transformation, i.e., first map the float to 0 to 255 and then shift by 128.
@MilindShyani that is just a bad comment. All that will have to be cleaned up. The 127 is when the int7 is the default.
You can see in the ctor, we use the provided bits to create the appropriate number.
The math is quite simple as we saw above, but the code (I guess because it's trying to do 7 and 8 bits at the same time) is giving me a really hard time 😅
How is the quadratic term being calculated? Is it dx_q * dy_q using the new definition of the _q variables, or using the old? Recall that the new definition is dx_q = Math.round(dx * scale - c) * alpha while the old definition was dx_q = Math.round(dx * scale) * alpha.
In any case, you have worked with the code far longer than I have, so if you are confident about it please go ahead and commit :)
In any case, you have worked with the code far longer than I have, so if you are confident about it please go ahead and commit :)
@naveentatikonda could you test my branch as well? I want to make sure my numbers were not flukes on my end.
We should also test with different data sets & different similarity metrics. We should verify that euclidean is also fixed, etc.
@naveentatikonda could you test my branch as well? I want to make sure my numbers were not flukes on my end.
@benwtrent Apologies for my late response. I wasn't feeling well last week. Will run the tests and share the results this week.
@benwtrent I ran some tests with the changes in your branch for 8 bits, and the recall for L2 is almost the same as what you got. But recall for the innerproduct and cosinesimilarity space types is still bad.
| SpaceType | Dataset | Recall |
|---|---|---|
| L2 | Cohere | 0.86 |
| InnerProduct | Cohere | 0.0026 |
| Cosine | glove-200 | 0.0037 |
Besides this, I observed another recall issue with 4 bits for cosinesimilarity when tested with glove-200 dataset.
| Bits | Compress | Recall |
|---|---|---|
| 7 | False | 0.72 |
| 4 | False | 0.107 |
| 4 | True | 0.089 |
@naveentatikonda cohere v2 (768 dims)?
I was getting decent recall with dot product over various datasets. Maybe I have a data set sampling bias.
I will try the others you mentioned.
I just noticed that I might not have pushed up my branch. But I will rerun my tests to verify:
https://github.com/apache/lucene/compare/main...benwtrent:lucene:fix-8-bit
As for int4, I wouldn't expect it to have very good recall with any statically set confidence interval. But I can attempt to replicate your results.

Also, there should be NO difference in recall between compressed or not. Something weird is happening there.
@naveentatikonda cohere v2 (768 dims)?
yes
I was getting decent recall with dot product over various datasets. Maybe I have a data set sampling bias.
- Inner Product: cohere-wiki-1m
- L2: Cohere-wiki, but with 475K vectors
- Cosine: GloVe-200-Angular
@naveentatikonda AH, I see what I did: I pushed one of my experiments to that branch, not an actual good change. Sorry for the false alarm. I will correct asap.
I just noticed that I might not have pushed up my branch. But I will rerun my tests to verify:
I can test it again with your changes once you push it.
As for int4, I wouldn't expect it to have very good recall with any statically set confidence interval. But I can attempt to replicate your results.
No, it was tested with dynamic confidence interval. Just now I reran for 4 bits without compress using 0 as confidence interval and the recall is still just 0.13.
Also, there should be NO difference in recall between compressed or not. Something weird is happening there.
Oh I see...let me test them again
@naveentatikonda AH, I see what I did, I pushed one of my experiments to that branch not an actual good change. Sorry for the false alarm. i will correct asap.
No worries, pls let me know once they are ready. I will test them again.
@naveentatikonda OK, I have pushed to https://github.com/apache/lucene/compare/main...benwtrent:lucene:fix-8-bit?expand=1
I am getting similar recall to int7, but honestly, it doesn't seem better at all and at times worse. It just doesn't seem worth the complexity from a recall perspective unless we can figure out why recall isn't significantly better.
@naveentatikonda I opened an issue for the int4 & glove200 results. Interesting to be sure. I wonder if we are suffering because it's a statistics-based model, or if it's just due to the lower dimension count: https://github.com/apache/lucene/issues/13614
One interesting finding, is statically setting the confidence interval very low (lower than is currently allowed in Lucene) makes recall way better.
FWIW, this is the opposite of what we found from transformer based models, where the dynamic interval was almost a necessity.
I am getting similar recall to int7, but honestly, it doesn't seem better at all and at times worse. It just doesn't seem worth the complexity from a recall perspective unless we can figure out why recall isn't significantly better.
Makes sense. But have you tested with other metric types, and is there any improvement in recall for innerproduct or cosine?
Description
Based on some of the benchmarking tests that I ran from OpenSearch, there is a significant drop in recall (approx. 0.03) for 8 bits, irrespective of space type, confidence interval, or other parameters. For the same configuration, the recall for 7 bits is at least greater than 0.85.
Root Cause
As part of quantization, after normalizing each dimension of the vector into the [0 to 2^bits - 1] range, we cast it into a byte to bring it into the byte range of [-128 to 127]. For 7 bits, we normalize each value into 0 to 127, which is already in the byte range, so there is no rotation or shifting of data. But for 8 bits, any vector dimension which falls within the 128 to 255 range after normalization changes sign and magnitude when it is cast into a byte, which leads to a non-uniform shifting or distribution of the data. As per my understanding, this is the potential root cause for the huge drop in recall.
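A two-line illustration of that wraparound (plain Java):

```java
int small = 120, large = 200;  // both normalized into 0..255, small < large
byte qs = (byte) small;        // 120
byte ql = (byte) large;        // -56: sign and magnitude change on the cast
// qs > ql even though small < large, so byte-space distances are scrambled
```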
To validate this, I updated the quantization code and tested it against the L2 space type by linearly shifting (subtracting 128) each dimension after normalizing it into the 0 to 255 range, such that each dimension is uniformly distributed and within the byte range of -128 to 127 (finally rounding and clipping it to handle edge cases). With these changes, we get a min. recall of 0.86 for the same configuration.
Note - the below pseudo code is not a fix; it is a different quantization technique used only to validate the root cause. It works only for the L2 space type, because L2 is shift invariant while other space types like cosinesimil and inner product are not.
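A minimal sketch of that shift-based technique (a reconstruction from the description above, not the exact snippet that was posted):

```java
// L2-only validation: normalize each dimension into 0..255, shift by 128
// into -128..127, then round and clip to handle edge cases.
static byte quantizeShifted(float v, float min, float max) {
  float normalized = (v - min) / (max - min) * 255f; // into [0, 255]
  float shifted = normalized - 128f;                 // into [-128, 127]
  int rounded = Math.round(shifted);
  return (byte) Math.max(-128, Math.min(127, rounded));
}
```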
@benwtrent @mikemccand Can you please take a look and confirm if you see this issue when tested with lucene-util?