apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Significant drop in recall for 8 bit Scalar Quantizer #13519

Closed: naveentatikonda closed this issue 2 weeks ago

naveentatikonda commented 3 months ago

Description

Based on some benchmarking tests that I ran from OpenSearch, there is a significant drop in recall (approx. 0.03) for 8 bits, irrespective of space type, confidence interval, and a few other parameters. For the same configuration, the recall for 7 bits is at least 0.85.

Root Cause

As part of quantization, after normalizing each dimension of the vector into the [0, 2^bits - 1] range, we cast it into a byte to bring it into the byte range of [-128, 127]. For 7 bits, each value is normalized into [0, 127], which already fits in a byte, so there is no rotation or shifting of the data. But for 8 bits, any vector dimension that falls within [128, 255] after normalization changes sign and magnitude when cast into a byte, which leads to a non-uniform shifting or distribution of the data. As per my understanding, this is the potential root cause of this huge drop in recall.
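As a quick illustration of the wrap-around (a standalone snippet for this writeup, not Lucene code):

    // A normalized value in the 128..255 range changes sign when cast to
    // Java's signed byte, destroying the ordering established by normalization.
    public class ByteCastDemo {
      public static void main(String[] args) {
        int small = 100; // normalized value that already fits in a signed byte
        int large = 200; // normalized value in the problematic 128..255 range
        System.out.println((byte) small); // prints 100 (order preserved)
        System.out.println((byte) large); // prints -56 (i.e. 200 - 256): the larger value now compares smaller
      }
    }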

// pseudo code for the existing implementation in Lucene SQ for 8 bits

    float dx = v - minQuantile; // v is the float value at destIndex (dx feeds the corrective offset elsewhere)
    float dxc = Math.max(minQuantile, Math.min(maxQuantile, v)) - minQuantile; // clamp v to the quantile range

    // Normalize it into the 0 to 255 range for 8 bits
    // scale = 255 / (maxQuantile - minQuantile)
    float dxs = scale * dxc;

    if (dest != null) {
      dest[destIndex] = (byte) Math.round(dxs); // cast to byte after normalization; 128..255 wrap to negative
    }

To validate this, I updated the quantization code and tested it against the L2 space type by linearly shifting (subtracting 128) each dimension after normalizing it into the 0 to 255 range, so that the data is uniformly distributed within the byte range of -128 to 127 (finally rounding and clipping it to handle edge cases). With these changes, we get a min. recall of 0.86 for the same configuration.

Note - The pseudo code below is not a fix; it is a different quantization technique used only to validate the root cause. It works only for the L2 space type because L2 is shift-invariant, while other space types such as cosinesimil and inner product are not (a short derivation follows the code below).

// pseudo code for the custom changes in Lucene SQ for 8 bits and the L2 space type

    float dx = v - minQuantile; // v is the float value at destIndex
    float dxc = Math.max(minQuantile, Math.min(maxQuantile, v)) - minQuantile; // clamp v to the quantile range

    // Normalize it into the 0 to 255 range for 8 bits
    // scale = 255 / (maxQuantile - minQuantile)
    float dxs = scale * dxc;

    // Subtract 128 to shift it into the -128 to 127 byte range
    float a = Math.round(dxs - 128);

    // Clip to handle edge cases
    if (a > 127) {
      a = 127;
    }
    if (a < -128) {
      a = -128;
    }

    if (dest != null) {
      dest[destIndex] = (byte) a; // cast to byte
    }
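To make the shift-invariance point explicit, here is the short derivation (with $c$ denoting the uniform shift of 128 applied to every dimension):

$$\lVert (x - c) - (y - c) \rVert^2 = \lVert x - y \rVert^2$$

so L2 distances, and hence L2 rankings, are unchanged by the shift, while

$$(x - c) \cdot (y - c) = x \cdot y - c \cdot (x + y) + \lVert c \rVert^2$$

depends on $x$ and $y$ individually, so dot-product and cosine rankings are not preserved.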
| Beam Width | Max Connections | Dataset | SpaceType | Dimension | Confidence Interval |
| --- | --- | --- | --- | --- | --- |
| 100 | 16 | Cohere-wiki | L2 | 768 | default |

| Bits | Primary Shards | Recall |
| --- | --- | --- |
| 8 | 1 | 0.03 |
| 7 | 1 | 0.85 |
| 4 | 1 | 0.57 |
| 8 (custom changes) | 1 | 0.86 |
| 8 (custom changes) | 4 | 0.93 |

@benwtrent @mikemccand Can you please take a look and confirm whether you see this issue when testing with luceneutil?

naveentatikonda commented 2 months ago

> @naveentatikonda I opened an issue for the int4 & glove200. Interesting, to be sure. I wonder if we are suffering because it's a statistics-based model, or if it's just due to the lower dimension count: #13614
>
> One interesting finding is that statically setting the confidence interval very low (lower than is currently allowed in Lucene) makes recall way better.
>
> FWIW, this is the opposite of what we found with transformer-based models, where the dynamic interval was almost a necessity.

@benwtrent Just saw the GitHub issue. This looks interesting. I will try to test with another cosine dataset with a higher dimension count to validate and rule out these possibilities. Thanks!
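For context on the confidence-interval finding quoted above, here is a rough sketch of what the interval controls (the helper below is illustrative, not Lucene's exact API): it picks the quantile cut points used as minQuantile and maxQuantile, so a lower interval trims more of each tail before scaling.

    // Illustrative only: keep the central `ci` fraction of the sorted values as
    // the quantization range; values outside it are clamped during quantization.
    static float[] quantileBounds(float[] sortedValues, float ci) {
      int n = sortedValues.length;
      int trim = (int) ((1f - ci) * n / 2f); // number of values dropped from each tail
      return new float[] {sortedValues[trim], sortedValues[n - 1 - trim]};
    }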

mikemccand commented 1 month ago

I just tested KNN recall using knnPerfTest.py from luceneutil on 4, 7, and 8 bit quantization, and I still see 8 bit quantization broken.

This is with Cohere (768-dimension) vectors, 250K docs, 32 maxConn, 50 beamWidthIndex, 20 fanout.

For EUCLIDEAN:

recall  latency nDoc    fanout  maxConn beamWidth       quantized       visited index ms        selectivity     filterType
0.541    1.27   250000  20      32      50      4 bits  7156    18786   1.00    post-filter
0.886    1.18   250000  20      32      50      7 bits  6763    17791   1.00    post-filter
0.038    1.74   250000  20      32      50      8 bits  10066   26265   1.00    post-filter

And DOT_PRODUCT (angular):

recall  latency nDoc    fanout  maxConn beamWidth       quantized       visited index ms        selectivity     filterType
0.497    0.96   250000  20      32      50      4 bits  4903    16632   1.00    post-filter
0.771    0.87   250000  20      32      50      7 bits  4319    15565   1.00    post-filter
0.003    0.92   250000  20      32      50      8 bits  9157    30284   1.00    post-filter

And COSINE:

recall  latency nDoc    fanout  maxConn beamWidth       quantized       visited index ms        selectivity     filterType
0.531    1.23   250000  20      32      50      4 bits  6816    20618   1.00    post-filter
0.650    1.22   250000  20      32      50      7 bits  6921    19454   1.00    post-filter
0.002    1.00   250000  20      32      50      8 bits  8692    188290  1.00    post-filter

Should we maybe just remove 8 bit support?

From the discussion above it sounds like even the fixes we are testing are not much better than 7 bit, and add substantial code complexity?

In any event, I think this should be a blocker for 9.12 / 10.0? We should do something before releasing (fix 8 bit case, or remove it)...

(It's also entirely possible I am making some sort of silly mistake trying to run this tooling that I do not fully understand, heh).

mikemccand commented 3 weeks ago

If nobody else jumps on this in the next day or so, I'll work up a PR to remove int8 for now...

benwtrent commented 3 weeks ago

@mikemccand that makes sense to me. All the numerics we are messing with here show that we are hitting some weird edge cases where int8 just isn't worth it if it remains signed and we attempt to accurately scale the linear transformation of the scores.

I also don't have cycles right now to dig further, though I welcome others' attempts.

My gut reaction is that the only way to handle this for int8 is to go "full unsigned" and add some custom scoring logic that treats the bytes as unsigned, etc., though that adds significant code to our vector utils, etc. (see my very old and probably now-defunct draft: https://github.com/apache/lucene/pull/12694)
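For illustration, the "full unsigned" direction might look roughly like the sketch below (my guess at the shape of it, not the draft PR's actual code): scoring reinterprets each quantized byte as an unsigned 0..255 value, so the ordering established by normalization survives.

    // Sketch only: a dot product over quantized bytes reinterpreted as unsigned
    // 0..255, so a stored (byte) 200 scores as 200 rather than -56.
    static int unsignedDotProduct(byte[] a, byte[] b) {
      int sum = 0;
      for (int i = 0; i < a.length; i++) {
        sum += Byte.toUnsignedInt(a[i]) * Byte.toUnsignedInt(b[i]);
      }
      return sum;
    }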

benwtrent commented 3 weeks ago

Of course, keeping 7 bits and 4 bits and just removing 8 bits ;) @mikemccand

ChrisHegarty commented 3 weeks ago

++ to disallowing int8 in the Scalar Quantized format.