@naveentatikonda I opened an issue for the int4 & glove200. Interesting, to be sure. I wonder if we are suffering because it's a statistics-based model, or if it's just due to the lower dimension count: #13614
One interesting finding is that statically setting the confidence interval very low (lower than is currently allowed in Lucene) makes recall way better.
FWIW, this is the opposite of what we found from transformer based models, where the dynamic interval was almost a necessity.
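To make that knob concrete: the confidence interval decides which quantiles of the component distribution become the clipping bounds before scaling. A minimal standalone sketch of that idea (my own illustration, not Lucene's actual `ScalarQuantizer` code):

```java
import java.util.Arrays;

/** Minimal sketch of confidence-interval based clipping for scalar quantization. */
public class ConfidenceIntervalSketch {

  /**
   * Returns {lower, upper} bounds for the given confidence interval, e.g. 0.9
   * keeps the central 90% of the component values and clips the tails.
   */
  static float[] quantileBounds(float[] values, float confidenceInterval) {
    float[] sorted = values.clone();
    Arrays.sort(sorted);
    double tail = (1.0 - confidenceInterval) / 2.0; // mass clipped on each side
    int lo = (int) Math.floor(tail * (sorted.length - 1));
    int hi = (int) Math.ceil((1.0 - tail) * (sorted.length - 1));
    return new float[] {sorted[lo], sorted[hi]};
  }

  public static void main(String[] args) {
    float[] components = {-3.1f, -0.2f, 0.0f, 0.1f, 0.3f, 0.4f, 0.5f, 2.9f};
    // A lower interval clips more tail mass, so the quantization buckets
    // cover the dense central region instead of being stretched by outliers.
    System.out.println(Arrays.toString(quantileBounds(components, 0.9f)));
    System.out.println(Arrays.toString(quantileBounds(components, 0.6f)));
  }
}
```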
@benwtrent Just saw the github issue. This looks interesting. Will try to test with some other cosine dataset with higher dimension to validate and rule out these possibilities. Thanks!
I just tested KNN recall using `knnPerfTest.py` from luceneutil on 4, 7, and 8 bit quantization, and still see 8 bit quantization broken. This is with Cohere (768 dimension) vectors, 250K docs, 32 `maxConn`, 50 `beamWidthIndex`, 20 `fanout`.
For `EUCLIDEAN`:

| recall | latency | nDoc | fanout | maxConn | beamWidth | quantized | visited | index ms | selectivity | filterType |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.541 | 1.27 | 250000 | 20 | 32 | 50 | 4 bits | 7156 | 18786 | 1.00 | post-filter |
| 0.886 | 1.18 | 250000 | 20 | 32 | 50 | 7 bits | 6763 | 17791 | 1.00 | post-filter |
| 0.038 | 1.74 | 250000 | 20 | 32 | 50 | 8 bits | 10066 | 26265 | 1.00 | post-filter |
And `DOT_PRODUCT` (angular):

| recall | latency | nDoc | fanout | maxConn | beamWidth | quantized | visited | index ms | selectivity | filterType |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.497 | 0.96 | 250000 | 20 | 32 | 50 | 4 bits | 4903 | 16632 | 1.00 | post-filter |
| 0.771 | 0.87 | 250000 | 20 | 32 | 50 | 7 bits | 4319 | 15565 | 1.00 | post-filter |
| 0.003 | 0.92 | 250000 | 20 | 32 | 50 | 8 bits | 9157 | 30284 | 1.00 | post-filter |
And `COSINE`:

| recall | latency | nDoc | fanout | maxConn | beamWidth | quantized | visited | index ms | selectivity | filterType |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.531 | 1.23 | 250000 | 20 | 32 | 50 | 4 bits | 6816 | 20618 | 1.00 | post-filter |
| 0.650 | 1.22 | 250000 | 20 | 32 | 50 | 7 bits | 6921 | 19454 | 1.00 | post-filter |
| 0.002 | 1.00 | 250000 | 20 | 32 | 50 | 8 bits | 8692 | 188290 | 1.00 | post-filter |
Should we maybe just remove 8 bit support?
From the discussion above it sounds like even the fixes we are testing are not much better than 7 bit, and add substantial code complexity?
In any event, I think this should be a blocker for 9.12 / 10.0? We should do something before releasing (fix 8 bit case, or remove it)...
(It's also entirely possible I am making some sort of silly mistake trying to run this tooling that I do not fully understand, heh).
If nobody else jumps on in the next day or so, I'll work up a PR to remove `int8` for now...
@mikemccand that makes sense to me. All the numerics we are messing with here show that we are hitting some weird edge cases where `int8` just isn't worth it if it remains signed and we attempt to accurately scale the linear transformation of the scores.
I also don't have cycles right now to dig further, though I welcome others' attempts.
My gut reaction is that the only way to handle this for `int8` is to go "full unsigned" and have some custom scoring logic that transforms the bytes as unsigned, etc., though that adds significant code to our vector utils, etc. (see my very old and probably now defunct draft: https://github.com/apache/lucene/pull/12694)
Of course, keeping 7 bits and 4 bits and just removing 8 bits ;) @mikemccand
++ to disallowing int8 in the Scalar Quantized format.
Description
Based on some of the benchmarking tests that I ran from OpenSearch, there is a significant drop in recall (approx. 0.03) for 8 bits, irrespective of space type, confidence interval, or other parameters. For the same configuration, the recall for 7 bits is always greater than 0.85.
Root Cause
As part of quantization, after normalizing each dimension of the vector into the [0, 2^bits - 1] range, we cast it to a byte to bring it into the byte range of [-128, 127]. For 7 bits, each value is normalized into [0, 127], which is already within the byte range, so there is no rotation or shifting of data. But for 8 bits, any dimension that falls within [128, 255] after normalization changes sign and magnitude when it is cast to a byte, which leads to a non-uniform shifting of the data. As per my understanding, this is the potential root cause of the huge drop in recall.
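A short illustration of the cast problem (with a hypothetical component value):

```java
// Component value after 8-bit normalization into [0, 255]:
float normalized = 200.0f;
byte stored = (byte) normalized; // narrowing cast wraps: stored == -56
// [0, 127] survives the cast unchanged, but [128, 255] wraps to [-128, -1],
// so the stored bytes are no longer monotonic in the original values.
```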
To validate this, I updated the quantization code and tested it against the L2 space type, linearly shifting (subtracting 128) each dimension after normalizing it into the [0, 255] range, so that the values are uniformly distributed within the byte range of [-128, 127] (finally rounding and clipping to handle edge cases). With these changes, we get a minimum recall of 0.86 for the same configuration.
Note - The pseudo code below is not a fix; it is a different quantization technique used to validate the root cause. It works only for the L2 space type because L2 is shift invariant, while other space types like cosinesimil and inner product are not.
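Since the original snippet isn't reproduced here, the following is a reconstruction from the description above (names are mine; `minQuantile`/`maxQuantile` are assumed to come from the usual confidence-interval calculation):

```java
/**
 * Reconstruction of the L2-only validation (not a fix): normalize into
 * [0, 255], then shift by -128 so values land uniformly in [-128, 127].
 */
static byte quantizeShifted(float value, float minQuantile, float maxQuantile) {
  // Clip to the quantile bounds, then scale into [0, 255].
  float clipped = Math.max(minQuantile, Math.min(maxQuantile, value));
  float scaled = (clipped - minQuantile) / (maxQuantile - minQuantile) * 255f;
  // Subtract 128: valid only for L2, which is shift invariant; dot product
  // and cosine scores would change under this transformation.
  int shifted = Math.round(scaled) - 128;
  // Final clamp to handle rounding at the edges.
  return (byte) Math.max(-128, Math.min(127, shifted));
}
```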
@benwtrent @mikemccand Can you please take a look and confirm if you see this issue when tested with lucene-util?