Hmm is this the same issue as #13519 er sorry I meant #13350.
OK I managed to run knnPerfTest.py from luceneutil, using mpnet vectors (768 dims), and I think I am also seeing horrific performance for int8 but OK for int4 and int7:
| quantizedBits | recall | latency | nDoc | fanout | maxConn | beamWidth | visited | index ms | selectivity | filterType |
|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 0.983 | 2.55 | 250000 | 20 | 64 | 250 | 9294 | 66542 | 1.00 | post-filter |
| 4 | 0.645 | 2.13 | 250000 | 20 | 64 | 250 | 13010 | 79605 | 1.00 | post-filter |
| 7 | 0.943 | 1.87 | 250000 | 20 | 64 | 250 | 11775 | 78806 | 1.00 | post-filter |
| 8 | 0.002 | 2.81 | 250000 | 20 | 64 | 250 | 23879 | 79177 | 1.00 | post-filter |
NOTE: this is my first time successfully running knnPerfTest.py so it's entirely possible I messed something up! But given that I'm seeing decent recall for unquantized (32 bit) and 7 bit, I think the 8 bit result is believable and horrible.
Thank you @naveentatikonda for your investigation here! I am busy with some other things, but I want to circle back here and figure out how we can make int8 better.
Thank you @naveentatikonda for your investigation here!
+1!
I also ran luceneutil's knnPerfTest.py with the cohere-768-IP dataset (linked on #13350) and int8 also looks buggy:
| quantizedBits | recall | latency | nDoc | fanout | maxConn | beamWidth | visited | index ms | selectivity | filterType |
|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 0.957 | 12.81 | 1000000 | 20 | 64 | 250 | 28918 | 647759 | 1.00 | post-filter |
| 4 | 0.517 | 7.24 | 1000000 | 20 | 64 | 250 | 23293 | 1114730 | 1.00 | post-filter |
| 7 | 0.876 | 6.42 | 1000000 | 20 | 64 | 250 | 20291 | 1034489 | 1.00 | post-filter |
| 8 | 0.006 | 12.99 | 1000000 | 20 | 64 | 250 | 43487 | 1214556 | 1.00 | post-filter |
@naveentatikonda it looks like you have some changes to Lucene's sources that might fix the buggy quantization? Could you maybe open a PR, or post a git diff output here or so? Quantized math is hard!
Hmm is this the same issue as ~#13519~ er sorry I meant #13350.
@mikemccand First of all, sorry for the confusion: this issue (#13350) was actually created for int7 SQ using Inner Product. There was some misunderstanding; looking at this PR (#12582) I thought int8 was released in Lucene 9.10, but later realized it was 7 bits. That issue is resolved after using the default confidence interval instead of 0.9. Will update and close that issue.
But this issue exists with 8 bits for all the space types (tested on L2, inner product, and cosine) and with all the confidence intervals (tested with 0.9, 1.0, default, 0).
@naveentatikonda do you have some ideas on how to fix the 8-bit quantization? At least for L2 I see you suggested a change. Can you raise a small git diff to show how this can be fixed? You can add the fix for L2 first, and then the IP fix can be added iteratively.
I also ran luceneutil's knnPerfTest.py with the cohere-768-IP dataset (linked on #13350) and int8 also looks buggy: (results quoted above)
@naveentatikonda it looks like you have some changes to Lucene's sources that might fix the buggy quantization? Could you maybe open a PR, or post a git diff output here or so? Quantized math is hard!
Thanks for the confirmation. I still don't have a solution which works out of the box for all of 4, 7, or 8 bits and for all space types; still working with my data science team (@milindshyani). The pseudo code I shared above works only for L2 with 8 bits; it was added just to validate the root cause. Will update if we find anything interesting that would solve the issue.
@mikemccand As you mentioned, I created a draft PR (#13526) which will at least fix L2 for 8 bits without breaking the existing functionality. We can work on this together and come up with a better solution which will fix all the space types. Thanks!
@benwtrent Can you please help me understand the following:
1. In terms of quantization, are we doing any extra processing for 4 and 7 bits when compared to 8 bits?
2. For 7 bits, how are we reducing memory usage compared to 8 bits? Are we doing any extra compression somewhere? Am I missing something?
3. For 4 bits, must we set the compress flag to True to reduce the memory usage by about 50% (theoretically) compared to 8 bits?
4. Sorry, might be a dumb question. Does compress only work for 4 bits?
I was also curious about this compress option -- it is javadoc'd here and if I'm reading it correctly, yes, I think it only has an effect with int4 (and maybe in the future if we get to int2, int1, int0.5, etc., it will as well).
I think this applies to the quantized vectors, which are (offheap) hot during searching. So if you pass compress=true, those quantized vectors will take 50% of the (offheap) hot RAM as compressed=false, in exchange for some added CPU cost. But note that the HNSW graph is also hot during searching ... not sure how much RAM (I think also off-heap?) it will need vs the vectors, so you won't see fully 50% less hot RAM overall.
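For a rough sense of scale (my arithmetic, ignoring per-vector corrective offsets and the graph itself): a 768-dim float32 vector takes 3072 bytes; quantized to int8 or int7 it takes 768 bytes; and int4 with compress=true packs two values per byte, so 384 bytes, i.e. half the quantized hot RAM of the int8 case.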
My concern for 8 bit quantization is the algebraic expansion of dot-product and the corrective terms.
For scalar quantization, the score corrections for dotProduct are derivable via some simple algebra, but I am not immediately aware of a way to handle the sign switch. I didn't bother digging deeper there as int7 provides basically the exact same recall. I am eager to see if 8bit can be applied while keeping the score corrections.
In case you need it, here is valuable background:
https://www.elastic.co/search-labs/blog/scalar-quantization-101
For some background on the small additional correction provided for int4 (or any scalar quantization where confidence_interval is set to 0):
https://www.elastic.co/search-labs/blog/vector-db-optimized-scalar-quantization
Let me see if I can answer all the other questions (sorry if I missed any; this is the 2nd thread related to scalar quantization and I might be conflating different things).
In terms of quantization, are we doing any extra processing for 4 and 7 bits when compared to 8 bits?

I believe not.
Typically not. But int4 honestly needs dynamic confidence intervals to work; you cannot statically set the confidence interval if you want good recall without a ton of oversampling. Setting the confidence_interval to 0 is an indication that you want the quantiles to be dynamically calculated (not statically calculated via some confidence interval).
For 7 bits, how are we reducing memory usage compared to 8 bits? Are we doing any extra compression somewhere? Am I missing something?
No, we are not. There are nice SIMD performance properties for dot-product, but similar nice properties can be applied if the signed byte limits are between -127 and 127.
For 4 bits, must we set the compress flag to True to reduce the memory usage by about 50% (theoretically) compared to 8 bits?
Correct.
Might be a dumb question. Does compress only work for 4 bits?
The only dumb question is an unasked one.
Correct. Only for 4 bits. I seriously considered always compressing (thus there being no parameter), but the performance hit was too significant. I was able to close it significantly over time through targeted Panama Vector APIs. I hope that we can move away from having an "uncompressed" version once we get off-heap scoring for quantized vectors. I have a draft PR for this, but I am running into weird perf issues and haven't yet been able to dig more deeply.
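To make the compression concrete, here is a minimal sketch of packing two 4-bit values into one byte (illustrative only; Lucene's actual on-disk layout may pair values differently):

```java
// Pack two int4 values (each in 0..15) into one byte, then unpack them.
static byte pack(int hi, int lo) {
  return (byte) ((hi << 4) | (lo & 0x0F));
}

static int unpackHi(byte packed) {
  return (packed >> 4) & 0x0F;
}

static int unpackLo(byte packed) {
  return packed & 0x0F;
}
```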
I think this applies to the quantized vectors, which are (offheap) hot during searching.
Absolutely correct.
not sure how much RAM (I think also off-heap?) it will need vs the vectors
Way way less. The main cost is the vectors themselves. The graph is way smaller (we do delta variable encoding for the neighbors). The graph size (per layer) is:
- an int per vector in that layer
- an int and its delta & variably encoded neighbors

This obviously changes based on the number of connections configured. Consider the WORST case (where the delta & variable encoding does NOTHING): the base layer would have 32 connections, so that is 33 * 4 bytes per vector for the base layer of the graph, if it's fully connected. This is way less than the vectors, as vector dimensions are usually many hundreds (384 is the smallest performant model I have found, e5small).
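Concretely: 33 * 4 = 132 bytes per vector for that worst-case base layer, versus 768 bytes for a 768-dim int8-quantized vector (3072 bytes in float32), so the graph is a small fraction of the hot working set.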
@benwtrent Thanks for answering all my questions. Thinking from the customer POV, is there any advantage of using 7 bits over 8 bits, especially any improvement in performance? One thing on the positive side, as you mentioned, is that there is no sign-switch issue while using dot product. But on the downside there will be an impact on recall, because with 7 bits we are quantizing data into a smaller range compared to 8 bits.
@naveentatikonda there is indeed a small recall reduction.
But I would need to dig further on how to do signed 8 bit while maintaining the expected scoring.
Do you know how to do the algebraic expansion for signed 8 bit?
I do not want customers to get dramatically different scores between their raw vectors and their scalar quantized vectors.
Do you know how to do the algebraic expansion for signed 8 bit?
Sorry, I don't know. @milindshyani do you know how to do it?
The point of the algebra is to ensure we end up not with just similar rankings, but also similar scores for quantized vectors.
For Euclidean, many terms cancel out and we just end up with the euclidean difference of the quantized values.
For dot product, we have additional values that relate to the quantiles utilized and the dimensional values of the vectors themselves.
I would imagine that the formulae are not qualitatively different, since signed vs unsigned are related by a linear transformation, i.e., x_new = 2*x_old - 127 -- (1).
If x_old has a range between 0 and 127, then x_new has a range between -127 and 127. We can still use the correction from the (very well written) blog,

Correction(x_i, x_j) = \alpha^2 x_i x_j + \alpha x_i min + \alpha x_j min + min^2 --- (2)

but substitute x_old = (x_new + 127)/2 in (2) to obtain the correction in terms of x_new (i.e. signed numbers). What do you guys think?
The only way to find out is to test it. I don't see how your suggestion would work without trying it out.
It's better to think about what it would be in the unsigned byte case. A scale from 0 to 255 is what int8 quantization naturally results in. Then to make it signed, you have a change to -127 to 127.

I don't immediately see how to adjust the correction to handle the switch from the 0 to 255 range to the -127 to 127 range.
@naveentatikonda any solution should support max-inner-product; any testing with normalized vectors & dot-product might result in false assurance. If you are using luceneutil, I would recommend using some older cohere embeddings (with their model from before CohereV3), and using -metric mip.
My bad, the correction I proposed was handling the map from 0 to 127 onto -127 to 127, which does not make sense (unless you are on the real line, which is what tripped me).
I need to know more about signed representation (Two's vs Ones' complement), but can't we map 0 - 255 to -127 to 128 using shifts? If we do so, the dot product of floats as approximated by signed integers is

i_1 i_2 \alpha^2 + i_1 \alpha m + i_2 \alpha m + i_1 \alpha^2 c + i_2 \alpha^2 c + 2 m \alpha c + c^2 \alpha^2 + m^2

where \alpha = (max - min)/255, m = min, c = 128, and i_1 and i_2 are the signed ints. The above was obtained by writing signed_int = (float - min)/(max - min) * 255 - c and then expanding float_1 * float_2.
Note that for c = 0, the above answer should reduce to the unsigned int case.
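Written out from those definitions (my rendering of the same algebra; f = \alpha (i + c) + m is the de-quantized float):

```latex
f_1 f_2 = \bigl(\alpha (i_1 + c) + m\bigr)\bigl(\alpha (i_2 + c) + m\bigr)
        = \alpha^2 i_1 i_2
          + \alpha^2 c\,(i_1 + i_2)
          + \alpha m\,(i_1 + i_2)
          + \alpha^2 c^2 + 2\alpha c m + m^2
```

Setting c = 0 kills every c term and recovers the unsigned expansion, as noted above.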
My concern for 8 bit quantization is the algebraic expansion of dot-product and the corrective terms.
For scalar quantization, the score corrections for dotProduct are derivable via some simple algebra, but I am not immediately aware of a way to handle the sign switch. I didn't bother digging deeper there as int7 provides basically the exact same recall. I am eager to see if 8bit can be applied while keeping the score corrections.
In case you need it, here is valuable background:
https://www.elastic.co/search-labs/blog/scalar-quantization-101
For some background on the small additional correction provided for int4 (or any scalar quantization where confidence_interval is set to 0): https://www.elastic.co/search-labs/blog/vector-db-optimized-scalar-quantization
Thanks @benwtrent -- this is very helpful. I think I understand better :)
I see that we are forced to conflate quantization and distance metric because we want to compute the distance metric directly in quantized space using efficient SIMD instructions.
I.e. one might ideally consider quantization mathematically as being its own problem regardless of distance metrics: you map float32 down to int8 or int4 etc., and it's bidirectional. You can re-map a quantized vector back to float32 (introducing quantization noise of course). But that would be too inefficient (de-quantizing then computing distance metrics in float32 space) ... so we must conflate.
Does the Lucene implementation have per-dimension calculation of quantiles/min/max? Or is it really a global min/max computed across all dimensions and all vectors' values in those dimensions? Scalar-quantization-101 seems to imply it's the latter. Do ML vectors "typically" have similar distributions of values within each dimension?
Could someone summarize what is actually wrong with Lucene here? From my luceneutil tests it looks like int8 with DOT_PRODUCT is buggy / does not work (.006 recall on cohere-768-IP where float32 gets 0.957).

E.g. int8 quantization only works if your vectors are unsigned, or int8 quantization only works with VectorSimilarityFunction.EUCLIDEAN or so?
I know we are discussing how to make the conflated quantization and distance metric combinations math work out (i.e. how to fix the issue), but I'm just trying to get the big picture of what the issue even is. How would we explain the buggy behavior to users?
@mikemccand
Does the Lucene implementation have per-dimension calculation of quantiles/min/max? Or is it really a global min/max computed across all dimensions and all vectors' values in those dimensions?
Indeed, it is the latter, over all dimensions.
Do ML vectors "typically" have similar distributions of values within each dimension?
For some vectors, this is true, for others, not so. A good example of vectors where this is NOT the case is CohereV2. However, the error distribution per dimension is independent and given the central limit theorem, the errors do not compound.
I know we are discussing how to make the conflated quantization and distance metric combinations math work out (i.e. how to fix the issue), but I'm just trying to get the big picture of what the issue even is. How would we explain the buggy behavior to users?
The scalar quantizer currently assumes the destination byte values are unsigned. However, Java cannot have unsigned byte values that are > 127. Consequently, when trying to utilize the full number of bits in int8, the scoring falls apart.
There are actually two ways to fix this bug:
Looking back, when I finished up the ScalarQuantization support for int4, I should have disallowed 8 bits as a parameter to prevent this confusion. I allowed it and I shouldn't have.
The scalar quantizer currently assumes the destination byte values are unsigned. However, Java cannot have unsigned byte values that are > 127. Consequently, when trying to utilize the full number of bits in int8 the scoring falls apart.
OK, thanks. Regardless of the distance metric, int8 is buggy today. (Hmm, are there any special cases where int8 is working? I see blog posts about using int8 in Elasticsearch...). I would think even if your vectors are e.g. always non-negative in all values, int8 quantization will still try (in a buggy way) to use the full 8 bits, and will still be buggy.
Looking back, when I finished up the ScalarQuantization support for int4, I should have disallowed 8 bits as a parameter to prevent this confusion. I allowed it and I shouldn't have.
Well since it doesn't work at all today, I think we can freely change/fix the behavior, e.g. in 9.12 or even a 9.11.2? We could simply remove support for int8? Anyone who is using it today is getting terrible results, so they really should not be using it, and on upgrade, when they see Lucene no longer supports it, they can fix their code to either go to int4 or int7?
We could (later, not rushing) try to explore the fixes you suggested above to actually get it working, but given the discussions so far, the math sounds complicated :)
However, Java cannot have unsigned byte values that are > 127.
Yeah this is always so annoying :) And if you try to negate the special -128 byte value, you get -128 back. And (byte) Math.abs(-128) also returns -128. This makes for an evil interview line of questioning...
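A quick demo of that foot-gun (plain Java; easy to check in jshell):

```java
byte b = -128;
System.out.println((byte) -b);             // -128: negating promotes to int 128, the byte cast wraps back
System.out.println((byte) Math.abs(-128)); // -128: abs returns int 128, the byte cast wraps
System.out.println(Math.abs(-128));        // 128: fine while it stays an int
```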
@MilindShyani I expanded your equation myself again to double check.
Seems similar?
(NOTE: scale is just the inverse of alpha).
It seems to me that:

byte = (float - minQuantile) * scale - c

This implies that the float is actually:

alpha * (byte + c) + minQuantile
So that this is more readable, let's assume our two byte values are x & y, m = minQuantile, a = alpha, and the corrective constant is c.

Expanded out algebra:
a^2 c^2 + a^2 c x + a^2 c y + a^2 x y + 2 a c m + a m x + a m y + m^2
- a^2 c^2 can be pre-calculated. Since both correction offsets will have this, I will simply divide it by 2.
- a^2 c x & a^2 c y can be pre-calculated and added.
- 2 a c m is precalculated, but since both corrections will have it, let's divide by 2 again.
Let's make the assumption that the other values are already calculated.
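As a sketch of how those pieces could combine per vector (illustrative names, not the actual Lucene code; it sums the expansion over all dimensions and ignores the rounding-error compensation discussed below):

```java
// a = alpha, m = minQuantile, c = the signed correction, d = q.length.
// Per-vector offset: a^2*c*S + a*m*S plus half of d*(a^2 c^2 + 2acm + m^2),
// where S is the sum of the vector's quantized components; splitting the
// shared constants in half means the two offsets contribute them exactly once.
static float correctiveOffset(byte[] q, float a, float m, float c) {
  int s = 0;
  for (byte v : q) s += v;
  int d = q.length;
  return a * a * c * s + a * m * s + d * (a * a * c * c + 2 * a * c * m + m * m) / 2f;
}

// Reconstructed float dot product: a^2 * (signed int8 dot) + both offsets.
static float reconstructedDot(byte[] qx, byte[] qy, float offX, float offY, float a) {
  int dot = 0;
  for (int i = 0; i < qx.length; i++) dot += qx[i] * qy[i];
  return a * a * dot + offX + offY;
}
```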
I am getting positive results with this. I will return back once I run some tests.
@naveentatikonda @mikemccand https://github.com/apache/lucene/compare/main...benwtrent:fix-8-bit?expand=1
That is my current WIP
With this change, I get the following recall@100 over 500_000 coherev2 (using max-inner-product as designed for the model):

| recall | maxConn | beamWidth |
|---|---|---|
| 0.872 | 16 | 500 |

Int8 with my change:

| recall | maxConn | beamWidth |
|---|---|---|
| 0.841 | 16 | 500 |
So, recall is better than before for max-inner-product, but still worse than int7. I am wondering if we are running into numeric limitations, because the signed correction could get very large.
@benwtrent The new numbers look good. But I have a few questions based on your code changes:
1. Why are we setting SIGNED_CORRECTION as 127 instead of 128?
2. Shouldn't the min be -128?
3. In this new code we are not adding the rounding error corrective offset ((dx - dxq) * dxq) like you are doing in the regular implementation. Is this not required? Last time when I tried to run a test without this rounding error corrective offset using 7 bits there was a huge drop in recall.

@naveentatikonda this is a WIP :).
Why are we setting SIGNED_CORRECTION as 127 instead of 128 ?
My logic here was scaling by what we will round by. But thinking about it, I think we should scale down by 128.
...the min should be -128
I disagree. We want -127, 127. Otherwise we have to handle the exotic -128 * -128 case when doing vector comparisons, and this disallows some nice SIMD optimizations down the road.
In this new code we are not adding this...
This is a wip...I didn't figure out yet how to incorporate this rounding error as it assumes unsigned operations and has a knock-on effect on the buckets. So, simply adding the naive rounding error would reduce recall significantly. I am wondering how to apply it. Will report back if I can figure it out :).
I pushed a change to my branch.
Recall over the same data is now: 0.877.
I disagree. We want -127, 127. Otherwise we have to handle the exotic -128 * -128 case when doing vector comparisons, and this disallows some nice SIMD optimizations down the road.

@benwtrent Sorry, didn't get it. Can you pls share more details or links related to this.
- Applies a modification of the rounding error compensation. The idea behind how I applied the different compensation now is that we need to handle the linear scaling as this can cause us to snap to different buckets. I am still not 100% sure my transformation is correct there.
@MilindShyani Can you check and comment on this? (dx - dxq) * (dxq - SIGNED_CORRECTION * alpha) is the rounding error compensation added by Ben.
I need to know more about signed representation (Two's vs Ones' complement), but can't we map 0 - 255 to -127 to 128 using shifts? If we do so,
For unsigned to signed value range conversions you can just flip the sign bit; for the reverse case, add the min_signed value to the signed value to convert it back to the unsigned value range.
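If I read the suggestion right, a small demo (illustrative; XOR-ing the sign bit is equivalent to subtracting 128):

```java
int unsigned = 200;                      // quantized value in 0..255
byte signed = (byte) (unsigned ^ 0x80);  // 72 == 200 - 128, now in -128..127
int back = (signed & 0xFF) ^ 0x80;       // 200 again
```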
@jbhateja could you unpack how this would actually work when using dot-product with linear scale corrections?
I would imagine we could switch to an "unsigned byte" comparison and thus handle the bit flipping directly in SIMD, but this would introduce a new vector comparison that needs to be added.
I am not sure how we can just "flip the bit" while maintaining distance comparisons for euclidean & dotProduct without introducing error or slowing things down significantly.
@benwtrent Apologies for the late response! I am traveling (and marveling at 2000-year-old pyramids) right now. The transformation you wrote indeed matches mine. Thinking about the bucketing correction meanwhile, will post shortly. Thanks!
@MilindShyani no worries dude! Enjoy your vacation and don't stress about this, it will still be here when you get back to the office :)
Hello!
So I just did the calculation and I get the following
1) I get an innocuous sign discrepancy, i.e., the term -SIGNED_CORRECTION * alpha should instead be +SIGNED_CORRECTION * alpha. I will double check this.
2) I get an extra term. This is rather serious since it would also affect the unsigned math. My correction looks like
(dx - dxq)*(dxq + \alpha * c + \alpha * min_quantile)
The last term is the extra term, which is present even when c = Signed_correction = 0. What do you think of the extra term @benwtrent ?
@MilindShyani OK, I did some more benchmarking.
I tried switching to + and with your full correction term, and recall significantly dropped to 0.518.

I tried the following permutations and they all gave the same recall of 0.877:

- (dx - dxq) * (dxq - SIGNED_CORRECTION * alpha)
- (dx - dxq) * (dxq - SIGNED_CORRECTION * alpha - alpha * minQuantile)
- (dx - dxq) * (dxq - SIGNED_CORRECTION * alpha + alpha * minQuantile)
This makes me think the additional term of alpha * minQuantile is bringing nothing to the table, even for non-unit vectors (where I would expect quantiles & alpha to be more extreme).
Hi @benwtrent, I am trying to carefully check the math vs code. The first thing I noticed is that my definitions of dxs and dxq don't necessarily match with yours. That might be the origin for the sign error. I will uniformize the notation and get back.
But meanwhile, in the latest code https://github.com/apache/lucene/compare/main...benwtrent:fix-8-bit?expand=1 I find that

float dxs = scale * dxc;

but scale is defined as 127/(max - min). Shouldn't the 127 actually be 255? FYI that's what I used in my transformation, i.e., first map the float to 0 to 255 and then shift by 128.
@MilindShyani that is just a bad comment. All that will have to be cleaned up. The 127 is when the int7 is the default.
You can see in the ctor, we use the provided bits to create the appropriate number.
The math is quite simple as we saw above, but the code (I guess because it's trying to do 7 and 8 bits at the same time) is giving me a really hard time 😅
How is the quadratic term being calculated? Is it dx_q * dy_q using the new definition of the _q variables, or using the old? Recall that the new definition is dx_q = Math.round(dx * scale - c) * alpha while the old definition was dx_q = Math.round(dx * scale) * alpha.
In any case, you have worked with the code far longer than I have, so if you are confident about it please go ahead and commit :)
In any case, you have worked with the code far longer than I have, so if you are confident about it please go ahead and commit :)
@naveentatikonda could you test my branch as well? I want to make sure my numbers were not flukes on my end.
We should also test with different data sets & different similarity metrics. We should verify that euclidean is also fixed, etc.
@naveentatikonda could you test my branch as well? I want to make sure my numbers were not flukes on my end.
@benwtrent Apologies for my late response. I wasn't feeling well last week. Will run the tests and share the results this week.
@benwtrent I ran some tests with the changes in your branch for 8 bits, and the recall for L2 is almost the same as what you got. But recall for the innerproduct and cosinesimilarity space types is still bad.
| SpaceType | Dataset | Recall |
|---|---|---|
| L2 | Cohere | 0.86 |
| InnerProduct | Cohere | 0.0026 |
| Cosine | glove-200 | 0.0037 |
Besides this, I observed another recall issue with 4 bits for cosinesimilarity when tested with glove-200 dataset.
| Bits | Compress | Recall |
|---|---|---|
| 7 | False | 0.72 |
| 4 | False | 0.107 |
| 4 | True | 0.089 |
@naveentatikonda cohere v2 (768 dims)?
I was getting decent recall with dot product over various datasets. Maybe I have a data set sampling bias.
I will try the others you mentioned.
I just noticed that I might not have pushed up my branch. But I will rerun my tests to verify:
https://github.com/apache/lucene/compare/main...benwtrent:lucene:fix-8-bit
As for int4, I wouldn't expect it to have very good recall with any statically set confidence interval. But I can attempt to replicate your results.

Also, there should be NO difference in recall between compressed or not. Something weird is happening there.
@naveentatikonda cohere v2 (768 dims)?
yes
I was getting decent recall with dot product over various datasets. Maybe I have a data set sampling bias.
- Inner Product: cohere-wiki-1m
- L2: Cohere-wiki, but with 475K vectors
- Cosine: GloVe-200-Angular
@naveentatikonda AH, I see what I did: I pushed one of my experiments to that branch, not an actual good change. Sorry for the false alarm. I will correct asap.
I just noticed that I might not have pushed up my branch. But I will rerun my tests to verify:
I can test it again with your changes once you push it.
As for int4, I wouldn't expect it to have very good recall with any statically set confidence interval. But I can attempt to replicate your results.
No, it was tested with dynamic confidence interval. Just now I reran for 4 bits without compress using 0 as confidence interval and the recall is still just 0.13.
Also, there should be NO difference in recall between compressed or not. Something weird is happening there.
Oh I see...let me test them again
@naveentatikonda AH, I see what I did, I pushed one of my experiments to that branch not an actual good change. Sorry for the false alarm. i will correct asap.
No worries, pls let me know once they are ready. I will test them again.
@naveentatikonda OK, I have pushed to https://github.com/apache/lucene/compare/main...benwtrent:lucene:fix-8-bit?expand=1
I am getting similar recall to int7, but honestly, it doesn't seem better at all and at times worse. It just doesn't seem worth the complexity from a recall perspective unless we can figure out why recall isn't significantly better.
@naveentatikonda I opened an issue for the int4 & glove200 results. Interesting to be sure. I wonder if we are suffering because it's a statistics-based model, or if it's just due to the lower dimension count: https://github.com/apache/lucene/issues/13614
One interesting finding, is statically setting the confidence interval very low (lower than is currently allowed in Lucene) makes recall way better.
FWIW, this is the opposite of what we found from transformer based models, where the dynamic interval was almost a necessity.
I am getting similar recall to int7, but honestly, it doesn't seem better at all and at times worse. It just doesn't seem worth the complexity from a recall perspective unless we can figure out why recall isn't significantly better.
Makes sense. But have you tested with other metric types, and is there any improvement in recall for innerproduct or cosine?
Description
Based on some of the benchmarking tests that I ran from OpenSearch, there is a significant drop in recall (approx. 0.03) for 8 bits, irrespective of space type, confidence interval, or other parameters. For the same configuration, the recall for 7 bits is at least greater than 0.85.
Root Cause
As part of quantization, after normalizing each dimension of the vector into the [0 to 2^bits - 1] range, we cast it into a byte to bring it into the byte range of [-128 to 127]. For 7 bits, we normalize each value into 0 to 127, which is already in the byte range, so there is no rotation or shifting of data. But for 8 bits, any vector dimension which falls within the 128 to 255 range after normalization changes sign and magnitude when it is cast into a byte, which leads to a non-uniform shifting or distribution of the data. As per my understanding, this is the potential root cause for the huge drop in recall.
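A two-line illustration of that wraparound (plain Java):

```java
int small = 120, large = 200;  // both normalized into 0..255, small < large
byte qs = (byte) small;        // 120
byte ql = (byte) large;        // -56: sign and magnitude change on the cast
// qs > ql even though small < large, so byte-space distances are scrambled
```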
To validate this, I updated the quantization code and tested it against the L2 space type by linearly shifting (subtracting 128) each dimension after normalizing it into the 0 to 255 range, such that each dimension is uniformly distributed and within the byte range of -128 to 127 (finally rounding and clipping it to handle edge cases). With these changes, we get a min. recall of 0.86 for the same configuration.
Note - the below pseudo code is not a fix; it is a different quantization technique used only to validate the root cause. It works only for the L2 space type, because L2 is shift invariant while other space types like cosinesimil and inner product are not.
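A minimal sketch of that shift-based technique (a reconstruction from the description above, not the exact snippet that was posted):

```java
// L2-only validation: normalize each dimension into 0..255, shift by 128
// into -128..127, then round and clip to handle edge cases.
static byte quantizeShifted(float v, float min, float max) {
  float normalized = (v - min) / (max - min) * 255f; // into [0, 255]
  float shifted = normalized - 128f;                 // into [-128, 127]
  int rounded = Math.round(shifted);
  return (byte) Math.max(-128, Math.min(127, rounded));
}
```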
@benwtrent @mikemccand Can you please take a look and confirm if you see this issue when tested with lucene-util?