Add Scalar Quantization codec for Vectors

benwtrent commented 1 year ago

Description

Having copy-on-write segments lends itself nicely with quantization. I propose we add a new "scalar" or "linear" quantization codec. This will be a simple quantization codec provided in addition to the existing HNSW codec.

So, why a new codec?

Most users don't know all their vector distributions ahead of time. Lucene's segment structure is a nice fit for evolving quantiles.
Can provide 4x memory reduction footprint for floats.
Search latencies can decrease significantly as comparing scalar bytes vs floats is faster (not to mention the float buffer decoding overhead disappears)

What would be required for the new codec:

New settings for the user to set, the quantile considered when making scalar quantization (absolute min/max, 99th quantile, 90th, etc.)
Quantized vectors will have to be stored
Quantization information will be stored
Originally provided vectors will have to be stored. We need this on segment merges as percentiles could be different between segments early in the index lifecycle. We would expect over time that re-quantization at segment merge will not be required as percentiles will level out.

Index time concerns:

I am not sure we can create the HNSW graph until all vectors are quantized. Some experimentation will have to be done here. It may be that creating the graph in a streaming fashion and then quantizing the vectors later works fine.
Segment merges require re-quantization of raw vectors if quantiles change significantly. This might be tunable. If the quantiles only change by some small factor (1e-5), re-quantization wouldn't be required.

What would happen at search:

For each segment, the query vector will be quantized according to the segment quantile info
Search by default would search over quantized vectors scoring similarities. Potentially scaling the scores to be within non-quantized ranges

Some open questions:

How do we handle byte[] searches & vectors?
- We could quantize those into half-bytes. Or just reject them for now.
Do we want to provide an option to re-score over non-quantized vectors? Or provide a way to access them directly?
How do we handle scoring from similarities?
- Dot-product between non-quantized floats and quantized bytes are very different. Though, I think this is solvable (meaning scores will be within similar ranges for non-quantized and quantized) for this codec.
Would we benefit from having a “quantization limit” ? Meaning a segment isn’t quantized until some limit is reached? To support this, scoring would have to be within similar ranges as you don’t want scoring to be dramatically different between quantized and regular vectors.

Useful resources: https://qdrant.tech/articles/scalar-quantization/ https://zilliz.com/blog/scalar-quantization-and-product-quantization

jimczi commented 1 year ago

I am not sure we can create the HNSW graph until all vectors are quantized. Some experimentation will have to be done here. It may be that creating the graph in a streaming fashion and then quantizing the vectors later works fine.

That appears to be a sound approach. Generally, I believe we can separate the process of constructing the graph from the graph traversal itself. As an example, take DiskANN, which constructs the graph based on the original vectors and subsequently conducts graph traversal for search using quantized vectors via Product Quantization (PQ). This approach seems reasonable because the graph exclusively holds neighbors, enhancing precision due to its origination from the original vectors. During search, it might be suitable to employ an alternate strategy for computing similarity.

The primary challenge I foresee is if we were to apply distinct quantization boundaries for each segment. Merging the similarities resulting from these varied quantizations could be intricate, given their operation on differing scales. Perhaps we could mitigate these disparities by implementing rescoring using the original vectors at a segment level? That might be too costly though. Another possibility would be to compute the boundaries regularly but not on every segment/merge or using half-float instead of bytes so that we can apply the same reduction for all segments?

msokolov commented 1 year ago

Q: have we considered choosing an initial quantization based on the first segment seen and using that for all subsequent segments? Or: providing an API where quantization parameters can be provided? These kinds of approaches would seem to offer increased benefits in that we would not need to store the original vectors (since re-quantization would never be required). Looking at product-quantization schemes based on kmeans clustering this seems to be the usual approach (train the kmeans offline on a subset of the vectors, and then use them for all the vectors).

benwtrent commented 1 year ago

have we considered choosing an initial quantization based on the first segment seen and using that for all subsequent segments?

This may not be possible. What if they flush the first segment only after 10 docs? We need a representative random sample for this to at least work. I am not sure how to make sure that happens.

providing an API where quantization parameters can be provided?

This assumes that the user already knows about their vectors (or at least many of them) and built quantization well ahead of time.

Having it in segments directly allows the data to evolve over time without users having to worry about it.

IMO, this is no different than somebody just doing quantization outside of Lucene and indexing the byte values directly.

These kinds of approaches would seem to offer increased benefits in that we would not need to store the original vectors (since re-quantization would never be required)

I don't think these ideas make any such guarantees. We just don't bother with doing any work for the user. And thus just say "y'all handle it user". Either by making sure they give us a nice representative random sample in the first segment or by just quantizing outside of lucene.

Looking at product-quantization schemes based on kmeans clustering this seems to be the usual approach

Correct, using Lloyd's algorithm. As the dataset becomes stable as more data is seen, Lloyds could run more efficiently as better initial centroids are known from previous runs. It lends itself to distributed work over time.

But, I want to focus on scalar quantization first if possible :).

benwtrent commented 1 year ago

I have done a poor job of linking against the related work for bringing vector lossy-compression.

The initial implementation of adding int8 (really, its int7 because of signed bytes...): https://github.com/apache/lucene/pull/12582

A significant refactor to make adding new quantized storage easier: https://github.com/apache/lucene/pull/12729

benwtrent commented 11 months ago

This was shipped in Lucene 9.9 as the Lucene99HnswScalarQuantizedVectorsFormat codec format!

apache / lucene

Add Scalar Quantization codec for Vectors #12497

Description