Can we store only quantized vectors to reduce disk footprint?

Rassyan commented 4 days ago

Description

In light of optimizing disk usage for KNN vector searches in Lucene, I propose considering a new KnnVectorsFormat class in Lucene that handles only quantized vectors, eliminating the need to store original float32 vectors. This approach could significantly reduce disk usage, with potential reductions similar to the memory efficiency seen in int8 quantization scenarios, where usage can drop to about 25%. This figure is illustrative, emphasizing that actual savings could vary with different quantization methods and storage configurations.

I seek community feedback on:

The technical feasibility of this new storage model.
Potential impacts on search accuracy and performance.

Your insights will help determine the viability of this approach for enhancing Lucene's vector search capabilities.

benwtrent commented 4 days ago

The technical feasibility of this new storage model.

Very feasible. However, it gets tricky on accuracy and the quantization methods will require updating to handle this. But all the current (and hopefully future) quantization methodologies can easily "re-hydrate" vectors so that iterating floats is still possible (folks give floats, we should be able to return floats if they want them).

Potential impacts on search accuracy and performance.

For higher fidelity quantization methods, I expect accuracy to be OK. I am not 100% sure with how things are now if we would ever want to quantize below int7 for storage. Int4 as it stands (when using the dynamic confidence interval), would suffer greatly without some more significant improvement to the algorithm.

Rassyan commented 4 days ago

Excuse my ignorance, but I was wondering...

quantization methodologies can easily "re-hydrate" vectors so that iterating floats is still possible

Could you elaborate on the computational costs associated with this? If the need to retrieve floats from users is not present, is it feasible to skip this rehydration step and directly use the quantized vectors for distance calculations?

higher fidelity quantization methods

So, would int7 be considered a higher fidelity quantization method? Based on your experience and insights, how would you rate the fidelity of int7, int4, and binary quantization methods? Where do they stand in terms of maintaining accuracy while optimizing storage efficiency?

Has the Lucene community already planned or discussed the implementation of a dedicated KnnVectorsFormat for handling only quantized vectors? Are there quick support mechanisms for users who are willing to compromise on accuracy for significant savings in disk space and do not require the original vectors?

vigyasharma commented 3 days ago

If the need to retrieve floats from users is not present, is it feasible to skip this rehydration step and directly use the quantized vectors for distance calculations?

Quantized vectors are used directly for distance calculations today. We keep the full precision values around for recalculating quantiles during merging. @benwtrent has a nice blog post that explains this – https://www.elastic.co/search-labs/blog/scalar-quantization-in-lucene

It could be useful to enable saving disk space at the cost of slower retrieval of float full precision vectors. Some use-cases might have uniform data distributions anyway.

dungba88 commented 3 days ago

Note that we would also need the original float32 vectors for re-ranking in case it's needed (e.g maybe int4) (see https://github.com/apache/lucene/issues/13564). But if someone prefers less storage and is okay with low recall, then dropping the original vectors at searching side (if the indexing and searching are separated) could be fine. I think we still need for indexing and merging as vigyasharma@ comment.

There are also several related ideas:

benwtrent commented 3 days ago

I think we still need for indexing and merging as vigyasharma@ comment.

I don't know if its strictly necessary to keep the raw vectors for merging. Once a certain limit is reached, especially for vectors that play well with quantization, you can likely throw the vectors away even on the indexing side. It would just require work, and I wouldn't want it to be the default behavior.

I will answer the wall of questions from @Rassyan. But, most of these would be discoverable through exploring the quantization methodology for yourself.

Could you elaborate on the computational costs associated with this?

Its just the inverse of the quantization method. For every vector, you iterate its components inverting the quantization step.

So, would int7 be considered a higher fidelity quantization method?

I would think so, unless you had a different one in mind?

Based on your experience and insights, how would you rate the fidelity of int7, int4, and binary quantization methods?

For pure score correlations, what we have seen:

int7 ~ 95-99%% score correlations
int4 ~ 85-90% score correlations
bit ~ 70-80+ score correlations

There are vectors that are aggressively AGAINST quantization (e.g. GIST & Glove) and perform way worse than all other vectors. But modern transformer architectures perform way better.

Where do they stand in terms of maintaining accuracy while optimizing storage efficiency?

Just do the math ;)

Has the Lucene community already planned or discussed the implementation of a dedicated KnnVectorsFormat for handling only quantized vectors?

There has been discussions around having the floating point removed for "search segments" but keeping them for "index segments"

Are there quick support mechanisms for users who are willing to compromise on accuracy for significant savings in disk space and do not require the original vectors?

I don't even know what this means.

jpountz commented 3 days ago

In my opinion, we should not have lossy codecs. This creates weird situations where the errors could compound in weird ways over time, e.g. when you switch file formats.

I'd rather like it to be done on top of the codec API. E.g. computing a good scalar quantization for a given model offline, and then using it in the way in to index vectors directly as byte[].

mikemccand commented 3 days ago

I'd rather like it to be done on top of the codec API. E.g. computing a good scalar quantization for a given model offline, and then using it in the way in to index vectors directly as byte[].

Lucene's HNSW KNN already supports this -- users can provide either byte[] or float[] input vectors.

So a user could pre-quantize their float[] vectors, with full knowledge of the model that produced those vectors and a "good" quantization approach (instead of Lucene's per-segment quantile based estimation), turn those into byte[] to send to Lucene.

mikemccand commented 3 days ago

Also, note that, at least for the current scalar quantization (int7, int4), those full precision float[] vectors remain on disk during searching. They are only used during indexing (merging) to recompute the quantiles and re-quantize the vectors more accurately (possilby) for that newly merged segment.

For apps using near-real-time segment replication (the best/preferred replication approach IMO), it may be possible eventually to strip out the float[] from the Lucene index sent to searchers, but this has only been discussed so far -- no implementation -- I'll try to track down the issue (@benwtrent also referred to this above).

mikemccand commented 3 days ago

Ahh sorry @dungba88 also referenced the issue above! https://github.com/apache/lucene/issues/13158

benwtrent commented 3 days ago

In my opinion, we should not have lossy codecs. This creates weird situations where the errors could compound in weird ways over time, e.g. when you switch file formats.

I do think we should consider adding support for half-floats. Regardless of them being able to utilize hard-ware accelerated comparisons or not, it would greatly reduce the disk footprint.

The contract with the user for half-floats would be pretty straight forward as they specify that they want to store them as such.

The segment based replication solution is a very interesting one. Though for higher compression ratios (e.g. 1 or 2 bits), you generally need the rescoring. Though, conceivably, the codec could quantize the vectors twice (int7, then to bit) and allows you to rerank bit quantized values with int7 quantized values...quantization all the way down

apache / lucene

Can we store only quantized vectors to reduce disk footprint? #14007

Description