apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.45k stars 975 forks source link

Deprecate COSINE VectorSimilarity function #13308

Closed Pulkitg64 closed 3 weeks ago

Pulkitg64 commented 2 months ago

Description

Tries to solve #13281 by deprecating COSINE VectorSimilarity function for Vector Search.

Pulkitg64 commented 2 months ago

TestInt8HnswBackwardsCompatibility.java and TestBasicBackwardsCompatibility test cases failing for Lucene versions older than LUCENE_10_0_0. For older Lucene version, we try to read Index directory from resources folder. For example for Lucene-9.10.0, the directory path is located at lucene/backward-codecs/src/test/org/apache/lucene/backward_index/index.9.10.0-cfs.zip.

As per my understanding the segment present at above directory already has KNNVectorField with MAXIMUM_INNER_PRODUCT as vector similarity function and in above test we try to change similarity function to DOT_PRODUCT due to which there is conflict happening when adding more docs to it and hence these test cases failing.

jpountz commented 2 months ago

Oh okay, that means we need to remove support for COSINE just from indexing side not from searching side?

Correct. Practically, this means keeping the enum constant but marking it as deprecated, and modifying IndexingChain to fail using COSINE on 10.x Lucene indexes.

It would be nice to also add a new file format under lucene/codecs that does cosine similarity, like we just added for hamming distance. This could help with the migration for some users.

Pulkitg64 commented 2 months ago

Thanks @jpountz for confirming and suggestions.

Pulkitg64 commented 1 month ago

Thanks @benwtrent for all the feedback. I have raised another revision just to mark the COSINE function as deprecated. I didn't find any internal usages of this function. In a followup PR, I will remove COSINE from writes for LUCENE version >=10.0.0

benwtrent commented 1 month ago

I didn't find any internal usages of this function.

That doesn't make sense to me. IntelliJ tells me there are over 30 usages of VectorSimilarityFunction.COSINE.

Pulkitg64 commented 1 month ago

I didn't find any internal usages of this function.

That doesn't make sense to me. IntelliJ tells me there are over 30 usages of VectorSimilarityFunction.COSINE.

Sorry I think I am missing something, I think we cannot remove COSINE from those places, since we require them in LUCENE version < 10.0. They are mostly being used in FieldInfosFormat, VectorReader, ScalarQuantizedVectorWriter classes.

I think the main task is to first get rid of dependency on enum for VectorSimilarityFunction and instead have some interface to get those functions and support.

benwtrent commented 1 month ago

@Pulkitg64 could you deprecate also VectorUtil.cosine? That is technically a public API.

Pulkitg64 commented 1 month ago

Hi @benwtrent, could we merge this change, if everything looks good to you?

benwtrent commented 3 weeks ago

@Pulkitg64 sorry for radio silence, I was out of office and swamped with other things. If you could move the change log to 9.12, I can merge and backport.

Pulkitg64 commented 3 weeks ago

I somehow messed up with my branch while pulling changes from main branch. @benwtrent I have created a new PR for marking the cosine VectorSImilarity function as deprecated.

https://github.com/apache/lucene/pull/13473

Sorry for the mess!