apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.67k stars 1.03k forks source link

Remove VectorValues.SimilarityFunction.NONE [LUCENE-10015] #11054

Closed asfimport closed 3 years ago

asfimport commented 3 years ago

This stuff is HNSW-implementation specific. It can be moved to a codec parameter.

The NONE option should be removed: it just makes the codec more complex.


Migrated from LUCENE-10015 by Robert Muir (@rmuir), resolved Jul 22 2021 Pull requests: https://github.com/apache/lucene/pull/219, https://github.com/apache/lucene/pull/223

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1 I was hoping we could remove NONE as well and make the iterator over vector values only relevant for merging, not searching, so that it doesn't have to worry about indexing skip lists to make advance() efficient.

asfimport commented 3 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I'm fine with removing NONE but I don't think that would eliminate the need for indexing skip lists since you could still have a document that has no vector value for a field?

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

In this case, nextDoc would hopefully return the single matching doc without linearly checking every doc. When I mentioned skip lists, I was thinking of how advance(int target) allows skipping over many documents efficiently.

asfimport commented 3 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

I was initially hesitant about removing the option for "NONE", since it gives a convenient way to store/ load vectors when you just need them for ranking. Otherwise many users could be writing the same encoding logic to store vectors in binary doc values. But Adrien mentioned an idea that I found helpful: if there's future feedback that it would be useful, we could consider adding convenience utilities/ wrappers for doc values to make this easy.

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit 28ba8b77975fc1b5a1a07da373916a2b21ea09aa in lucene's branch refs/heads/main from Adrien Grand https://gitbox.apache.org/repos/asf?p=lucene.git;h=28ba8b7

LUCENE-10015: Remove VectorSimilarityFunction#NONE. (#219)

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I merged the removal of NONE.

I'm not familiar enough with vectors use-cases to have an opinion on whether the similarity function should be configurable easily, or whether we'd be good with always using euclidean distance, or whether this should be a codec parameter. [\~sokolov][\~julietibs] Do you have an opinion on this?

asfimport commented 3 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

For what it's worth - I don't have strong experience in this field, so it's just my intuition - I suppose "casual users" are not likely to care about what similarity function (distance function) is used and the choice of it wouldn't have a large effect on the final search results; I think we don't necessarily need to expose it on the Document or Field level API. Meanwhile, "advanced users" (e.g. researchers or algorithm engineers) who have the skills to modify the code and build Lucene on their own may want to tune it. For such advanced use-cases, it might be good to make it configurable via the Format constructor?

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks Tomoko. I opened a PR to see how it played out and it looks like making the similarity function a codec parameter works rather nicely. We're losing the ability to validate that all segments have the same similarity function, but it's probably ok as changing the similarity function is a super expert option?

asfimport commented 3 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

I think it makes sense to keep the ability to configure the similarity function at the field level. I don't see it as a very 'expert' option – based on what the vectors represent and how they've been processed, it's necessary to use the right similarity function to obtain good results. Also unlike 'maxConn' and 'beamWidth' (which were specific to our HNSW implementation), it's a concept that makes sense across NN algorithms generally. To the best of my knowledge, many NN algorithms can handle the full set of common similarity functions (Euclidean, dot product, cosine).

In case it's helpful context: currently we only support Euclidean and cosine distance, which is technically redundant. For cosine similarity, users could normalize the vectors to unit length and use Euclidean. But I'm assuming we'll add support for inner product too, which seems very popular and cannot be expressed in terms of Euclidean distance. The FAISS library currently supports only Euclidean distance and inner product.

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

If there's agreement we should keep the similarity function, then let's resolve this issue as the other thing we discussed, the removal of NONE, has already been merged?

asfimport commented 3 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

I am fine with the current field API, sorry for the disturbance. Besides whether the similarity function should be configurable by users, I agree with it is surely search-algorithm independent feature - most ann/knn algorithms are distance-measure agnostic in my understanding (maybe Humming distance would be also worth supported).

asfimport commented 3 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Yes, let's keep the similarity function! It's an essential component of a metric space (or non-metric space), and is implicit in the vectors: you can't just use any arbitrary function, as Julie says.

 

It makes my life a little more complicated that we removed NONE since it simplifies schema management at the next level. You can define a schema with fields of type vector, and then separately say whether you want to support KNN search for such a field. We used to support such vectors with BinaryDocValues, and then switched to this impl. Now we will have to switch back. But I guess we are now saying these fields are only to be be used for NN search, so it makes sense, and I won't block it.

asfimport commented 3 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

In case it's helpful context: currently we only support Euclidean and cosine distance, which is technically redundant

We actually only support Euclidean and dot product! Sorry if this caused confusion, I have no idea why I thought we implemented cosine instead.

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.0.0 release