apache / lucene


Multi-value Support for KnnVectorField #12313

Open alessandrobenedetti opened 1 year ago

alessandrobenedetti commented 1 year ago

Description

It would be nice to support multiple values in a Knn vector field. This must be compatible with both exact and approximate nearest-neighbor search.

There are two sides to the coin:

1) Index-time support - allowing the indexing data structures to store multiple vectors for the same field and docID
2) Query-time support - how to retrieve a topK list of documents, where each document may have multiple vectors that are neighbors of the query

The problem is more complicated than it seems.

An initial tentative design and draft implementation are attached.
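For illustration, a sketch of what the desired index-time API could look like; note that today Lucene rejects a second vector value under the same field name, so the snippet below fails at addDocument():

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.ByteBuffersDirectory;

public class MultiValuedVectorSketch {
  public static void main(String[] args) throws Exception {
    float[] para1 = {0.1f, 0.2f, 0.3f};
    float[] para2 = {0.4f, 0.5f, 0.6f};
    try (IndexWriter writer = new IndexWriter(
        new ByteBuffersDirectory(), new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      // One embedding per paragraph, all under the same field name.
      // Today Lucene rejects this document at addDocument() time:
      // only one vector value per field per document is allowed.
      doc.add(new KnnFloatVectorField("paragraphs", para1, VectorSimilarityFunction.COSINE));
      doc.add(new KnnFloatVectorField("paragraphs", para2, VectorSimilarityFunction.COSINE));
      writer.addDocument(doc);
    }
  }
}
```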

msokolov commented 1 year ago

I see a lot of good work on the implementation in the attached PR, great! What I'm lacking, though, is an understanding of what the use cases for this might be. Do we have some? I think it's important to have at least some envisioned so we can decide how to proceed with the implementation, since there will undoubtedly be tradeoffs.

HoustonPutman commented 1 year ago

@alessandrobenedetti's Berlin Buzzwords talk gave a pretty good example. If you want to have individual vectors for each paragraph, then you would either need to split your document up per-paragraph, or you would need a multi-valued field for your paragraph vectors.

But I can imagine other usages as well. There are users who store lots of information about people in each document to represent a group: names, emails, etc. If you have vectorized each person for personalization, then that document would also need a vector per person.

benwtrent commented 1 year ago

There are also late-interaction models that produce embeddings per token. While the current HNSW codec wouldn't be the best fit for that, it is another use case for multiple embeddings per document.

uschindler commented 1 year ago

I have a customer using Solr to do similarity search for trademark images. Each trademark has several images, and they want to find the trademark with the closest image match (cosine distance). They currently use an outdated homemade plugin based on BinaryDocValues that scans through them linearly (a modified and fixed LireSolr), but to speed up search, kNN looks like the right choice. However, for faceting they want hits counted on the trademarks, not on the images.

benwtrent commented 1 year ago

So, I have been thinking about the current implementation and was wondering if we could instead move towards using the join functionality?

Just to make sure I am not absolutely crazy.

uschindler commented 1 year ago

I would still prefer to have multiple values per document. From an implementation point of view this does not look crazy to me, but using block joins adds too many limitations, and people often don't want to use them for other reasons.

The implementation as suggested by @alessandrobenedetti looks great to me and is in line with other multi-valued fields in Lucene. Just my comments after watching his talk and skimming through the PR:

I think the biggest problem of the current PR is that ordinals need to be "long" as the number of vectors may go beyond Integer.MAX_VALUE.
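To make the capacity concern concrete (numbers are illustrative):

```java
// A Lucene segment can address at most Integer.MAX_VALUE documents.
long maxDocs = Integer.MAX_VALUE;   // 2_147_483_647
long vectorsPerDoc = 10;            // illustrative: e.g. one vector per paragraph
long totalVectors = maxDocs * vectorsPerDoc; // ~21.4 billion
// totalVectors > Integer.MAX_VALUE, so int ordinals overflow:
// vector ordinals would need to be widened to long.
```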

alessandrobenedetti commented 1 year ago

I'll have to spend more brain time on the proposed block-join alternative, but isn't it already "available" in that form? (with the consequent problems and benefits of joins?) When I have more time I'll give a more informed opinion!

benwtrent commented 1 year ago

I'll have to spend more brain time on the proposed block-join alternative, but isn't it already "available" in that form? (with the consequent problems and benefits of joins?)

The key issue is document collection. Right now, the topK is limited to only topK children documents. Really, what you want is the topK parent documents based on children scores.

david-sitsky commented 7 months ago

The key issue is document collection. Right now, the topK is limited to only topK children documents. Really, what you want is the topK parent documents based on children scores.

Elasticsearch seems to support the idea of using the "block join" approach as outlined in this blog post from a couple of weeks ago: https://www.elastic.co/search-labs/blog/articles/chunking-via-ingest-pipelines, although from what I can see, it will suffer from the same issue @benwtrent mentions, where the topK is applied to child documents.

edit: I've just seen the PR which seems to address this?

alessandrobenedetti commented 7 months ago

Hi @david-sitsky, the multi-valued vectors contribution to Lucene is now paused for lack of funding. I'll resume it from my side if and when I get some sponsors :)

The nested-documents approach, on the other hand, was released with Lucene 9.8! You can read the various considerations that apply in this thread!
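For readers landing here, a rough sketch of that nested-documents (block-join) approach using the join APIs released in Lucene 9.8 (field names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.BitSetProducer;
import org.apache.lucene.search.join.DiversifyingChildrenFloatKnnVectorQuery;
import org.apache.lucene.search.join.QueryBitSetProducer;

class NestedVectorsSketch {
  // Index each passage as a child document; the parent must be the
  // last document in the block.
  static void indexBlock(IndexWriter writer, float[][] passageVectors) throws Exception {
    List<Document> block = new ArrayList<>();
    for (float[] v : passageVectors) {
      Document child = new Document();
      child.add(new KnnFloatVectorField("passage_vector", v));
      block.add(child);
    }
    Document parent = new Document();
    parent.add(new StringField("docType", "parent", Field.Store.NO));
    block.add(parent);
    writer.addDocuments(block);
  }

  // Query time: collect the top-k *parents*, keeping only the
  // best-scoring child vector per parent.
  static Query topKParents(float[] queryVector, int k) {
    BitSetProducer parentsFilter =
        new QueryBitSetProducer(new TermQuery(new Term("docType", "parent")));
    return new DiversifyingChildrenFloatKnnVectorQuery(
        "passage_vector", queryVector, /* childFilter */ null, k, parentsFilter);
  }
}
```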

vigyasharma commented 2 months ago

What are some use-cases for multi-valued vectors that are not easily supported using parent-child block joins?

I'd like to contribute here, and I'm trying to understand our main requirements given that we now have parent-child joins in knn. I suppose block joins require reindexing all child documents with each update. Is that the main overhead we'd like to avoid with multi-valued vectors? Are there query-time scenarios that block joins cannot fulfill?

alessandrobenedetti commented 2 months ago

Hi, I gave a talk about this at Berlin Buzzwords where I touched on the motivations: https://www.youtube.com/watch?v=KhL0NrGj0uE

benwtrent commented 2 months ago

I do think things like ColBERT would benefit from having multiple vectors for a single document field.

One crazy idea I had (others have probably already thought of this and found it wanting...) is: since HNSW already supports non-Euclidean spaces, what if HNSW graph nodes simply represented more than one vector?

Then the flat storage system and underlying scorer could handle the distance computations, and HNSW itself wouldn't actually have to change.

I could see this maybe hurting recall, but I wonder how badly it would actually hurt things in practice.

The idea would be:

HNSW doesn't actually look at the vectors at all; it simply provides an ordinal and requests a score, so the code change wouldn't be too bad, I think.
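A minimal sketch of that idea (names and interfaces hypothetical, not existing Lucene APIs): the graph hands the scorer a node ordinal, and the scorer resolves it to however many vectors the node holds and aggregates, e.g. with max-sim:

```java
import java.io.IOException;

// Hypothetical storage abstraction: HNSW only sees ordinals; the flat
// storage layer resolves an ordinal to one-or-more vectors.
interface MultiVectorValues {
  int vectorCount(int ord) throws IOException;            // vectors stored for this node
  float[] vectorValue(int ord, int idx) throws IOException;
}

final class MaxSimScorer {
  private final MultiVectorValues values;
  private final float[] query;

  MaxSimScorer(MultiVectorValues values, float[] query) {
    this.values = values;
    this.query = query;
  }

  /** Score a graph node as the max similarity over all of its vectors. */
  float score(int ord) throws IOException {
    float max = Float.NEGATIVE_INFINITY;
    for (int i = 0; i < values.vectorCount(ord); i++) {
      max = Math.max(max, dotProduct(query, values.vectorValue(ord, i)));
    }
    return max;
  }

  private static float dotProduct(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }
}
```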

vigyasharma commented 2 months ago

@benwtrent: ++, I've been thinking along similar lines in the context of e-commerce-type applications, where different vectors represent different aspects of a document. The scorer can do a weighted dot-product across the different vectors.

I like the wider generalization here of using max/min/sum/... aggregations. Another thing to consider is that the multi-vector scorer will also need to handle similarity computations between nodes during graph build. For example, if the aggregation is max, would we need to compute the distance between n x n vectors and then take the max?
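Roughly, under a max aggregation that pairwise computation would look like this sketch (illustrative, not Lucene code):

```java
// Node-to-node similarity during graph build under a "max" aggregation:
// compare every vector of one node with every vector of the other
// (n x m dot products) and keep the best.
static float maxSim(float[][] nodeA, float[][] nodeB) {
  float max = Float.NEGATIVE_INFINITY;
  for (float[] a : nodeA) {
    for (float[] b : nodeB) {
      float dot = 0;
      for (int i = 0; i < a.length; i++) dot += a[i] * b[i];
      max = Math.max(max, dot);
    }
  }
  return max;
}
```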

benwtrent commented 2 months ago

if the aggregation is max, would we need to compute the distance between n x n vectors and then take the max?

Correct, and I would even want flexibility between what is used to build the graph vs. how we query it.

This may be a good enough trade-off against recall. I would hope that adding additional connections in the graph would offset any recall differences we would see when combining vectors in a single node vs. each vector being its own node. All this requires testing and prototyping; I just haven't had the time to dig in deeply.

krickert commented 1 month ago

I was thinking about this and thought this would be cool, with a few different use cases for a multi-valued vector:

  1. The multi-values are treated the same as a single value, except once a document is found to be among the nearest K, it won't repeat. For example: Doc1 has vectors A1, A2, and A3; Doc2 has vectors B1 and B2; and Doc3 has C1. A vector search is performed, and the k nearest vectors are: A1, A2, C1, B2, B1, A3.

In one scenario, the search results would be the same as above, and the docs would repeat.

In another scenario, the results would just return the top doc and not repeat it. So a kNN result would be: Doc1 (A1 won), Doc3 (C1 won), Doc2 (B2 won).

...

In another option, we could look into combining the vectors by taking the average, min, or max across each dimension and indexing just that single avg/min/max vector. I think this might be a bit weird, though, since you can already do these calculations yourself at index time. But just a thought...
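That client-side reduction is simple enough to do before indexing; an illustrative sketch (not Lucene-specific):

```java
// Collapse a document's vectors into one by averaging each dimension,
// done client-side before indexing a single-valued vector field.
static float[] elementwiseAverage(float[][] vectors) {
  float[] avg = new float[vectors[0].length];
  for (float[] v : vectors) {
    for (int i = 0; i < avg.length; i++) avg[i] += v[i];
  }
  for (int i = 0; i < avg.length; i++) avg[i] /= vectors.length;
  return avg;
}
```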

Are any of the existing proposals similar to what I'm suggesting?

vigyasharma commented 1 month ago

In another scenario, the results would just return the top doc and not repeat it.

I believe this is what the parent-block join implementation for vector values does currently. Collected vector values are deduped within the DiversifyingNearestChildrenKnnCollector, and we only keep the top scoring doc per parent.

In one scenario, the search results would be the same as above, and the docs would repeat.

So the idea is that the query returns the same document multiple times, with different scores? I'd worry that we'd lose top-k slots to multiple scores from the same document, which we'd probably want to aggregate per document anyway before using them (similar to the parent-block join approach).

benwtrent commented 1 month ago

@vigyasharma @krickert There are a couple of ways (at least to me) to implement this natively in Lucene (without the join block restrictions).

  1. Have each individual vector be a node in the graph, with some resolution back to the original doc. One concern I have with this is that vector ordinals will now have to be stored as long values, effectively doubling heap requirements and increasing off-heap storage.
  2. Have documents be the vertices in the graph and consider only the max-sim other docs for the connections. My concern here is recall. It could be that the graph is too sparse, as now passages don't actually have maxConn connections.
  3. Have documents be vertices, but ensure that there are connections relevant for each passage (this would require adjustments to the HNSW builder). My concern here would be the complexity in the graph builder. It doesn't seem insurmountable, and it would alleviate the recall concern in point 2.

All of this is also predicated on a nice iterator API. I would expect us to have to update Float|ByteVectorValues to have a new "isMultiVector" flag or something, and to adjust the iterators so that you can iterate by doc and then iterate all the vectors in that doc.
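A sketch of the shape such an API could take (entirely hypothetical names, not a concrete proposal):

```java
import java.io.IOException;

// Hypothetical extension of the vector-values iteration contract:
// advance doc-by-doc as today, then iterate the vectors within a doc.
abstract class MultiFloatVectorValues {
  /** Whether this field stores more than one vector per document. */
  abstract boolean isMultiVector();

  /** Advance to the next document that has at least one vector. */
  abstract int nextDoc() throws IOException;

  /** Number of vectors stored for the current document. */
  abstract int docVectorCount() throws IOException;

  /** The idx-th vector of the current document, 0 <= idx < docVectorCount(). */
  abstract float[] vectorValue(int idx) throws IOException;
}
```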

vigyasharma commented 1 month ago

@benwtrent I like the idea of having documents be vertices in the graph, with an API that lets you iterate/access the different vectors per doc. It would have an indexing-time similarity used during graph build (e.g. max-sim), and could allow a different similarity to be used at query time (e.g. a weighted dot-product across the different vectors). I suppose this is option 2 that you describe above?

Not sure I understand option 3. Are you thinking that the graph has different types of edges between documents, based on different similarity functions? So if you were using max similarity you would follow one path, but if you wanted avg similarity, you might use a different set of edges?

I started a prototype that's close to Option-2, will try to get it to a shareable state soon.

benwtrent commented 1 month ago

Not sure I understand option 3. Are you thinking that the graph has different types of edges between documents, based on different similarity functions? So if you were using max similarity you would follow one path, but if you wanted avg similarity, you might use a different set of edges?

Option 3 is the same as option 2, except during graph building we ensure that the connectivity between the vertices (still docs with multi-vectors) reflects the connections that would have been built if each vector were a vertex. This would significantly increase document connections and could slow down search/index times, but it would (conceivably) improve recall.

Maybe option 2 is enough, and the recall & speed tradeoff is still good enough given our current parameters (maxConn & beamWidth).

vigyasharma commented 4 days ago

Raised https://github.com/apache/lucene/pull/13525 with the approach we discussed above. We retain the 1:1 graph-node-ordinal-to-document mapping and define a new FlatVector reader, writer, and scorer to work with multiple vector values per doc.

This is a big change, and it's an early PR to get feedback on the overall approach (tests and some cleanup are pending). If we agree on the idea, I'm happy to re-raise it as separate, split PRs with the different changes.

Would love to get your thoughts on this, @benwtrent, @msokolov, @alessandrobenedetti.