alessandrobenedetti opened this issue 1 year ago
I see a lot of good work on the implementation in the attached PR, great! What I'm lacking, though, is any understanding of what the use cases for this might be. Do we have some? I think it's important to have at least some envisioned so we know how to approach the implementation, since there will undoubtedly be tradeoffs.
@alessandrobenedetti's Berlin Buzzwords talk gave a pretty good example. If you want to have individual vectors for each paragraph, then you would either need to split your document up per-paragraph, or you would need a multi-valued field for your paragraph vectors.
But I can imagine other usages as well. There are users who store lots of information for people in each document to represent a group. So names, emails, etc. If you have vectorized each person for personalization, then that document would also need a vector per person.
There are also late-interaction models that produce embeddings per token. While the current HNSW codec wouldn't be best for that, it is another use case for multiple embeddings per document.
I have a customer using Solr to do similarity search for trademark images. Each trademark has several images, and they want to find the trademark with the closest image match (cosine distance). They currently use an outdated homemade plugin that stores vectors in BinaryDocValues and linearly scans through them (a modified and fixed LireSolr), but to speed up search, kNN looks like the right choice. But for faceting they want hits counted on the trademarks, not on the images.
So, I have been thinking about the current implementation and was wondering if we could instead move towards using the join functionality? Just to make sure I am not absolutely crazy.

join already requires children and parent documents to be indexed next to each other (parent docs as the last doc in the child&parent block). We could add a NeighborQueue that tracks the topK parent documents (a ParentNeighborQueue?). This queue would require a BitSet of the parent doc ids. Then, when a child doc is visited via the graph, we can check its score and determine its parent block via BitSet#nextSetBit. Tracking how many parents we have visited and their scores wouldn't be too bad from that avenue. The parent's score could then be the mean, min, or max of the discovered children during the graph exploration.
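A minimal sketch of the parent-resolution step described above, assuming a BitSet in which only parent doc ids are set; the class and method names here are hypothetical, not actual Lucene API:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: resolve a visited child doc to its parent block and
// keep one aggregated score per parent (max shown; mean/min are analogous).
class ParentAwareQueue {
  private final BitSet parentBits; // set bits mark parent doc ids
  private final Map<Integer, Float> bestScorePerParent = new HashMap<>();

  ParentAwareQueue(BitSet parentBits) {
    this.parentBits = parentBits;
  }

  // Called for every child doc visited during the graph search.
  void collect(int childDocId, float score) {
    // Parents are indexed after their children, so the parent of a child
    // is the next set bit at or after the child's doc id.
    int parentDocId = parentBits.nextSetBit(childDocId);
    bestScorePerParent.merge(parentDocId, score, Math::max);
  }

  // How many distinct parents have been visited so far.
  int visitedParents() {
    return bestScorePerParent.size();
  }
}
```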
I would still prefer to have multiple values per document. From the point of view of implementation this does not look crazy to me, but using block joins adds too many limitations, and often people don't want to use them for other reasons.
The implementation as suggested by @alessandrobenedetti looks great to me and is in line with other multi-valued fields in Lucene. Just my comments after watching his talk and skimming through the PR:
I think the biggest problem of the current PR is that ordinals need to be "long", as the number of vectors may go beyond Integer.MAX_VALUE.
I'll have to spend more brain time on the proposed block-join alternative, but isn't it already "available" in that form? (with the consequent problems and benefits of joins?) When I have more time I'll give a more informed opinion!
> I'll have to spend more brain time on the proposed block-join alternative, but isn't it already "available" in that form? (with the consequent problems and benefits of joins?)
The key issue is document collection. Right now, the topK is limited to only topK children documents. Really, what you want is the topK parent documents based on children scores.
Elasticsearch seems to support the idea of using the "block join" approach, as outlined in this blog post from a couple of weeks ago: https://www.elastic.co/search-labs/blog/articles/chunking-via-ingest-pipelines, although from what I can see, it will suffer from the same issue @benwtrent mentions, where the topK will be applied to child documents.
edit: I've just seen the PR which seems to address this?
Hi @david-sitsky, the multi-valued vectors contribution to Lucene is now paused for lack of funding. I'll resume it from my side if and when I get some sponsors :)
The nested documents approach, on the other hand, has been released with Lucene 9.8! You can read the various considerations that apply in this thread!
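For reference, a rough (untested) usage sketch of the parent-join approach that shipped in Lucene 9.8; the field names, the prepared searcher, and queryVector are illustrative assumptions:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.join.BitSetProducer;
import org.apache.lucene.search.join.DiversifyingChildrenFloatKnnVectorQuery;
import org.apache.lucene.search.join.QueryBitSetProducer;

// Identify parent docs (each parent is indexed last in its block).
BitSetProducer parentsFilter =
    new QueryBitSetProducer(new TermQuery(new Term("docType", "parent")));

// Collect the k best parents, each scored by its best-matching child vector.
Query knnQuery =
    new DiversifyingChildrenFloatKnnVectorQuery(
        "paragraph_vector", queryVector, /* childFilter= */ null, /* k= */ 10, parentsFilter);

TopDocs hits = searcher.search(knnQuery, 10);
```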
What are some use-cases for multi-valued vectors that are not easily supported using parent-child block joins?
I'd like to contribute here, trying to understand our main requirements given we now have parent-child joins in knn. I suppose block joins require re-indexing all child documents with each update. Is that the main overhead we'd like to avoid with multi-valued vectors? Are there query-time scenarios that block joins cannot fulfill?
Hi, I gave a talk about this at Berlin Buzzwords where I touched on the motivations: https://www.youtube.com/watch?v=KhL0NrGj0uE In short:
I do think things like ColBERT would benefit from having multiple vectors for a single document field.
One crazy idea I had (others have probably already thought of this, and found it wanting...) is: since HNSW supports non-Euclidean spaces already, what if HNSW graph nodes simply represented more than one vector?
Then the flat storage system and underlying scorer could handle the distance computations and HNSW itself doesn't actually have to change.
I could see this maybe hurting recall, but I wonder how badly it would actually hurt things in practice.
The idea would be that HNSW doesn't actually look at the vectors at all; it simply provides an ordinal and requests a score, so the change in regards to code wouldn't be too bad, I think.
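To make the idea concrete, a hedged sketch of what such a scorer could look like: HNSW hands over an ordinal, the flat storage resolves it to that doc's vectors, and the aggregation happens here (the class name and shape are hypothetical):

```java
import org.apache.lucene.index.VectorSimilarityFunction;

// Hypothetical per-node scorer: a graph node owns several vectors, and its
// score against the query is an aggregate over per-vector similarities.
class MultiVectorNodeScorer {
  private final VectorSimilarityFunction similarity;

  MultiVectorNodeScorer(VectorSimilarityFunction similarity) {
    this.similarity = similarity;
  }

  // Max aggregation; sum or mean would be drop-in alternatives.
  float score(float[] query, float[][] nodeVectors) {
    float best = Float.NEGATIVE_INFINITY;
    for (float[] v : nodeVectors) {
      best = Math.max(best, similarity.compare(query, v));
    }
    return best;
  }
}
```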
@benwtrent: ++, I've been thinking along similar lines in the context of e-commerce-type applications where different vectors represent different aspects of a document. The scorer can do a weighted dot-product across different vectors.
I like the wider generalization here, of using max/min/sum/... aggregations. Another thing to consider is that the multi-vector scorer will also need to handle similarity computations between nodes during graph build. For example, if the aggregation is max, would we need to compute the distance between n x n vectors and then take the max?
> if the aggregation is max, would we need to compute distance between n x n vectors and then take the max?
Correct, I would even want flexibility between what was used to build the graph vs. how we query it.
This may be a good enough trade-off against recall. I would hope that adding additional connections in the graph would offset any recall differences we would see when combining vectors in a single node vs. each vector being its own node. All this requires testing and prototyping; I just haven't had the time to dig deeply.
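For illustration, the n x n computation being discussed would look roughly like this for two multi-vector nodes (a sketch, assuming max aggregation):

```java
import org.apache.lucene.index.VectorSimilarityFunction;

// Hypothetical node-to-node similarity for graph build: compare every vector
// of one doc against every vector of the other and keep the max.
static float maxSim(float[][] docA, float[][] docB, VectorSimilarityFunction sim) {
  float best = Float.NEGATIVE_INFINITY;
  for (float[] a : docA) {
    for (float[] b : docB) { // n x m similarity computations
      best = Math.max(best, sim.compare(a, b));
    }
  }
  return best;
}
```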
I was thinking about this and thought this would be cool with a few different use cases for a multi-valued vector:
In one scenario, the search results would be the same as above, and the docs would repeat.
In another scenario, the results would just return the top doc and not repeat it. So a KNN result would be: Doc1 (A1 won), Doc3 (C1 won), Doc2 (B2 won).
...
In another option, we could look into indexing an aggregated vector: take the average, min, or max across each dimension and index just that single vector. I think this might be a bit weird, though, since you can do these calculations yourself at index time. But just a thought...
Are any of the suggestions similar to what I'm suggesting?
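The index-time aggregation idea mentioned above could be as simple as pooling per dimension before indexing; a minimal sketch (mean pooling shown; min/max are analogous):

```java
// Hypothetical index-time pooling: collapse a doc's vectors into one by
// averaging each dimension, and index only the pooled vector.
static float[] meanPool(float[][] vectors) {
  int dim = vectors[0].length;
  float[] pooled = new float[dim];
  for (float[] v : vectors) {
    for (int i = 0; i < dim; i++) {
      pooled[i] += v[i];
    }
  }
  for (int i = 0; i < dim; i++) {
    pooled[i] /= vectors.length;
  }
  return pooled;
}
```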
> In another scenario, the results would just return the top doc and not repeat it.
I believe this is what the parent-block join implementation for vector values does currently. Collected vector values are deduped within the DiversifyingNearestChildrenKnnCollector, and we only keep the top scoring doc per parent.
> In one scenario, the search results would be the same as above, and the docs would repeat.
So the idea is that the query returns the same document multiple times, with different scores? I'd worry that we'd lose top-k slots to same-document scores, which we'd probably want to aggregate per document anyway before using (similar to the parent-block join approach).
@vigyasharma @krickert There are a couple of ways (at least to me) to implement this natively in Lucene (without the block join restrictions):

1) Each vector becomes its own node in the graph. Ordinals would then need to be long values, effectively doubling heap requirements and increasing off-heap storage.
2) Each document remains a single node that holds multiple vectors, and the scorer aggregates over them; each doc still gets maxConn connections.
3) Same as option 2, but the graph is built with the connectivity each vector would have had as its own node.

All this also is predicated on a nice iterator API. I would expect us to have to update Float|ByteVectorValues to have a new "isMultiVector" or something and adjust the iterators so that you can iterate by doc, and then iterate all the vectors in a doc.
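The iterator shape hinted at here might look something like the following; all names are hypothetical and not the actual Lucene API:

```java
import java.io.IOException;

// Hypothetical multi-vector extension of the vector-values iterator:
// advance per doc, then iterate the vectors stored within that doc.
interface MultiFloatVectorValues {
  // Whether this field stores more than one vector per doc.
  boolean isMultiVector();

  // Advance to the next doc that has at least one vector.
  int nextDoc() throws IOException;

  // Number of vectors stored for the current doc.
  int vectorCount();

  // The i-th vector of the current doc.
  float[] vectorValue(int i) throws IOException;
}
```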
@benwtrent I like the idea of having documents be vertices in the graph, with an API that lets you iterate/access the different vectors per doc. It would have an indexing-time similarity used during graph build (e.g. max-sim), and could allow a different similarity to be used at query time (e.g. weighted dot-product across different vectors). I suppose this is option 2 you describe above?
Not sure I understand option 3. Are you thinking that graph has different types of edges b/w documents based on diff. similarity functions? So if you were using max similarity you would follow one path, but if you wanted avg. similarity, you might use a different set of edges?
I started a prototype that's close to Option-2, will try to get it to a shareable state soon.
> Not sure I understand option 3. Are you thinking that graph has different types of edges b/w documents based on diff. similarity functions? So if you were using max similarity you would follow one path, but if you wanted avg. similarity, you might use a different set of edges?
Option 3 is the same as option 2, except that during graph building we ensure the connectivity between the vertices (still docs with multi-vectors) reflects the connections that would have been built if each vector were its own vertex. This would significantly increase document connections and could slow down search/index times, but would (conceivably) improve recall.
Maybe option 2 is enough, and the recall & speed tradeoff is still good enough given our current parameters (maxConn & beamWidth).
Raised https://github.com/apache/lucene/pull/13525 with the approach we discussed above. We retain 1:1 graph node ordinal to document mapping, and define a new FlatVector reader, writer and scorer to work with multiple vector values per doc.
This is a big change and it's an early PR to get feedback on the overall approach (tests and some cleanup pending). If we agree with the idea, I'm happy to reraise separate split PRs with different changes.
Would love to get your thoughts on this, @benwtrent, @msokolov, @alessandrobenedetti.
Description
It would be nice to support multiple values in a Knn vector field. This must be compatible with both exact and approximate nearest-neighbor search.
There are two sides to the coin:
1) Index-time support: allow the indexing data structures to store multiple vectors for the same field and docID
2) Query-time support: how to retrieve a topK list of documents, where each document may have multiple neighbors to the query
The problem is more complicated than it seems.
An initial tentative design and a draft implementation are attached.