Allow representing a document with multiple embeddings (dense vectors)

joshdevins commented 3 years ago

Currently the dense_vector field is a single-valued field. This is a limitation that forces a document to be repeated or split up into multiple documents when it's necessary to have multiple embeddings represent an entire document. This can be cumbersome and introduces either duplication of data or complexity for the application indexing documents and embeddings.

A common scenario for this is when using embeddings to retrieve or rerank documents that have first been split into passages [1]. Each embedding is a representation of a passage (of roughly paragraph length) and document ranking can use, for example, the score of the best matching passage. Other approaches (ColBERT [2]) represent text using a bag of term embeddings, in which case a passage itself is represented by multiple embeddings.
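To make the "score of the best matching passage" idea concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the scoring function and uses made-up two-dimensional vectors; the names `cosine`, `best_passage_score`, and the toy `docs` data are illustrative only and are not part of any Elasticsearch API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_passage_score(query_vec, passage_vecs):
    """Score a document by its single best-matching passage."""
    return max(cosine(query_vec, p) for p in passage_vecs)


# Hypothetical toy data: each document is a list of passage embeddings.
docs = {
    "doc-a": [[1.0, 0.0], [0.0, 1.0]],
    "doc-b": [[0.6, 0.8]],
}
query = [0.0, 1.0]

# Rank documents by the score of their best passage, highest first.
ranking = sorted(docs, key=lambda d: best_passage_score(query, docs[d]), reverse=True)
print(ranking)
```

Other aggregations (for example, the mean of the top few passages) would slot into `best_passage_score` the same way; the maximum is simply the variant mentioned above.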

Some initial ideas to improve this:

A multi-valued dense_vector field.
Perhaps, as with rank_features, another field type that supports n vectors/embeddings: dense_vectors
A matrix field type, since embeddings for a document share the same dimensionality. This introduces the possibility to also perform matrix operations between documents or between a static/query matrix and a document matrix for ranking tasks. An alternative to this would be to support tensors of 1, 2, or 3 dimensions (for example), which is likely more appropriate than a matrix.
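As an illustration of the kind of query-matrix versus document-matrix operation a matrix or tensor field could enable, the following is a rough sketch of a late-interaction style score in the spirit of ColBERT [2]: for each query term vector, take the best dot product against any document term vector, then sum those maxima. The function names and the toy matrices are assumptions made for this example, and nothing here reflects an existing Elasticsearch feature.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def max_sim(query_matrix, doc_matrix):
    """Sum over query vectors of the best dot product with any document vector."""
    return sum(max(dot(q, d) for d in doc_matrix) for q in query_matrix)


# Hypothetical toy matrices: one row per term embedding.
query_matrix = [[1.0, 0.0], [0.0, 1.0]]
doc_matrix = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.25]]

score = max_sim(query_matrix, doc_matrix)
print(score)
```

Because every row has the same dimensionality, the same computation can be expressed as a matrix product followed by a row-wise maximum, which is why a matrix or tensor type is a natural fit.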

[1] Pretrained Transformers for Text Ranking: BERT and Beyond, Section 3.3 Multi-Stage Ranking Architectures: From Passage to Document Ranking
[2] ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

elasticmachine commented 3 years ago

Pinging @elastic/es-search (Team:Search)

madisonb commented 2 years ago

Our team would love to have this feature as well. Our use case is similar in that we have a document with an array of images that we want to turn into vectors to search through them. We lose all of the other information about the document if we split the images into their own index, but would love to support a simple use case like so:

# multi-image tweet
{
  "body content": "<tweet content>",
  "author": "me",
  ...
  "images": [
    {
      "url": "<image_link>",
      "vec": "<image vector>"
    },
    {...},
    {...}
  ]
}

Given the vast range of other data types that support multi-valued fields, it was surprising to run into such a limitation.

S-Dragon0302 commented 1 year ago

@mayya-sharipova Is there a plan for which version of Elasticsearch this feature will ship in?

benwtrent commented 1 year ago

@S-Dragon0302 for https://github.com/elastic/elasticsearch/issues/72068 I am digging into it. No release version yet.

Could you mention your use case here and what your goals are with using this feature?

S-Dragon0302 commented 1 year ago

> @S-Dragon0302 for #72068 I am digging into it. No release version yet.

> Could you mention your use case here and what your goals are with using this feature?

Basically, a message published on a social platform contains multiple images. To find the message by image, I need to compare the query image against each image in the message. As long as any one of them is similar, the message should be returned as a hit.

S-Dragon0302 commented 1 year ago

ColeDCrawford commented 1 year ago

Yeah, the above chart would be ideal! I also have a use case for this. I'm interested in hybrid search that combines the new learned sparse encoder for text with vector-based image search. I have a database of artwork with text metadata and multiple images per artwork. For example: https://harvardartmuseums.org/collections/object/170157. One work, but 1..N images.

karmi commented 1 year ago

Hello @benwtrent, any news about the progress of the feature? This would be really useful, as many use cases need to split documents into sentences/paragraphs for embedding...

benwtrent commented 1 year ago

@karmi we are currently implementing a way to utilize nested fields for multiple dense_vector values via Lucene: https://github.com/apache/lucene/pull/12434. This way we have the ability to have "passage vectors" (passage & dense_vector) stored at the same level allowing for effective retrieval augmented generation.

We aim to make this seamless and simple to use in Elasticsearch once we have it in Lucene.

karmi commented 1 year ago

Thanks for the update, @benwtrent! This looks promising!

saiparsa commented 10 months ago

@benwtrent, thank you so much for your contributions to this project! Just checking in to see if there's an update or if you might have an idea of when you'll have a chance to look at this further. Thanks again!

benwtrent commented 10 months ago

@saiparsa see: https://github.com/elastic/elasticsearch/pull/99763 & https://www.elastic.co/blog/adding-passage-vector-search-to-lucene

Merged into the 8.11 branch. I hope to have some things written up about how and when to use it in Elasticsearch.

It is not the final word on using multiple vectors in a single document, but it's a step in the right direction.

We went nested first as it fit nicely with Lucene & fit naturally in the "passage vector" search case.

Other use cases, like ColBERT, will still need some work.

dkarlovi commented 10 months ago

@benwtrent you're saying ES will support dense vectors in type: nested starting with 8.11? If you get some docs started, please ping this issue too, thank you! :1st_place_medal:

benwtrent commented 10 months ago

@dkarlovi https://www.elastic.co/guide/en/elasticsearch/reference/8.11/knn-search.html#nested-knn-search

Note 8.11 is unreleased. Until release, this is subject to change. But this should give you an idea.
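For readers who want a rough picture before opening the linked docs, here is a sketch of what a nested mapping and a kNN search over it might look like, written as Python dictionaries that would be serialized to JSON request bodies. The field names (`full_text`, `creation_time`, `paragraph`, `vector`, `text`), the 3-dimensional vectors, and the parameter values are assumptions for illustration; as noted above, the exact shape is subject to change until release, so treat the linked documentation as the authority.

```python
import json

# Assumed mapping: each top-level document holds a nested list of passages,
# and every passage carries its own dense_vector.
mapping = {
    "mappings": {
        "properties": {
            "full_text": {"type": "text"},
            "creation_time": {"type": "date"},
            "paragraph": {
                "type": "nested",
                "properties": {
                    "text": {"type": "text"},
                    "vector": {
                        "type": "dense_vector",
                        "dims": 3,
                        "index": True,
                        "similarity": "cosine",
                    },
                },
            },
        }
    }
}

# Assumed search body: kNN over the nested vector field, returning
# top-level documents rather than individual passages.
search = {
    "fields": ["full_text", "creation_time"],
    "_source": False,
    "knn": {
        "field": "paragraph.vector",
        "query_vector": [0.1, 0.2, 0.3],
        "k": 2,
        "num_candidates": 10,
    },
}

print(json.dumps(mapping, indent=2))
print(json.dumps(search, indent=2))
```

The mapping body would go to the index creation endpoint and the search body to `_search`; both are plain JSON, so any client can send them.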

serenachou commented 8 months ago

Shouldn't we close this out? @julio-santana @benwtrent as done?

benwtrent commented 8 months ago

I am not sure. We support nested dense vectors, but I don't think that is the end of the work.

We should still support multiple vectors in the field directly, without nesting.

giannik commented 7 months ago

So if I understand correctly, with this approach we can create, per document, a list of passages (paragraphs), each with its own embedding (using a service like an OpenAI embedding model), and then store them in the existing document as nested objects without the need to create multiple separate documents per embedding. Correct? I was also looking for this kind of solution for an existing Elasticsearch implementation, to simply extend the existing document with nested embeddings. Are there any drawbacks to be aware of concerning performance or search quality?

benwtrent commented 7 months ago

> So if I understand correctly, with this approach we can create, per document, a list of passages (paragraphs), each with its own embedding (using a service like an OpenAI embedding model), and then store them in the existing document as nested objects without the need to create multiple separate documents per embedding. Correct?

Correct.
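A minimal sketch of that workflow: split the text into passages, embed each one, and attach the results to the parent document as a list of nested objects. The `toy_embed` function is a hypothetical stand-in for a real embedding service and does not produce meaningful embeddings; the field names `paragraph`, `text`, and `vector` are likewise assumptions.

```python
def split_into_passages(text):
    """Naive splitter: blank lines separate passages."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def toy_embed(passage):
    """Hypothetical placeholder for an embedding model call (not a real embedding)."""
    return [float(len(passage)), float(passage.count(" ") + 1)]


def build_document(title, text, embed=toy_embed):
    """Return one document whose passages and vectors live in a nested list."""
    return {
        "title": title,
        "paragraph": [
            {"text": p, "vector": embed(p)} for p in split_into_passages(text)
        ],
    }


doc = build_document("Example", "First passage here.\n\nSecond one.")
print(doc)
```

The document-level metadata (`title` here) is stored once, while each passage keeps its own text and vector side by side.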

> Are there any drawbacks to be aware of concerning performance or search quality?

For vector search, it will diversify the nearest passages by their parent documents. This will mean we will keep searching until we find k nearest documents, not just passages. This will cause more exploration and additional vector comparisons.
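The extra exploration can be illustrated with a toy, brute-force sketch: walk the passages from best to worst and stop only once k distinct parent documents have been seen. This is not how the approximate HNSW search in Lucene is implemented; it only shows why collecting k documents can require looking at more than k passages. All names and data below are made up for the example.

```python
def nearest_parents(scored_passages, k):
    """Collect the best passage for each of the first k distinct parents.

    scored_passages: iterable of (score, parent_id, passage_id), best first.
    Returns the per-parent best passage and how many passages were examined.
    """
    best = {}
    examined = 0
    for score, parent_id, passage_id in scored_passages:
        examined += 1
        if parent_id not in best:
            best[parent_id] = (score, passage_id)
            if len(best) == k:
                break
    return best, examined


# Hypothetical hits, already sorted by score: document A owns the top three.
hits = [
    (0.99, "A", "a1"),
    (0.97, "A", "a2"),
    (0.95, "A", "a3"),
    (0.90, "B", "b1"),
    (0.80, "C", "c1"),
]

parents, examined = nearest_parents(hits, k=2)
print(parents, examined)
```

A plain top-2 over passages would have stopped after two hits, both from document A; requiring two distinct documents means four passages are examined here.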

Additionally, with nested fields, everything in the index becomes "sparse", meaning not every document contains every field. So, there are certain optimizations (for term queries, specific aggregations, etc.) we cannot apply. However, whether this will actually affect you depends on the case.

dkarlovi commented 7 months ago

@benwtrent

> with nested fields, everything in the index becomes "sparse", meaning not every document contains every field. So, there are certain optimizations (for term queries, specific aggregations, etc.) we cannot apply. However, whether this will actually affect you depends on the case.

Would that mean you'd suggest keeping this data in a separate index since, IIUC, it could affect even queries not using KNN?

benwtrent commented 7 months ago

> Would that mean you'd suggest keeping this data in a separate index since, IIUC, it could affect even queries not using KNN?

Not necessarily. One of the biggest benefits is that metadata for docs & other fields are all queriable at the same time as kNN. It just depends on what you are searching and what your goals are.
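As a sketch of what querying metadata together with kNN could look like, the request body below adds a `filter` on a top-level date field to a kNN search over a nested vector field. The index layout and the field names (`paragraph.vector`, `creation_time`, `full_text`) and all values are assumptions for illustration, so check the kNN search documentation for the exact options available in your version.

```python
import json

# Assumed search body: nearest passages restricted to parent documents
# whose top-level metadata matches the filter.
search = {
    "fields": ["full_text", "creation_time"],
    "_source": False,
    "knn": {
        "field": "paragraph.vector",
        "query_vector": [0.1, 0.2, 0.3],
        "k": 2,
        "num_candidates": 10,
        "filter": {
            "bool": {
                "filter": [
                    {
                        "range": {
                            "creation_time": {
                                "gte": "2019-05-01",
                                "lte": "2019-05-05",
                            }
                        }
                    }
                ]
            }
        },
    },
}

print(json.dumps(search, indent=2))
```

With passages split into a separate index, the same restriction would require copying `creation_time` onto every passage document.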

ColeDCrawford commented 7 months ago

The 8.11 kNN docs mention that

> inner_hits for kNN will only ever return a single hit, the nearest passage vector. Setting "size" to any value greater than 1 will have no effect on the results.

That's a fairly big downside for me and maybe a reason I would skip nesting entirely. It would be great to set up paragraph-level or token-length chunking for an article or even a book and be able to get not only the relevance of the top-level work but the most relevant passages. A very relevant work could have many such passages.

benwtrent commented 7 months ago

@ColeDCrawford what do you think of this: https://github.com/elastic/elasticsearch/pull/104006

Approximate kNN search gathers the nearest docs by their nearest passage
We can then gather and score inner hits from those nearest docs
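One way to read those two steps, sketched in plain Python with exact rather than approximate search: first pick the top documents by their single nearest passage, then score every passage inside those documents to produce the inner hits. This is an interpretation of the description above and not the implementation in the linked PR; the function name, the dot-product scoring, and the toy data are all assumptions.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_phase_search(query, docs, k, passages_per_doc):
    """docs maps doc_id to a list of (passage_text, vector) pairs."""
    # Phase 1: rank documents by the score of their nearest passage.
    doc_best = {
        doc_id: max(dot(query, vec) for _, vec in passages)
        for doc_id, passages in docs.items()
    }
    top_docs = sorted(doc_best, key=doc_best.get, reverse=True)[:k]

    # Phase 2: within each selected document, score and rank all passages.
    results = []
    for doc_id in top_docs:
        ranked = sorted(
            ((dot(query, vec), text) for text, vec in docs[doc_id]), reverse=True
        )
        results.append((doc_id, ranked[:passages_per_doc]))
    return results


# Hypothetical toy corpus.
docs = {
    "novel-1": [("p1", [1.0, 0.0]), ("p2", [0.9, 0.1]), ("p3", [0.0, 1.0])],
    "novel-2": [("q1", [0.5, 0.5])],
    "novel-3": [("r1", [0.0, 1.0])],
}

results = two_phase_search([1.0, 0.0], docs, k=2, passages_per_doc=2)
print(results)
```

The second phase is what would allow more than one relevant passage per document to be returned, which is the limitation raised above.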

ColeDCrawford commented 7 months ago

@benwtrent that seems promising! I have a project that I think would benefit greatly from this. It's a corpus of a few hundred thousand novels. Right now we are storing a ton of duplicated metadata because we chunked the novels manually into millions of chunks, and treated those passages as our ES documents. But it would make a lot of sense to use this so we can maintain the metadata at that top level and let users either find the most relevant passage or the most relevant book.

mbarretta commented 3 months ago

@madisonb Does the combo of https://github.com/elastic/elasticsearch/pull/99763 and https://github.com/elastic/elasticsearch/pull/104006 meet your need?

madisonb commented 3 months ago

The original request of mine stemmed from mirroring functionality that I typically see with other field types. The teams I currently work on and have worked on in the past do not use nested fields due to both performance and the potential massive bloat of having a large amount of nested docs for a single doc.

When parsing both text and images, I tend to shy away from leveraging nested fields due to incompatibility with visualizations in Kibana, performance, and the potential for storing duplicate data depending on the use case.

It would be much simpler if vector fields were first-class citizens and could be represented as independent elements of an array, arrays of objects, or other basic use cases (the matrix use case I think is a bit of a stretch, but I do understand it).

mbarretta commented 3 months ago

@giladgal ^^

elasticsearchmachine commented 1 month ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)
