elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[Feature Request] Support inner_hits.size > 1 for nested dense vector fields #102950

Closed dbuades closed 9 months ago

dbuades commented 11 months ago

Description

As discussed in this PR, adding this feature would be very useful for certain use cases.

For example, let's say that we have a scientific article with 4096 tokens that we split into 8 passages of 512 tokens for embedding purposes. The first passage might correspond to the abstract and the last passage to the conclusion.
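To make the numbers concrete, here is a minimal sketch of that split. The helper name `split_into_passages` and the placeholder tokens are made up for illustration; a real pipeline would use the embedding model's own tokenizer.

```python
def split_into_passages(tokens, passage_len=512):
    """Split a token list into consecutive passages of at most passage_len tokens."""
    return [tokens[i:i + passage_len] for i in range(0, len(tokens), passage_len)]


# Placeholder tokens standing in for a tokenized 4096-token article (assumption).
article_tokens = [f"tok{i}" for i in range(4096)]
passages = split_into_passages(article_tokens)

# passages[0] would hold the abstract, passages[-1] the conclusion.
print(len(passages), len(passages[0]))
```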

Now, a user makes a query that can be answered with that scientific article, and we want to match the passages inside that document that are most relevant. It may be that both the abstract and the conclusion are relevant, in which case we would like to show both.

This is possible with two separate indices, one for passages and one for articles. However, if we use the new search on nested dense_vector fields introduced in Elasticsearch 8.11 and keep a single article index, it is no longer possible, because inner_hits.size == 1 limits us to one passage per article.
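As a rough sketch of the kind of request this would enable, the snippet below only builds a search body as a plain Python dict. The field names `passages.embedding` and `passages.text`, the helper name, and the placement of `inner_hits` inside the `knn` clause are assumptions for illustration; whether a given Elasticsearch version accepts this shape or honors `size` > 1 should be checked against its documentation.

```python
def build_nested_knn_request(query_vector, k=10, num_candidates=100,
                             passages_per_article=3):
    """Build a search body asking for several matching passages per article.

    Hypothetical mapping: each article has a nested `passages` field holding
    `text` and a dense_vector `embedding`.
    """
    return {
        "knn": {
            "field": "passages.embedding",  # assumed nested dense_vector field
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
            # The feature requested here: more than one inner hit per article.
            "inner_hits": {
                "size": passages_per_article,
                "_source": False,
                "fields": ["passages.text"],  # assumed passage text field
            },
        }
    }


body = build_nested_knn_request([0.12, -0.03, 0.88], passages_per_article=2)
```

With a body like this, an article whose abstract and conclusion both match could return both passages under its inner hits, instead of only the single best one.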

CC @benwtrent

elasticsearchmachine commented 11 months ago

Pinging @elastic/es-search (Team:Search)

asxzy commented 10 months ago

It would be great to see this feature. In my use case, I index each document in chunks and want to retrieve the top-n chunks for RAG: essentially top-k nested chunks, aggregated by document ID. It would also be great if this worked with RRF.
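For illustration, this is roughly the client-side aggregation that is needed today when chunks live in their own index. It is a sketch under assumptions: the hit shape (`doc_id`, `chunk_id`, `score`) and the helper name are invented here and do not correspond to an Elasticsearch response format.

```python
from collections import defaultdict


def top_chunks_per_document(chunk_hits, n=3):
    """Group chunk hits by parent document and keep the n best chunks of each."""
    grouped = defaultdict(list)
    for hit in chunk_hits:
        grouped[hit["doc_id"]].append(hit)

    # Keep only the n highest-scoring chunks per document.
    ranked = {
        doc_id: sorted(hits, key=lambda h: h["score"], reverse=True)[:n]
        for doc_id, hits in grouped.items()
    }
    # Order documents by the score of their best chunk.
    return dict(
        sorted(ranked.items(), key=lambda kv: kv[1][0]["score"], reverse=True)
    )


# Hypothetical chunk-level hits, as if returned from a separate chunk index.
sample_hits = [
    {"doc_id": "a", "chunk_id": 0, "score": 0.91},
    {"doc_id": "b", "chunk_id": 4, "score": 0.95},
    {"doc_id": "a", "chunk_id": 7, "score": 0.89},
    {"doc_id": "a", "chunk_id": 3, "score": 0.40},
]
result = top_chunks_per_document(sample_hits, n=2)
```

Supporting inner_hits.size > 1 on nested vector fields would move this grouping into the search itself.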

xlianghang commented 10 months ago

This would be useful for my scenario, where we need both to structure the subdata and to score the relevance of each subdata item.

jasper-s commented 10 months ago

Very useful indeed; our use case would also benefit greatly from this addition.

Our use case is similar to the article example above. Our parent document is split into different subsections, each of which is stored individually as a nested document with embeddings. These subsections are not necessarily related to each other.

Upon searching, we'd like to retrieve not only the parent documents that contain one or more matching subsections (already supported), but also the exact subsections that matched (not yet supported). The latter would let us use a language model to summarize all the relevant subsections.