elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Pre-filter nested fields in knn queries #106994

Open ezorita opened 6 months ago

ezorita commented 6 months ago

Description

From the latest bugfixes in 8.13.0 and the documentation, it seems it is still not possible to pre-filter a knn search on nested fields. For example, given a mapping such as:

{
   "mappings":{
      "properties":{
         "title":{
            "type":"keyword"
         },
         "paragraphs":{
            "type":"nested",
            "properties":{
               "language":{
                  "type":"keyword"
               },
               "vector":{
                  "type":"dense_vector",
                  "dims":3,
                  "similarity":"cosine",
                  "index":true
               }
            }
         }
      }
   }
}

then searching the following would never match any document:

{
   "query":{
      "nested":{
         "path":"paragraphs",
         "query":{
            "knn":{
               "field":"paragraphs.vector",
               "query_vector":[
                  0.5,
                  0.5,
                  0.5
               ],
               "num_candidates":5,
               "filter":{
                  "bool":{
                     "must":[
                        {
                           "match":{
                              "paragraphs.language":"EN"
                           }
                        }
                     ]
                  }
               }
            }
         }
      }
   }
}

or

{
   "query":{
      "nested":{
         "path":"paragraphs",
         "query":{
            "knn":{
               "field":"paragraphs.vector",
               "query_vector":[
                  0.5,
                  0.5,
                  0.5
               ],
               "num_candidates":5,
               "filter":{
                  "nested":{
                     "path":"paragraphs",
                     "query":{
                        "bool":{
                           "must":[
                              {
                                 "match":{
                                    "paragraphs.language":"EN"
                                 }
                              }
                           ]
                        }
                     }
                  }
               }
            }
         }
      }
   }
}

In the current implementation the filter scope is limited to the parent document, thus limiting other functionalities of nested documents.

I wonder how hard it would be to extend the filter context to nested documents?

Thank you!

ppaanngggg commented 6 months ago

Yes, same problem as @ezorita; pre-filtering on nested fields is a key feature for me.

I think the best DSL for this case would look like this:

{
   "knn":{
      "query_vector": [0.5, 0.5, 0.5],
      "field": "paragraphs.vector",
      "k": 10,
      "num_candidates": 100,
      "filter": {
         "nested": {
            "path": "paragraphs",
            "query": {
               "term": {"paragraphs.language": "EN"}
            },
            "inner_hits": {
               "size": 5
            }
         }
      }
   }
}

xflashxx commented 5 months ago

Same for me

xflashxx commented 5 months ago

Any updates on whether this is doable at all? 🤗

mpaluch92 commented 4 months ago

I presume that this is doable, as Opensearch offers it as well: https://opensearch.org/docs/latest/search-plugins/knn/nested-search-knn/#k-nn-search-with-filtering-on-nested-fields

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)

GinoBerardelli commented 1 month ago

This would be great!

dstith-ip commented 1 month ago

Upvote for this feature!

gui11aume commented 1 month ago

Good feature.

mayya-sharipova commented 1 month ago

Thanks for highlighting the importance of this issue, we will look into it.

srgsol commented 1 month ago

Same problem for me. Great news!

joancf commented 1 month ago

Hi @ezorita and @mayya-sharipova. Going back to the example above, there are some queries that are feasible using the knn API but not the query API. The problem is the inner_hits of the nested documents. Inner hits returns the paragraph number of each selected paragraph in a document.

{
   "knn":{
      "field":"paragraphs.vector",
      "query_vector":[
         0.5,
         0.5,
         0.5
      ],
      "num_candidates":5,
      "filter":{
         "nested":{
            "path":"paragraphs",
            "query":{
               "match":{
                  "paragraphs.language":"EN"
               }
            }
         }
      },
      "inner_hits":{
         "_source":false,
         "fields":["paragraphs.number","paragraphs.language"]
      }
   }
}

This search will produce results, but we must be aware that it returns documents that have at least one EN paragraph, while inner_hits returns the best-matching paragraphs whether they are English or not. So, basically, the filter selects top-level documents based on the nested structure, but the search then runs over all of their nested elements.

So I think it must be thought through and clarified whether the knn considers all the nested documents of the filtered top-level documents, or only the 'inner_hits' produced by the query. For example, imagine that my top-level documents are trials, and the nested ones are the different documents generated in each trial, each having a dense vector. We could match on any document of those trials that contain a document called sentence and signed by judge J, or we could match only on the sentences.
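To make the two interpretations concrete, here is a minimal in-memory sketch in Python. It is not how Elasticsearch evaluates the query: it uses brute-force cosine similarity instead of an approximate index, and the toy data, the `best_paragraphs` helper, and the `filter_scope` parameter are all hypothetical names introduced only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical toy data shaped like the mapping in this issue.
DOCS = [
    {"title": "doc-1", "paragraphs": [
        {"number": 1, "language": "EN", "vector": [1.0, 0.0, 0.0]},
        {"number": 2, "language": "FR", "vector": [0.5, 0.5, 0.5]},
    ]},
    {"title": "doc-2", "paragraphs": [
        {"number": 1, "language": "FR", "vector": [0.5, 0.5, 0.4]},
    ]},
]


def best_paragraphs(docs, query_vector, language, filter_scope):
    """Return (title, paragraph number, language) of the best paragraph per parent.

    filter_scope="parent": keep parents that have at least one matching
    paragraph, then score ALL of their paragraphs.
    filter_scope="nested": score only the paragraphs that match the filter.
    """
    hits = []
    for doc in docs:
        matching = [p for p in doc["paragraphs"] if p["language"] == language]
        if not matching:
            continue  # the parent has no paragraph passing the filter
        candidates = doc["paragraphs"] if filter_scope == "parent" else matching
        best = max(candidates, key=lambda p: cosine(p["vector"], query_vector))
        hits.append((doc["title"], best["number"], best["language"]))
    return hits


print(best_paragraphs(DOCS, [0.5, 0.5, 0.5], "EN", "parent"))  # [('doc-1', 2, 'FR')]
print(best_paragraphs(DOCS, [0.5, 0.5, 0.5], "EN", "nested"))  # [('doc-1', 1, 'EN')]
```

With the parent-level scope, doc-1 passes the filter because it has an EN paragraph, yet the best-scoring paragraph is the FR one. With the nested-level scope, only the EN paragraph competes.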

For me, the filter should always start at the top level, and if it contains a nested clause with an inner_hits clause, then this nested level should be considered the document to match against; otherwise the top-level document is considered (the full trial).

Something like this:

            "filter":{
                  "nested": {
                      "path": "paragraphs",
                      "query": { "match": { "paragraphs.language": "EN" } }
                  }
            },

versus

            "filter":{
                  "nested": {
                      "path": "paragraphs",
                      "query": { "match": { "paragraphs.language": "EN" } },
                      "inner_hits": { "_source": false }
                  }
            },

ezorita commented 1 month ago

Hi @joancf, if I understand correctly, you are suggesting that filtering on nested fields works with the deprecated _knn_search API but not with the _search API (this issue).

I wonder whether the filter parameter in _knn_search is truly a pre-filter and not the usual post-filter. Can you confirm, @mayya-sharipova?

In any case, I agree it would be important to consider the inner_hits for the pre-filter DSL, otherwise one would need to recompute the similarities for all nested documents to figure out which one produced the best match.
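
For reference, the client-side recomputation mentioned above could look roughly like the Python sketch below. It assumes the nested paragraphs (including their vectors) are available in the returned `_source`, which may not be the case in every setup; `best_nested_match` is a hypothetical helper, not part of any Elasticsearch client.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def best_nested_match(paragraphs, query_vector, language=None):
    """Return the paragraph most similar to query_vector, or None.

    If language is given, only paragraphs in that language are considered,
    which emulates a pre-filter at the nested level.
    """
    candidates = [
        p for p in paragraphs
        if language is None or p["language"] == language
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: cosine(p["vector"], query_vector))


# Hypothetical "_source" of one returned parent document.
source = {
    "title": "doc-1",
    "paragraphs": [
        {"language": "EN", "vector": [1.0, 0.0, 0.0]},
        {"language": "FR", "vector": [0.5, 0.5, 0.5]},
    ],
}

match = best_nested_match(source["paragraphs"], [0.5, 0.5, 0.5], language="EN")
print(match["language"])  # EN
```

This costs one similarity computation per nested document of every returned hit, which is the overhead that inner_hits support in the pre-filter would avoid.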