elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

rank_features fields should support exists queries #98096

Open ioanatia opened 1 year ago

ioanatia commented 1 year ago

Description

When using an ML model that outputs sparse vectors (like ELSER), the inference pipeline can sometimes fail while indexing, so newly ingested documents are not enriched with the rank_features fields.

One approach in this case is to run an update by query that re-runs the ML pipeline on the documents in place, in the same index. We would want to update only the documents that do not have the rank_features fields, for example:

POST search-books/_update_by_query?pipeline=ml-inference-books
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "ml.inference.description_expanded.predicted_value"
        }
      }
    }
  }
}

In theory this is a much simpler approach than duplicating all the data by reindexing into another index, or reissuing bulk indexing requests from scratch. It only requires a single Elasticsearch API request.

The problem is that rank_features fields do not support exists queries:

{
  "error": {
    "root_cause": [
      {
        "type": "query_shard_exception",
        "reason": "failed to create query: [rank_features] fields do not support [exists] queries",
        "index_uuid": "WkxxefxdTuCiNeLy4b33jw",
        "index": "search-books"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "search-books",
        "node": "l5lxDFkeSwOyiwOEkQtWZA",
        "reason": {
          "type": "query_shard_exception",
          "reason": "failed to create query: [rank_features] fields do not support [exists] queries",
          "index_uuid": "WkxxefxdTuCiNeLy4b33jw",
          "index": "search-books",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "[rank_features] fields do not support [exists] queries"
          }
        }
      }
    ]
  },
  "status": 400
}

This also makes it difficult to get an accurate count of how many documents are missing the rank_features fields, so users have to rely on other fields or mechanisms to reindex in place with update by query.

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-search (Team:Search)

javanna commented 1 year ago

I believe this feature is something that @russcam and team may also be interested in.

jimczi commented 1 year ago

Absolutely on board with adding "exists" query support for the "rank_features" field! I noticed the issue talks about the inference pipeline not working as expected. Got me thinking – maybe there's a more effective way to spot those partial documents? How about relying on an error field? This could help us catch pipeline hiccups while still getting the document indexed. Just a thought!
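
A rough sketch of the error-field idea, not a tested setup. The field name ml_inference_error, the source field description, and the model ID .elser_model_1 are assumptions for illustration; the pipeline name, index name, and target field are taken from the example in the description. The inference processor records its failure message in a regular field instead of rejecting the document, and a remove processor clears any stale error when the pipeline is re-run:

```console
PUT _ingest/pipeline/ml-inference-books
{
  "processors": [
    {
      "remove": {
        "field": "ml_inference_error",
        "ignore_missing": true
      }
    },
    {
      "inference": {
        "model_id": ".elser_model_1",
        "target_field": "ml.inference.description_expanded",
        "field_map": {
          "description": "text_field"
        },
        "inference_config": {
          "text_expansion": {
            "results_field": "predicted_value"
          }
        },
        "on_failure": [
          {
            "set": {
              "field": "ml_inference_error",
              "value": "{{ _ingest.on_failure_message }}"
            }
          }
        ]
      }
    }
  ]
}

POST search-books/_update_by_query?pipeline=ml-inference-books
{
  "query": {
    "exists": {
      "field": "ml_inference_error"
    }
  }
}
```

Because the hypothetical ml_inference_error field would be an ordinary text/keyword field, exists queries work on it, so the same query body could also be sent to search-books/_count to see how many documents still need enrichment. This only catches documents that went through the pipeline and failed; documents indexed without the pipeline would not carry the error field.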

benwtrent commented 1 year ago

There are two things here for "exists" on "rank_features"

I think @russcam and co. want "exists on a rank_features feature", not just on the field itself.

russcam commented 1 year ago

That's correct @benwtrent, the latter is what we would be interested in

benwtrent commented 8 months ago

Quick update here, while we don't support exists for a specific feature, the underlying logic would probably be the same as a term query as we cannot infer if a feature exists or not without either:

@russcam I know this is "late to the party", but to check if a feature "exists" or not, you can do a term query against it. So, you can then filter the query against a particular feature and then score via the rank_features if you wish.

So, to find feature bar in rank_features field foo, you could do:

"query": {
  "term": {
    "foo": "bar"
  }
}

This will filter for all docs where the rank_features field foo has the particular feature bar.
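
As a sketch of the "filter, then score" combination mentioned above: the index name my-index is hypothetical, foo and bar are the placeholder names from the example, and this assumes a version where term queries on rank_features fields are supported. The term clause runs in filter context (no scoring), and the rank_feature clause scores the matching docs by the value of that feature:

```console
GET my-index/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "foo": "bar"
        }
      },
      "should": {
        "rank_feature": {
          "field": "foo.bar"
        }
      }
    }
  }
}
```

Dropping the should clause leaves a pure existence filter for the feature, which can be combined with any other query.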

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)