elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Using search-after with Point-in-time occasionally failing with "No search context found for id" error #102753

Open harshvikramsingh opened 11 months ago

harshvikramsingh commented 11 months ago

Elasticsearch Version

7.17.9

Installed Plugins

No response

Java Version

11.0.16 Open JDK

OS Version

Linux 5.4.254-170.358.amzn2.x86_64

Problem Description

We are using search_after with the point-in-time (PIT) API (https://www.elastic.co/guide/en/elasticsearch/reference/7.17/paginate-search-results.html#search-after) to iterate over all the records in our indices. Details of our setup:

  1. Our indices are time-series based, with around 10-15 indices created daily (rollover criteria: max_size 100G or max_age 1 day).
  2. Each index has 5 primary shards and 1 replica of each, totalling 10 shards. This gives us around 1 to 1.5 TB of primary data per day.
  3. In addition, we use Hot-Warm-Cold tiering (0-30 days Hot, 31-45 days Warm, >45 days Cold). Hot nodes have 2TB disks, while Warm and Cold nodes have 16TB disks. We have a 1-year data retention policy.
  4. Each Elasticsearch data node has 64GB RAM and 32 cores. JVM heap size is set to 32 GB.
  5. Mapping-wise, our index has 25 fields. One is a timestamp field, which we sort on in the PIT search. The rest are keyword, text and boolean fields.

Partway through a PIT search, after iterating over some records, searches fail with a "No search context found for id" error. When this happens, we are forced to restart the search for the given day from scratch, which is time-consuming given the size of our data. We verified that there were no node restarts or crashes in between and that the cluster was GREEN throughout the entire search. We are also able to fetch and process each page of results within the PIT keep_alive of 5 minutes.

  1. Could the PIT search context be affected by shard rebalancing or allocation (since we have data tiering, and every day some indices transition Hot->Warm->Cold)?
  2. Is there any other reason for a search context to be cleared midway through a PIT search?
  3. Is it possible to retry the PIT search with a new PIT ID and resume from where it left off, using the last timestamp + _shard_doc sort values? The documentation says _shard_doc is bound to the PIT, so we are not sure whether the last _shard_doc value is usable with a new PIT ID.
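To make question 3 concrete, here is a minimal Python sketch of one possible way to resume without reusing the old `_shard_doc` value. It is not a confirmed or recommended approach. It assumes that `timestamp` is a `date` field whose sort values are epoch milliseconds, that the index is no longer being written to, and that the client has recorded the `(_index, _id)` pairs already processed at the last seen timestamp. The function names `build_resume_body` and `drop_already_processed` are hypothetical.

```python
def build_resume_body(new_pit_id, last_timestamp, page_size=10000, keep_alive="5m"):
    """Build a search body that restarts from last_timestamp on a NEW PIT.

    Instead of passing the old _shard_doc value in search_after, the query
    filters on timestamp >= last_timestamp, so documents at the boundary
    timestamp are returned again and must be de-duplicated by the caller.
    """
    return {
        "track_total_hits": False,
        "size": page_size,
        "pit": {"id": new_pit_id, "keep_alive": keep_alive},
        "query": {
            "range": {
                # Assumption: last_timestamp is an epoch-millis sort value.
                "timestamp": {"gte": last_timestamp, "format": "epoch_millis"}
            }
        },
        "sort": [
            {"timestamp": {"order": "asc"}},
            {"_shard_doc": {"order": "asc"}},
        ],
    }


def drop_already_processed(hits, last_timestamp, seen_at_boundary):
    """Remove hits at the boundary timestamp that were already processed.

    seen_at_boundary is a set of (_index, _id) pairs recorded by the client
    before the failure (an assumption of this sketch).
    """
    return [
        hit
        for hit in hits
        if not (
            hit["sort"][0] == last_timestamp
            and (hit["_index"], hit["_id"]) in seen_at_boundary
        )
    ]
```

Subsequent pages on the new PIT would use `search_after` with sort values returned by that new PIT only, so the old `_shard_doc` is never reused.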

Since scroll is no longer recommended for large data sets, we are using PIT search.

Steps to Reproduce

  1. Set up a cluster with 10 to 15 indices per day (5 primary shards and 1 replica of each), each index holding 100GB of primary data.
  2. Create a new PIT on the daily index pattern, e.g. myindex-2023.10.04-: `curl --silent --output "${search_out}" --write-out "%{http_code}" -H 'Content-Type: application/json' -XPOST "${source_es_host}:${ELASTICSEARCH_PORT}/myindex-2023.10.04-/_pit?keep_alive=5m"`
  3. Search query used in PIT search: {"track_total_hits":false,"size":10000,"pit":{"id":"'${pit}'","keep_alive":"5m"},"sort":[{"timestamp":{"order": "asc"}},{"_shard_doc":{"order":"asc"}}]}
  4. Set allow_partial_search_results=false on the search request, since we want to iterate over ALL the records without missing any data. We do a count validation at the end of the search.
  5. We are able to fetch and process each page of results within the PIT keep_alive of 5 minutes, and the ES response time is 5-10 seconds at most. We still get the "No search context found for id" error shown below.
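For reference, the paging logic in the steps above can be sketched in Python roughly as follows. This is an illustrative reconstruction, not our production script: the helper names `build_search_body` and `next_page_state` are hypothetical, and sending the request (for example with curl, as in step 2) is left out so that the snippet runs without a cluster.

```python
def build_search_body(pit_id, search_after=None, page_size=10000, keep_alive="5m"):
    """Build the body from step 3, adding search_after for pages after the first."""
    body = {
        "track_total_hits": False,
        "size": page_size,
        # Each request extends the PIT by keep_alive.
        "pit": {"id": pit_id, "keep_alive": keep_alive},
        "sort": [
            {"timestamp": {"order": "asc"}},
            {"_shard_doc": {"order": "asc"}},
        ],
    }
    if search_after is not None:
        # Sort values of the last hit of the previous page.
        body["search_after"] = search_after
    return body


def next_page_state(response, current_pit_id):
    """Return (pit_id, search_after) for the next request.

    search_after is None when the page is empty, i.e. iteration is finished.
    The PIT ID from the response is preferred, since it may change between
    requests.
    """
    hits = response.get("hits", {}).get("hits", [])
    pit_id = response.get("pit_id", current_pit_id)
    if not hits:
        return pit_id, None
    return pit_id, hits[-1]["sort"]
```

The loop starts with `build_search_body(pit)` and then repeatedly feeds the result of `next_page_state` into the next `build_search_body` call until `search_after` comes back as `None`.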

Logs (if relevant)

{"error":{"root_cause":[{"type":"search_context_missing_exception","reason":"No search context found for id [7832]"},{"type":"search_context_missing_exception","reason":"No search context found for id [7621]"},{"type":"search_context_missing_exception","reason":"No search context found for id [798646]"},{"type":"search_context_missing_exception","reason":"No search context found for id [803118]"}],"type":"search_phase_execution_exception","reason":"Partial shards failure","phase":"fetch","grouped":true,"failed_shards":[{"shard":0,"index":"myindex-2023.10.04-000813","node":"OoT9xTmjTQSqg2mGlg-6JA","reason":{"type":"search_context_missing_exception","reason":"No search context found for id [7832]"}},{"shard":2,"index":"myindex-2023.10.04-000813","node":"7BdlfG1tRyuY68H8WDpu3w","reason":{"type":"search_context_missing_exception","reason":"No search context found for id [7621]"}},{"shard":3,"index":"myindex-2023.10.04-000813","node":"MXxNhSU_SGCHEuOCfq6UuA","reason":{"type":"search_context_missing_exception","reason":"No search context found for id [798646]"}},{"shard":4,"index":"myindex-2023.10.04-000813","node":"FIuWsiNASQu71bbEL2SQxg","reason":{"type":"search_context_missing_exception","reason":"No search context found for id [803118]"}}]},"status":404}
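A small Python sketch of how a client could recognise this failure and list the affected shards, for example to decide whether to retry. It relies only on the fields visible in the response above; the function name `missing_context_shards` is hypothetical, and the sample is trimmed to a single failed shard.

```python
def missing_context_shards(error_response):
    """Return (index, shard, node) for each shard that lost its search context."""
    error = error_response.get("error", {})
    shards = []
    for failure in error.get("failed_shards", []):
        reason = failure.get("reason", {})
        if reason.get("type") == "search_context_missing_exception":
            shards.append((failure["index"], failure["shard"], failure["node"]))
    return shards


# Trimmed sample taken from the response above (one failed shard only).
sample = {
    "error": {
        "type": "search_phase_execution_exception",
        "phase": "fetch",
        "failed_shards": [
            {
                "shard": 0,
                "index": "myindex-2023.10.04-000813",
                "node": "OoT9xTmjTQSqg2mGlg-6JA",
                "reason": {
                    "type": "search_context_missing_exception",
                    "reason": "No search context found for id [7832]",
                },
            }
        ],
    },
    "status": 404,
}

lost = missing_context_shards(sample)
```

In the full response above, shards 0, 2, 3 and 4 of myindex-2023.10.04-000813 are reported, each on a different node.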

elasticsearchmachine commented 11 months ago

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)