elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

(HealthAPI) Oscillating report for ILM Health #113553

Open stefnestor opened 5 days ago

stefnestor commented 5 days ago

Elasticsearch Version

8.15.1

Installed Plugins

NA

Java Version

bundled

OS Version

ESS

Problem Description

👋🏽 howdy, team! While recording Monitoring ILM Health on Elastic Cloud, we noticed that the ES Health Report API oscillates between reporting and not reporting ongoing ILM problems (presumably because of ILM's polling).

During our 40min video, we couldn't get the Health Report to acknowledge our three problematic indices at all. I left the cluster alone for 1d, after which it did report them, but the report still oscillates in a way that appears tied to the ILM poll interval, even though the index never leaves step:ERROR.

Video on short poll interval for demonstration purposes:

https://github.com/user-attachments/assets/1cbf805f-3e91-4290-bead-3a3297978685
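
For anyone wanting a similarly short poll interval, here is a minimal sketch of the request that could set it, assuming the standard `indices.lifecycle.poll_interval` cluster setting (default `10m`). The `10s` value, `base_url`, and `api_key` are placeholders of mine; the interval actually used in the video is not stated here.

```python
import json
import urllib.request


def poll_interval_request(base_url, api_key, interval="10s"):
    """Build (but do not send) a PUT _cluster/settings request for the ILM poll interval.

    base_url, api_key and the "10s" default are illustrative assumptions.
    """
    body = {"persistent": {"indices.lifecycle.poll_interval": interval}}
    return urllib.request.Request(
        base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "ApiKey " + api_key,
            "Content-Type": "application/json",
        },
        method="PUT",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Setting the value back to `null` should restore the default.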

Note that ESS responds with HTTP 404 on DEPLOYMENT_URL/health/ilm while the ILM indicator is not reporting an issue, although a page refresh will sometimes load an empty-report page. This page loads from the ES Health Report API, so the highlight is that ESS shows unexpected behavior in response to an unexpected response from ES.

Steps to Reproduce

Kindly see Monitoring ILM Health on Elastic Cloud for more context if my write-up's insufficient, but TL;DR:

  1. Put an index into a permanent ILM rollover failure
    # GET heytheredelilah-bad/_ilm/explain
    { "indices": {"heytheredelilah-bad": {
      "index": "heytheredelilah-bad",
      "managed": true,
      "policy": "summerflowers",
      "index_creation_date_millis": 1726964679826,
      "time_since_index_creation": "3.68d",
      "lifecycle_date_millis": 1726964679826,
      "age": "3.68d",
      "phase": "hot",
      "phase_time_millis": 1727282808063,
      "action": "rollover",
      "action_time_millis": 1726964680121,
      "step": "ERROR",
      "step_time_millis": 1727282818067,
      "failed_step": "check-rollover-ready",
      "is_auto_retryable_error": true,
      "failed_step_retry_count": 15906,
      "step_info": {
        "type": "illegal_argument_exception",
        "reason": "index [heytheredelilah-bad] is not the write index for alias [heytheredelilah]" },
      "phase_execution": {
        "policy": "summerflowers",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "min_docs": 1,
              "max_primary_shard_docs": 200000000,
              "max_docs": 1 } } },
        "version": 6,
        "modified_date_in_millis": 1726958004002
    } } } }
  2. Poll GET _health_report?filter_path=indicators.ilm and see that it oscillates between reporting the issue and not:

    # expected
    {"indicators": {"ilm": {
      "status": "yellow",
      "symptom": "An index has stayed on the same action longer than expected.",
      "details": {
        "stagnating_indices_per_action": {"allocate": 0, "shrink": 0, "searchable_snapshot": 0, "rollover": 1, "forcemerge": 0, "delete": 0, "migrate": 0 },
        "policies": 39,
        "stagnating_indices": 1,
        "ilm_status": "RUNNING"
      },
      "impacts": [{"id": "elasticsearch:health:ilm:impact:stagnating_index", "severity": 3, "description": "Automatic index lifecycle and data retention management cannot make progress on one or more indices. The performance and stability of the indices and/or the cluster could be impacted.", "impact_areas": ["deployment_management" ] } ],
      "diagnosis": [
        {
          "id": "elasticsearch:health:ilm:diagnosis:stagnating_action:rollover",
          "cause": "Some indices have been stagnated on the action [rollover] longer than the expected time.",
          "action": "Check the current status of the Index Lifecycle Management for every affected index using the [GET /<affected_index_name>/_ilm/explain] API. Please replace the <affected_index_name> in the API with the actual index name.",
          "help_url": "https://ela.st/ilm-explain",
          "affected_resources": {"ilm_policies": ["summerflowers" ], "indices": ["heytheredelilah-bad" ] }
        } ]
    } } }
    
    # bug
    {"indicators": {"ilm": {
      "status": "green",
      "symptom": "Index Lifecycle Management is running",
      "details": {"policies": 39, "stagnating_indices": 0, "ilm_status": "RUNNING" }
    } } }
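
The two steps above can be scripted to make the oscillation measurable. This is a rough sketch rather than the tooling used for the video: it samples `_ilm/explain` and `_health_report` together and counts how often the ILM indicator status flips while the index is still in step ERROR. `base_url`, `api_key`, the sample count, and the delay are all assumed placeholders.

```python
import json
import time
import urllib.request


def ilm_status(report):
    """Return the ILM indicator status ("green", "yellow", ...) from a health report dict."""
    return report["indicators"]["ilm"]["status"]


def affected_indices(report):
    """Collect the indices named in the ILM indicator's diagnosis entries, if any."""
    found = set()
    for diagnosis in report["indicators"]["ilm"].get("diagnosis", []):
        found.update(diagnosis.get("affected_resources", {}).get("indices", []))
    return found


def error_step_indices(explain):
    """Return the indices that an _ilm/explain response shows in step ERROR."""
    return {name for name, info in explain["indices"].items() if info.get("step") == "ERROR"}


def count_flips(statuses):
    """Count how often consecutive samples differ; anything above 0 is an oscillation."""
    return sum(1 for prev, cur in zip(statuses, statuses[1:]) if prev != cur)


def get_json(base_url, path, api_key):
    """GET a JSON document from the cluster using API-key auth (placeholder credentials)."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": "ApiKey " + api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def poll(base_url, api_key, index, samples=20, delay_seconds=30):
    """Sample the ILM indicator and count flips, keeping only samples taken while index is in ERROR."""
    statuses = []
    for _ in range(samples):
        explain = get_json(base_url, "/" + index + "/_ilm/explain", api_key)
        report = get_json(base_url, "/_health_report?filter_path=indicators.ilm", api_key)
        if index in error_step_indices(explain):
            statuses.append(ilm_status(report))
        time.sleep(delay_seconds)
    return statuses, count_flips(statuses)
```

For example, `poll(base_url, api_key, "heytheredelilah-bad")` returns the sampled statuses and the flip count. With the index permanently in ERROR, I would expect zero flips, so any non-zero count reproduces the behaviour described here.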

Logs (if relevant)

No response

elasticsearchmachine commented 5 days ago

Pinging @elastic/es-data-management (Team:Data Management)