elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[HealthAPI] Warn when ILM is not making progress #93859

Open andreidan opened 1 year ago

andreidan commented 1 year ago

Description

ILM will wait for conditions to be fulfilled before making progress, e.g. it waits for indices to be GREEN in the wait-for-active-shards step.

It'd be useful for the health API to signal if ILM is not making progress. We could derive a configurable heuristic based on the time ILM entered a step, available in the ILM execution state (and visible via the explain API under step_time) and signal a YELLOW status if ILM is idling in a wait-* step for more than 24 (or 48?) hours.
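
For illustration, here's a minimal sketch of that heuristic driven from outside the cluster via the _ilm/explain API (the host, the index pattern, and the exact 24h threshold are assumptions here; the real indicator would of course run inside Elasticsearch rather than over REST):

# Sketch only: approximate the proposed heuristic via the ILM explain API.
import time
import requests

ES = "http://localhost:9200"             # assumption: local dev cluster
THRESHOLD_MILLIS = 24 * 60 * 60 * 1000   # proposed 24h threshold

resp = requests.get(f"{ES}/*/_ilm/explain", params={"only_managed": "true"}).json()
now_millis = int(time.time() * 1000)

for index, info in resp.get("indices", {}).items():
    step = info.get("step", "")
    step_time = info.get("step_time_millis", now_millis)
    # Flag indices that have been idling in a wait-* step longer than the threshold.
    if step.startswith("wait-") and now_millis - step_time > THRESHOLD_MILLIS:
        hours = (now_millis - step_time) // 3_600_000
        print(f"YELLOW candidate: {index} has been in step [{step}] for ~{hours}h")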

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-data-management (Team:Data Management)

HiDAl commented 1 year ago

Looking at the current code used by the /<index>/_ilm/explain endpoint, the status is calculated on the master node using information from ClusterState.IndexMetadata. This gives the user the freshest information about the status of the indices. Since the complexity of this calculation is linear in the number of indices, it increases the load on the master node.

In the case of the ILM health indicator, this information would have to be calculated for each index managed by an ILM policy. So, we have 2 options here:

  1. Use the same approach as the ILM explain endpoint and always return the freshest ILM information from the master node. As stated before, for this indicator we'd have to run the calculation over the total number of indices managed by an ILM policy, which could generate extra load on the master node for big clusters.
  2. Use the coordinating node to calculate the status of the indicator. The tradeoff is that the information could lag behind the _ilm/explain endpoint, but the load would be spread across the cluster's nodes.

Since the status is calculated based on the time spent in a phase/action/step, which could be on the order of hours or days, I think it's safe to go with the second option.

/cc @andreidan

--

Thanks @gmarouli for helping me to understand the different approaches

HiDAl commented 1 year ago

I've been thinking about the fact that we have to traverse the whole list of indices in the cluster, which could be a heavy task. Since this is an indicator, we could take advantage of the verbose option and short-circuit the execution of the indicator as soon as we find any issue with ILM when verbose = false, then display a generic enough detail, like "there are issues with your ILM policies, check the _ilm/explain endpoint". If the user passes verbose = true, the details would include information about the specific indices with issues.
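
A rough sketch of that control flow, with purely hypothetical names (the real thing would be a Java HealthIndicatorService inside Elasticsearch, not this standalone function):

# Hypothetical sketch of the verbose short-circuit idea; names and shapes
# are made up for illustration, not the actual HealthIndicatorService API.
def calculate_ilm_indicator(managed_indices, is_stuck, verbose):
    if not verbose:
        # Short-circuit on the first stuck index and return a generic symptom.
        if any(is_stuck(index) for index in managed_indices):
            return {
                "status": "yellow",
                "symptom": "There are issues with your ILM policies; "
                           "check the _ilm/explain endpoint.",
            }
        return {"status": "green", "symptom": "ILM is making progress."}

    # verbose = true: traverse everything and report the specific indices.
    stuck = [index["name"] for index in managed_indices if is_stuck(index)]
    if not stuck:
        return {"status": "green", "symptom": "ILM is making progress."}
    return {
        "status": "yellow",
        "symptom": "Some indices have been stuck on the same action longer than expected.",
        "affected_resources": {"indices": stuck},
    }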

HiDAl commented 1 year ago

About the last comment: we already follow the approach of not showing all the diagnosis information when the verbose parameter is false in the ShardsAvailabilityHealthIndicatorService indicator. About short-circuiting the execution, I'm not fully sure yet.

andreidan commented 1 year ago

Regarding: https://github.com/elastic/elasticsearch/issues/93859#issuecomment-1538064504

@HiDAl we can use the health coordinator's local ClusterState to look into the ILM explain information (without reaching out to the master node).

We have the master_is_stable indicator that, when NOT GREEN, will turn all other indicators to unknown. This is precisely because the local information on the health coordinating node is probably not sufficiently up-to-date.

+1 to use the local cluster state.

I believe a grouping of sorts is needed when reporting the problems. Maybe per ILM step or ILM action?
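
For instance, the stuck indices could be bucketed per (policy, action) pair before building the diagnosis entries, roughly like this (hypothetical names; it just mirrors the shape of the sample output below):

# Hypothetical sketch: group stuck indices by (policy, action) so that each
# diagnosis entry covers one policy/action pair, as in the sample output below.
from collections import defaultdict

def group_stuck_indices(stuck_indices):
    # stuck_indices: list of dicts with "index", "policy" and "action" keys
    groups = defaultdict(list)
    for entry in stuck_indices:
        groups[(entry["policy"], entry["action"])].append(entry["index"])
    return [
        {
            "id": f"elasticsearch:health:ilm:diagnosis:stuck_action:{action}",
            "cause": f"Some indices managed by the policy [{policy}] have been "
                     f"stuck on the action [{action}] longer than expected.",
            "affected_resources": {"indices": indices},
        }
        for (policy, action), indices in groups.items()
    ]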

HiDAl commented 1 year ago

Tentative example of the possible output:

GET _health_report/ilm?
{
  "cluster_name": "runTask",
  "indicators": {
    "ilm": {
      "status": "yellow",
      "symptom": "Some indices have been stuck on the same action longer than expected.",
      "details": {
        "stuck_indices": 2,
        "stuck_indices_per_action": {
          "downsample": 0,
          "allocate": 0,
          "shrink": 0,
          "searchable_snapshot": 0,
          "rollover": 2,
          "forcemerge": 0,
          "delete": 0,
          "migrate": 0
        },
        "policies": 19,
        "ilm_status": "RUNNING"
      },
      "impacts": [
        {
          "id": "elasticsearch:health:ilm:impact:index_stuck",
          "severity": 3,
          "description": "Some indices have been longer than expected on the same Index Lifecycle Management action. The performance and stability of the cluster could be impacted.",
          "impact_areas": [
            "ingest",
            "search"
          ]
        }
      ],
      "diagnosis": [
        {
          "id": "elasticsearch:health:ilm:diagnosis:stuck_action:rollover",
          "cause": "Some indices managed by the policy [some-policy-2] have been stuck on the action [rollover] longer than the expected time [time spent in action: 2s].",
          "action": "Check the current status of the Index Lifecycle Management service using the [/_ilm/explain] API.",
          "help_url": "https://ela.st/ilm-explain",
          "affected_resources": {
            "indices": [
              ".ds-policy2-1-2023.05.11-000003"
            ]
          }
        },
        {
          "id": "elasticsearch:health:ilm:diagnosis:stuck_action:rollover",
          "cause": "Some indices managed by the policy [some-policy] have been stuck on the action [rollover] longer than the expected time [time spent in action: 2s].",
          "action": "Check the current status of the Index Lifecycle Management service using the [/_ilm/explain] API.",
          "help_url": "https://ela.st/ilm-explain",
          "affected_resources": {
            "indices": [
              ".ds-test-001-2023.05.11-000003"
            ]
          }
        }
      ]
    }
  }
}