andreidan opened this issue 1 year ago
Pinging @elastic/es-data-management (Team:Data Management)
Checking the current code used by the /<index>/_ilm/explain endpoint, the status is calculated on the master node using information coming from the ClusterState.IndexMetadata. This gives the user the freshest information about the status of the indices. Since the complexity of this calculation is linear in the number of indices, it increases the load on the master node.
In the case of the current ILM Health Indicator, this information will be calculated for each index managed by an ILM policy. So, we have two options here:

1. Follow the same approach as the ILM explain endpoint and always return the freshest ILM information from the master node. As stated before, for this indicator we'll have to do this calculation for the total number of indices managed by an ILM policy, which could generate extra load on the master node for big clusters.
2. Calculate the status locally on each node. The information might not be as fresh as the _ilm/explain endpoint's, but the total load will be spread across all the cluster's nodes.

Since the status is calculated based on the time spent in a phase/action/step, which could be in the order of hours/days, I think it's safe to go with the second option.
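For context, here is a minimal sketch of what option 2 could look like when computed from a node-local view of the cluster state. The `ManagedIndex` record, the 24h threshold, and the class names are illustrative assumptions, not the actual Elasticsearch types:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical, simplified view of the per-index ILM execution state that the
// indicator would read from the cluster state; not the real Elasticsearch classes.
record ManagedIndex(String name, String policy, String action, Instant stepTime) {}

enum IlmHealthStatus { GREEN, YELLOW }

class IlmStuckScan {
    // The "stuck" threshold would be configurable; 24h is just a placeholder.
    static final Duration STUCK_THRESHOLD = Duration.ofHours(24);

    // Linear scan over all managed indices: this is the work that either the master
    // node (option 1) or every health-coordinating node (option 2) has to perform.
    static IlmHealthStatus evaluate(List<ManagedIndex> managedIndices, Instant now) {
        boolean anyStuck = managedIndices.stream()
            .anyMatch(i -> Duration.between(i.stepTime(), now).compareTo(STUCK_THRESHOLD) > 0);
        return anyStuck ? IlmHealthStatus.YELLOW : IlmHealthStatus.GREEN;
    }
}
```

With this shape, choosing option 1 or option 2 only changes where `evaluate` runs and which cluster state it reads; the calculation itself is the same.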
/cc @andreidan
--
Thanks @gmarouli for helping me to understand the different approaches
I've been thinking about the fact that we have to traverse the whole list of indices in the cluster, and that could be a heavy task. Since this is an indicator, we could take advantage of the verbose option and short-circuit the execution of the indicator: in case we find any issue with ILM and verbose = false, we display a generic enough detail, like "there are issues with your ILM policies, check the _ilm/explain endpoint" or so. Now, in case the user passes verbose = true, the details would have information about the specific indices with issues.
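A rough sketch of that short-circuit idea, reusing the hypothetical `ManagedIndex` record from the earlier snippet. The `Result` record and the symptom strings are stand-ins for the real health indicator result, shown only to illustrate the verbose = false vs. verbose = true behaviour:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Illustrative only; the real indicator would build a proper health indicator result,
// but the short-circuit idea is the same.
class VerboseAwareIlmIndicator {
    static final Duration STUCK_THRESHOLD = Duration.ofHours(24);

    record Result(String status, String symptom, List<String> affectedIndices) {}

    static Result calculate(List<ManagedIndex> managedIndices, boolean verbose, Instant now) {
        if (verbose == false) {
            // Short-circuit: stop at the first stuck index and return a generic detail.
            boolean anyStuck = managedIndices.stream()
                .anyMatch(i -> Duration.between(i.stepTime(), now).compareTo(STUCK_THRESHOLD) > 0);
            return anyStuck
                ? new Result("yellow", "There are issues with your ILM policies, check the _ilm/explain endpoint.", List.of())
                : new Result("green", "ILM is making progress.", List.of());
        }
        // verbose == true: traverse everything and report the specific indices with issues.
        List<String> stuck = managedIndices.stream()
            .filter(i -> Duration.between(i.stepTime(), now).compareTo(STUCK_THRESHOLD) > 0)
            .map(ManagedIndex::name)
            .toList();
        return stuck.isEmpty()
            ? new Result("green", "ILM is making progress.", List.of())
            : new Result("yellow", "Some indices have been stuck on the same action longer than expected.", stuck);
    }
}
```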
About the last comment: we are already following the approach of not showing all the diagnosis information when the verbose parameter is false in the ShardsAvailabilityHealthIndicatorService indicator.
About short-circuiting the execution, I'm not fully sure yet.
Regarding: https://github.com/elastic/elasticsearch/issues/93859#issuecomment-1538064504
@HiDAl we can use the health coordinator's local ClusterState to look into the ILM explain information (without reaching out to the master node).
We have the master_is_stable indicator that, when NOT GREEN, will turn all other indicators to unknown. This is precisely because the local information on the health coordinating node is probably not sufficiently up-to-date.
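Putting the two comments together, a hypothetical sketch (not the actual indicator code) of how the ILM indicator could rely on the node-local cluster state while a master-stability guard covers the staleness concern:

```java
import java.time.Instant;
import java.util.List;

// Hypothetical glue code: reading the node-local cluster state is acceptable because a
// master-stability guard already downgrades the result to "unknown" when that local
// view may be stale.
class GuardedIlmIndicator {
    static VerboseAwareIlmIndicator.Result evaluate(boolean masterIsStable,
                                                    List<ManagedIndex> localView,
                                                    boolean verbose,
                                                    Instant now) {
        if (masterIsStable == false) {
            // Mirrors the master_is_stable behaviour described above.
            return new VerboseAwareIlmIndicator.Result(
                "unknown", "Could not determine ILM health: the master node is not stable.", List.of());
        }
        return VerboseAwareIlmIndicator.calculate(localView, verbose, now);
    }
}
```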
+1 to use the local cluster state.
I believe a grouping of sorts is needed when reporting the problems. Maybe per ILM step or ILM action?
Tentative example of the possible output:
GET _health_report/ilm?
{
  "cluster_name": "runTask",
  "indicators": {
    "ilm": {
      "status": "yellow",
      "symptom": "Some indices have been stuck on the same action longer than expected.",
      "details": {
        "stuck_indices": 2,
        "stuck_indices_per_action": {
          "downsample": 0,
          "allocate": 0,
          "shrink": 0,
          "searchable_snapshot": 0,
          "rollover": 2,
          "forcemerge": 0,
          "delete": 0,
          "migrate": 0
        },
        "policies": 19,
        "ilm_status": "RUNNING"
      },
      "impacts": [
        {
          "id": "elasticsearch:health:ilm:impact:index_stuck",
          "severity": 3,
          "description": "Some indices have been longer than expected on the same Index Lifecycle Management action. The performance and stability of the cluster could be impacted.",
          "impact_areas": [
            "ingest",
            "search"
          ]
        }
      ],
      "diagnosis": [
        {
          "id": "elasticsearch:health:ilm:diagnosis:stuck_action:rollover",
          "cause": "Some indices managed by the policy [some-policy-2] have been stuck on the action [rollover] longer than the expected time [time spent in action: 2s].",
          "action": "Check the current status of the Index Lifecycle Management service using the [/_ilm/explain] API.",
          "help_url": "https://ela.st/ilm-explain",
          "affected_resources": {
            "indices": [
              ".ds-policy2-1-2023.05.11-000003"
            ]
          }
        },
        {
          "id": "elasticsearch:health:ilm:diagnosis:stuck_action:rollover",
          "cause": "Some indices managed by the policy [some-policy] have been stuck on the action [rollover] longer than the expected time [time spent in action: 2s].",
          "action": "Check the current status of the Index Lifecycle Management service using the [/_ilm/explain] API.",
          "help_url": "https://ela.st/ilm-explain",
          "affected_resources": {
            "indices": [
              ".ds-test-001-2023.05.11-000003"
            ]
          }
        }
      ]
    }
  }
}
Description
ILM will wait for conditions to be fulfilled before making progress, e.g. it waits for indices to be GREEN in the wait-for-active-shards step.

It'd be useful for the health API to signal if ILM is not making progress. We could derive a configurable heuristic based on the time ILM entered a step, available in the ILM execution state (and visible via the explain API under step_time), and signal a YELLOW status if ILM is idling in a wait-* step for more than 24 (or 48?) hours.
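As a rough illustration of that heuristic (the class name, method name, and the 24h value are assumptions, not the actual implementation):

```java
import java.time.Duration;
import java.time.Instant;

// A minimal sketch of the heuristic described above; the real indicator would make
// the threshold configurable.
class IlmProgressHeuristic {
    static final Duration MAX_TIME_IN_WAIT_STEP = Duration.ofHours(24); // or 48?

    /**
     * True if this index should contribute to a YELLOW status: it sits in a wait-*
     * step and entered that step (step_time) longer ago than the threshold.
     */
    static boolean isIdling(String currentStep, Instant stepTime, Instant now) {
        return currentStep != null
            && currentStep.startsWith("wait-")
            && Duration.between(stepTime, now).compareTo(MAX_TIME_IN_WAIT_STEP) > 0;
    }
}
```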