Open ravikesarwani opened 2 years ago
I'll also note for the public record that ILM Searchable Snapshots coming up on the Frozen tier can blip the cluster to status:red with no action required by Dev on-calls. For example, Elasticsearch logs the following per index hitting phase/action/step frozen/searchablesnapshots/mount-snapshot (or maybe wait-for-index-color; sorry, these happen really close together so I can't fully tell):
Cluster health status changed from [YELLOW] to [RED] (reason: [snapshot shard size updated]).
Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[partial-restored-my_index-2023.10.31-000001][0]]]).
This would be resolved by the stretch goal in the description:
we should look at the possibility of "look at last X minutes of data and alert only when we see all of them to be the same status" rather than just relying on the last document status.
Which kinda overlaps with https://github.com/elastic/kibana/issues/145843
On a related note, all the built-in rules should be using the _health_report API indicators and not the _cluster/health indicators.
The _health_report API understands if shards are unassigned due to expected cluster actions, like new indices or restarting nodes:
https://github.com/elastic/elasticsearch/blob/3636d3d6ac492dda2dc2400e104b69319b753daa/server/src/main/java/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java#L408-L410
I don't know for sure if it tracks ILM transitions yet.
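To make the suggestion a bit more concrete, here is a minimal Python sketch of reading one indicator from a health report response instead of the overall cluster status. It assumes the response is a JSON object with an `indicators` map keyed by indicator name, each entry carrying a `status` field. The helper name `indicator_status` and the sample payload are hypothetical and hand-written for illustration, not captured from a real cluster.

```python
from typing import Any, Mapping, Optional


def indicator_status(
    report: Mapping[str, Any], name: str = "shards_availability"
) -> Optional[str]:
    """Return the status of one health indicator, or None if it is absent.

    Assumes `report` is the parsed JSON body of a health report response
    with an "indicators" object keyed by indicator name.
    """
    indicator = report.get("indicators", {}).get(name)
    if indicator is None:
        return None
    return indicator.get("status")


# Hand-written sample payload (an assumption, not real cluster output).
sample_report = {
    "status": "green",
    "indicators": {
        "shards_availability": {"status": "green"},
    },
}

print(indicator_status(sample_report))
```

A rule built this way would alert on the indicator's status rather than on the raw cluster color, so expected unassigned shards would not page anyone if the indicator already accounts for them.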
Publicly documenting a workaround/alternative for lower stack versions via manual Rule setup.
The example is taken on Elastic Cloud against v8.9.2 for Logs&Metrics data:
.ds-.monitoring-es*
cluster_state.status:red AND event.dataset:elasticsearch.cluster.stats
(Since Logs&Metrics polls every 10s, the threshold is calculated as 66% of the documents expected in the window, i.e. 66% of Xmins/10s. Both the 66% and Xmins = 5mins are arbitrary values I chose for the example.)
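The threshold arithmetic can be sketched in a few lines of Python. The function name `doc_count_threshold` is hypothetical, and rounding up to a whole document count is an assumption; the comment above does not say how the fraction was rounded.

```python
import math


def doc_count_threshold(
    window_minutes: float, poll_interval_seconds: float, fraction: float
) -> int:
    """Number of matching documents needed within the window.

    expected documents = window length / polling interval
    threshold          = fraction of expected documents, rounded up (assumption)
    """
    expected_docs = (window_minutes * 60) / poll_interval_seconds
    return math.ceil(expected_docs * fraction)


# 5 minutes at one document every 10 seconds is 30 expected documents;
# 66% of 30 is 19.8, which rounds up to 20.
print(doc_count_threshold(5, 10, 0.66))
```

A higher fraction makes the rule less sensitive to short red blips, at the cost of alerting later on a real outage.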
As part of the "Cluster health" rule, allow users to configure whether they want to receive alerts for Yellow, Red, or Both yellow and red. The default configuration value for the rule will stay as "Both yellow and red".
Combined with our changes in 7.15 to allow multiple rules of the same type, users can now configure different actions for Yellow (say, email) and Red (say, PagerDuty) if they want.
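A minimal Python sketch of the proposed setting, under stated assumptions: `AlertOn`, `should_alert`, `rules`, and `actions_for` are hypothetical names for illustration and do not reflect how the Kibana rule is implemented.

```python
from enum import Enum
from typing import List


class AlertOn(Enum):
    """Hypothetical rule setting: which statuses should raise an alert."""

    YELLOW = "yellow"
    RED = "red"
    BOTH = "both"


def should_alert(status: str, alert_on: AlertOn = AlertOn.BOTH) -> bool:
    """Return True if a cluster status should fire under the given setting."""
    status = status.lower()
    if status not in ("yellow", "red"):
        return False  # green never alerts
    return alert_on is AlertOn.BOTH or alert_on.value == status


# Two rules of the same type, each with its own illustrative action.
rules = [
    {"alert_on": AlertOn.YELLOW, "action": "email"},
    {"alert_on": AlertOn.RED, "action": "pagerduty"},
]


def actions_for(status: str) -> List[str]:
    """Collect the actions of every rule that fires for this status."""
    return [r["action"] for r in rules if should_alert(status, r["alert_on"])]


print(actions_for("red"))
```

Keeping `BOTH` as the default mirrors the requirement that existing rules behave as they do today unless the user changes the setting.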
Currently the Cluster health rule fires when the cluster health status changes from green to yellow OR red. There is no way for users to configure it to alert only when the cluster state changes to "red". Yellow status can happen based on temporary processing in Elasticsearch. Any action that creates a new index (rollover, shrink, mounting an index, close-and-reopen (through forcemerge w/codec change)) can cause the cluster to go briefly yellow.
Stretch goal: Besides adding the extra configuration (for Yellow, Red, or Both), we should look at the possibility of "look at last X minutes of data and alert only when we see all of them to be the same status" rather than just relying on the last document status.
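One way the stretch goal could work, sketched in Python under assumptions: status samples arrive as `(timestamp, status)` pairs, and the function name `sustained_status` is hypothetical. This is only an illustration of the "all samples in the last X minutes agree" idea, not the rule's actual implementation.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def sustained_status(
    samples: Iterable[Tuple[datetime, str]],
    now: datetime,
    window: timedelta,
) -> Optional[str]:
    """Return the status if every sample in the window agrees, else None.

    Returns None as well when the window holds no samples, so missing
    data never produces an alert on its own.
    """
    recent = [status for ts, status in samples if now - window <= ts <= now]
    if not recent:
        return None
    first = recent[0]
    return first if all(status == first for status in recent) else None


now = datetime(2023, 10, 31, 12, 0, 0)
window = timedelta(minutes=5)

# A brief red blip between yellow and green: no single sustained status.
blip = [
    (now - timedelta(seconds=30), "yellow"),
    (now - timedelta(seconds=20), "red"),
    (now - timedelta(seconds=10), "green"),
]
# Red for the whole window: a sustained red status.
outage = [(now - timedelta(seconds=10 * i), "red") for i in range(30)]

print(sustained_status(blip, now, window))
print(sustained_status(outage, now, window))
```

With this check, the rule would alert only when the returned status is yellow or red, so a short transition like a searchable snapshot mount would not fire.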