Enhance "Cluster health" stack monitoring rule to allow user configuration for Yellow/Red/Both

elastic / kibana

Your window into the Elastic Stack

https://www.elastic.co/products/kibana

Enhance "Cluster health" stack monitoring rule to allow user configuration for Yellow/Red/Both #113445

Open ravikesarwani opened 2 years ago

ravikesarwani commented 2 years ago

As part of the "Cluster health" rule, allow users to configure whether they want to receive alerts for Yellow, Red, or Both (yellow and red). The default configuration value for the rule will stay as "Both yellow and red".

Combined with our changes in 7.15 that allow multiple rules of the same type, users can now configure different actions for Yellow (say, email) and Red (say, PagerDuty) if they want.

Currently the Cluster health rule fires when the cluster health status changes from green to yellow OR red. There is no way for users to configure the rule to alert only when the cluster status changes to "red".

Yellow status can happen based on temporary processing in Elasticsearch. Any action that creates a new index (rollover, shrink, mounting an index, close-and-reopen (through forcemerge w/codec change)) can cause the cluster to go briefly yellow.
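
To make the requested option concrete, here is a minimal sketch of the filtering logic, not the actual Kibana rule code. The names `AlertOn` and `should_alert` are hypothetical and only illustrate the Yellow/Red/Both choice with "Both" as the default.

```python
from enum import Enum


class AlertOn(Enum):
    """Hypothetical rule parameter: which statuses should raise an alert."""
    YELLOW = "yellow"
    RED = "red"
    BOTH = "both"


def should_alert(status: str, alert_on: AlertOn = AlertOn.BOTH) -> bool:
    """Return True if the given cluster status matches the configured option."""
    status = status.lower()
    if alert_on is AlertOn.BOTH:
        # Default behavior: alert on any non-green status.
        return status in ("yellow", "red")
    # Otherwise alert only on the single configured status.
    return status == alert_on.value
```

With two rules of the same type, one could use `AlertOn.YELLOW` with an email action and the other `AlertOn.RED` with a paging action.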

Stretch goal: Besides adding the extra configuration (for Yellow, Red, or Both), we should look at the possibility of "look at last X minutes of data and alert only when we see all of them to be the same status" rather than just relying on the last document status.
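
The stretch goal could be sketched roughly as follows, assuming the rule has already collected the status values from every monitoring document in the last X minutes. The function name `sustained_status` is hypothetical.

```python
from typing import Optional, Sequence


def sustained_status(statuses: Sequence[str]) -> Optional[str]:
    """Return the status only if every sample in the window agrees.

    `statuses` is assumed to hold one status per monitoring document
    in the lookback window, oldest first. Returns None when the window
    is empty or the samples disagree (for example, a brief blip).
    """
    if not statuses:
        return None
    first = statuses[0].lower()
    if all(s.lower() == first for s in statuses):
        return first
    return None
```

A window such as `["yellow", "red", "green"]` returns `None`, so a short-lived change would not alert, while `["red", "red", "red"]` returns `"red"`.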

stefnestor commented 10 months ago

I'll also note for the public record that ILM Searchable Snapshots coming up on the Frozen tier can blip the cluster to status:red with no action required from Dev on-calls. For example, Elasticsearch logs the following as each index hits phase/action/step frozen/searchablesnapshots/mount-snapshot (or maybe wait-for-index-color; sorry, these happen really close together so I can't fully tell):

Cluster health status changed from [YELLOW] to [RED] (reason: [snapshot shard size updated]).
Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[partial-restored-my_index-2023.10.31-000001][0]]]).

This would be resolved by the stretch goal in the description:

we should look at the possibility of "look at last X minutes of data and alert only when we see all of them to be the same status" rather than just relying on the last document status.

Which kinda overlaps with https://github.com/elastic/kibana/issues/145843

VimCommando commented 10 months ago

On a related note, all the built-in rules should be using the _health_report API indicators and not the _cluster/health indicators.

The _health_report understands if shards are unassigned due to expected cluster actions, like new indices or restarting nodes: https://github.com/elastic/elasticsearch/blob/3636d3d6ac492dda2dc2400e104b69319b753daa/server/src/main/java/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java#L408-L410

I don't know for sure if it tracks ILM transitions yet.
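
As a rough illustration of reading per-indicator status rather than a single cluster color, here is a sketch that works on an already-fetched `_health_report` response body. The sample dictionary is hand-written and abbreviated for illustration, and the helper names are hypothetical; only the top-level `indicators` map with a `status` per indicator is assumed.

```python
from typing import Any, Dict, Optional

# Hand-written, abbreviated sample shaped like a _health_report response.
# The values are illustrative, not captured from a real cluster.
SAMPLE_REPORT: Dict[str, Any] = {
    "status": "yellow",
    "indicators": {
        "shards_availability": {"status": "green", "symptom": "illustrative text"},
        "disk": {"status": "yellow", "symptom": "illustrative text"},
    },
}


def indicator_status(report: Dict[str, Any], name: str) -> Optional[str]:
    """Return the status of one named indicator, or None if it is absent."""
    indicator = report.get("indicators", {}).get(name)
    return None if indicator is None else indicator.get("status")


def non_green_indicators(report: Dict[str, Any]) -> Dict[str, str]:
    """Map each indicator whose status is not green to its status."""
    return {
        name: body.get("status")
        for name, body in report.get("indicators", {}).items()
        if body.get("status") != "green"
    }
```

A rule built this way could alert on `shards_availability` alone instead of the overall cluster color.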

stefnestor commented 10 months ago

Publicly documenting a workaround/alternative for lower stack versions via manual Rule setup.

Example is taken on Elastic Cloud against version v8.9.2 for Logs&Metrics data:

Create Data View for .ds-.monitoring-es*
Create EQL Rule with the condition "is above" count 20 for the last 5mins, using the Lucene filter cluster_state.status:red AND event.dataset:elasticsearch.cluster.stats. (Since Logs&Metrics polls every 10s, a window of Xmins holds Xmins/10s documents, i.e. 30 documents for 5mins; 20 is 66% of that. Both the 66% threshold and Xmins as 5mins are arbitrary choices I made for this example.)
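
The count of 20 comes from simple arithmetic, sketched below so that other windows or thresholds can be tried. The function name `doc_count_threshold` is hypothetical; the inputs are the 5-minute window, 10-second polling interval, and 66% fraction from the example.

```python
def doc_count_threshold(window_minutes: int, poll_seconds: int, fraction: float):
    """Return (expected documents in the window, rounded count threshold)."""
    # Number of monitoring documents expected in the window.
    expected = (window_minutes * 60) // poll_seconds
    # Share of those documents that must report the status.
    threshold = round(expected * fraction)
    return expected, threshold


# 5 minutes at one document every 10 seconds, with a 66% threshold.
expected_docs, threshold = doc_count_threshold(5, 10, 0.66)
```

Here `expected_docs` is 30 and `threshold` is 20, matching the count used in the rule.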