elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Rules and Alerting][Stack Monitoring] Prebuilt Rules for Elasticsearch Health API #190663

Open stefnestor opened 3 months ago

stefnestor commented 3 months ago

👋 howdy, team!

Elastic Support's default historical stance is that Elastic Cloud users should enable Logs&Metrics to ship data into Stack Monitoring and then enable its Alerts to be notified when administrative action is required on their Deployment(s).

Elasticsearch built out its Health API, which was earmarked Elastic-only in the initial v8.3 release; that earmarking was later dropped in v8.7. This API powers Elastic Cloud > Deployment > Health, which notifies users in the UI when intervention is required. AFAICT there is no email automation corresponding to this UI.
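For readers who want to see what an alert could key off of, here is a minimal sketch of reading the health report and keeping only the indicators that are not green. It assumes the `GET /_health_report` endpoint and its top-level `indicators` map with per-indicator `status` and `symptom` fields. `base_url`, `api_key`, the helper names, and the sample payload are placeholders for illustration, not real cluster output. Treating `unknown` as noteworthy is a choice of this sketch.

```python
import json
import urllib.request


def unhealthy_indicators(report):
    """Return {indicator_name: (status, symptom)} for every non-green indicator."""
    findings = {}
    for name, body in report.get("indicators", {}).items():
        status = body.get("status", "unknown")
        if status != "green":
            findings[name] = (status, body.get("symptom", ""))
    return findings


def fetch_health_report(base_url, api_key):
    """GET <base_url>/_health_report; base_url and api_key are placeholders."""
    request = urllib.request.Request(
        f"{base_url}/_health_report",
        headers={"Authorization": f"ApiKey {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Hand-written sample in the shape of a health report (not real output).
sample_report = {
    "status": "yellow",
    "indicators": {
        "master_is_stable": {"status": "green", "symptom": "sample symptom text"},
        "ilm": {"status": "yellow", "symptom": "sample symptom text"},
    },
}

print(unhealthy_indicators(sample_report))
```

Against a live cluster, `unhealthy_indicators(fetch_health_report(base_url, api_key))` would give the same kind of mapping, which is roughly the input an email or alert action would need.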

These two areas now overlap to some degree:

| alert | Stack Monitoring Alerts | Health API |
| --- | --- | --- |
| master_is_stable | (similar to cluster alerting) | true |
| shards_availability | (similar to cluster alerting) | true |
| disk | true | true |
| repository_integrity | FALSE | true |
| ilm | FALSE | true |
| slm | FALSE | true |
| shards_capacity | FALSE | true |
| cpu usage | true | FALSE |
| jvm memory | true | FALSE |
| missing monitoring data | true | NA |
| thread pool rejections | true | FALSE |
| ccr read exceptions | true | FALSE |
| large shard size | true | FALSE |
| cluster alerting (yellow, versions, nodes add/remove, license) | true | FALSE |
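To make the gap explicit, the table above can be encoded as data and filtered for checks that only the Health API covers. This is just a restatement of the table. The `COVERAGE` name, the `"similar"` marker for the "(similar to cluster alerting)" entries, and the `None` for "NA" are conventions of this sketch.

```python
# (Stack Monitoring Alerts, Health API) per check, copied from the table.
# "similar" marks "(similar to cluster alerting)"; None marks "NA".
COVERAGE = {
    "master_is_stable": ("similar", True),
    "shards_availability": ("similar", True),
    "disk": (True, True),
    "repository_integrity": (False, True),
    "ilm": (False, True),
    "slm": (False, True),
    "shards_capacity": (False, True),
    "cpu usage": (True, False),
    "jvm memory": (True, False),
    "missing monitoring data": (True, None),
    "thread pool rejections": (True, False),
    "ccr read exceptions": (True, False),
    "large shard size": (True, False),
    "cluster alerting": (True, False),
}


def health_api_only(coverage):
    """Checks the Health API reports but Stack Monitoring Alerts do not."""
    return [
        name
        for name, (alerts, health) in coverage.items()
        if alerts is False and health is True
    ]


print(health_api_only(COVERAGE))
```

The result is the set of checks that users relying only on Stack Monitoring Alerts are currently not notified about.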

AFAICT the Health API would be beneficial and directly actionable for all our admin users, whether on-prem or cloud-hosted. IMO it would be a good win across the board to enable all admins to be notified about our top known health checks. In particular, Support volume could be curbed by users being notified earlier of ILM, cluster default detection, and alternative disk stats.

Therefore, I'd like to request the following:

  1. Extend Stack Monitoring's ES-related metrics/stats to include its Health API data
  2. Extend Stack Monitoring to surface related data
  3. Extend Stack Monitoring Alerts to notify upon issue

I am not fully sure whether Kibana's "Stack Monitoring" covers only its UI or also its monitoring data ingest, so I have reached out internally in Slack for help getting this to the right team. Also noting that this could be either a quick and easy change or a huge project ask covering multiple teams; I don't know yet 🙂.

For (1) above, confirmation that this is not currently enabled in v8.15.0:

[four screenshots of the monitoring data] Note: the one just above does show data, but not Health API data.

Related

syepes commented 8 hours ago

Effectively this could be a great addition and it would also simplify custom needs