elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Alerts are not recovered correctly in ICMP monitoring due to 'ping timeout' #187592

Open iTiagoCO opened 2 months ago

iTiagoCO commented 2 months ago

Kibana version: 8.13.2

Elasticsearch version: 8.13.2

Describe the bug:

Alerts configured in an Observability rule do not recover when the condition that triggered them is no longer true. This looks like a bug.

Steps to reproduce:

1. Configure a rule in Kibana to monitor ICMP status with the condition MATCHING MONITORS ARE DOWN >= 3 times WITHIN last 10 minutes.
2. Shut down one of the monitored hosts to generate a "ping timeout".
3. Note that the alert fires correctly but does not recover when the host comes back online.

Expected behavior:

Alerts should automatically recover when the original condition that triggered them is no longer met.
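
To make the expected behavior concrete, here is a minimal sketch of how the condition "DOWN >= 3 times WITHIN last 10 minutes" would be expected to evaluate over time. This is not Kibana's rule executor; the function name `is_alert_active`, its parameters, and the timeline are hypothetical and only illustrate the semantics described above.

```python
from datetime import datetime, timedelta


def is_alert_active(down_timestamps, now, threshold=3, window=timedelta(minutes=10)):
    """Return True while at least `threshold` down checks fall inside the window.

    Hypothetical helper: it models the rule condition from this issue,
    not the actual Kibana implementation.
    """
    window_start = now - window
    recent_down = [t for t in down_timestamps if window_start <= t <= now]
    return len(recent_down) >= threshold


# Hypothetical timeline: three ping timeouts one minute apart,
# after which the host comes back online and no more down checks occur.
t0 = datetime(2024, 1, 1, 12, 0)
down_checks = [t0, t0 + timedelta(minutes=1), t0 + timedelta(minutes=2)]

# Shortly after the third timeout the condition is met, so the alert fires.
fired = is_alert_active(down_checks, now=t0 + timedelta(minutes=3))

# Fifteen minutes in, no down check is left inside the 10-minute window,
# so the alert would be expected to recover.
still_active = is_alert_active(down_checks, now=t0 + timedelta(minutes=15))

print(fired, still_active)
```

Under these assumptions, the second evaluation no longer meets the condition, which is the point at which a recovery would be expected.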

Additional context: I created a Discuss thread for this case that contains more details, screenshots, and logs.

https://discuss.elastic.co/t/problem-with-alerts-recovered/362036

elasticmachine commented 2 months ago

Pinging @elastic/response-ops (Team:ResponseOps)

elasticmachine commented 2 months ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

elasticmachine commented 2 months ago

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

iTiagoCO commented 2 months ago

Adding the error from the terminal:

[ERROR][plugins.ruleRegistry] ResponseError: {"errors":true,"took":5,"ingest_took":3,"items":[{"create":{"_index":".internal.alerts-observability.uptime.alerts-default-000046","_id":"2b6bc7a5-16d5-40d5-a4dc-a49179f60e5d","status":409,"error":{"type":"version_conflict_engine_exception","reason":"[2b6bc7a5-16d5-40d5-a4dc-a49179f60e5d]: version conflict, document already exists (current version [1])","index_uuid":"LSIoSjsTTRuuGWTTmaz7fQ","shard":"0","index":".internal.alerts-observability.uptime.alerts-default-000046"}}},{"create":{"_index":".internal.alerts-observability.uptime.alerts-default-000046","_id":"831e1226-0373-4d0e-bcfc-81a97bea8b67","status":409,"error":{"type":"version_conflict_engine_exception","reason":"[831e1226-0373-4d0e-bcfc-81a97bea8b67]: version conflict, document already exists (current version [1])","index_uuid":"LSIoSjsTTRuuGWTTmaz7fQ","shard":"0","index":".internal.alerts-observability.uptime.alerts-default-000046"}}},{"create":{"_index":".internal.alert...
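
For anyone triaging this, the log above is a bulk response in which `create` actions returned status 409. In Elasticsearch, a `create` action fails with `version_conflict_engine_exception` when a document with the same `_id` already exists in the target index. The sketch below shows one way to list the conflicting items from a response of that shape. The helper name `conflicting_ids` and the `sample_response` are hypothetical; the sample reuses the two `_id` values visible in the log, shortens the `reason` strings, and adds one made-up successful item for contrast. It does not reconstruct the truncated part of the log.

```python
def conflicting_ids(bulk_response):
    """Collect (action, index, id) for bulk items rejected with a version conflict.

    Hypothetical helper for reading a bulk response like the one logged above.
    """
    conflicts = []
    for item in bulk_response.get("items", []):
        # Each item is a single-key dict: {"create": {...}}, {"index": {...}}, etc.
        for action, result in item.items():
            error = result.get("error") or {}
            if (
                result.get("status") == 409
                and error.get("type") == "version_conflict_engine_exception"
            ):
                conflicts.append((action, result.get("_index"), result.get("_id")))
    return conflicts


ALERTS_INDEX = ".internal.alerts-observability.uptime.alerts-default-000046"

# Illustrative sample only: two conflicts taken from the log, one invented success.
sample_response = {
    "errors": True,
    "items": [
        {"create": {"_index": ALERTS_INDEX,
                    "_id": "2b6bc7a5-16d5-40d5-a4dc-a49179f60e5d",
                    "status": 409,
                    "error": {"type": "version_conflict_engine_exception",
                              "reason": "version conflict, document already exists"}}},
        {"create": {"_index": ALERTS_INDEX,
                    "_id": "831e1226-0373-4d0e-bcfc-81a97bea8b67",
                    "status": 409,
                    "error": {"type": "version_conflict_engine_exception",
                              "reason": "version conflict, document already exists"}}},
        {"create": {"_index": ALERTS_INDEX,
                    "_id": "hypothetical-new-alert-id",
                    "status": 201}},
    ],
}

conflicts = conflicting_ids(sample_response)
for action, index, doc_id in conflicts:
    print(f"{action} conflict in {index}: {doc_id}")
```

Whether these conflicts are what prevents the alerts from recovering is not established by the log alone; the listing is only meant to make the failing documents easier to inspect.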

jasonrhodes commented 1 week ago

NOTE: There seems to be a lot of additional detail in the linked Discuss thread. It might be a good idea to capture more of that here in this issue. We'll try to recreate this and see what we can discover.