Kibana alert fires when it should not have due to temporary disconnect of remote CCS connection

henrikno commented 11 months ago

Kibana version: 8.10.2

Elasticsearch version: 8.10.2

Server OS version: Elastic Cloud

Original install method (e.g. download page, yum, from source, etc.): Elastic Cloud

Describe the bug: We have an alert that queries for a specific document showing up at least 8 times within 10 minutes over a remote CCS connection. The alert triggers, but when we check, zero documents match the query, and we did not delete any documents. The history does not say that the query failed; it shows up as "Succeeded", yet gives no info about what triggered it. The only hint that something iffy happened is that the query took 15 seconds instead of the normal 1-2 seconds.

Steps to reproduce:

1. Create a Kibana alert that queries over a remote connection every minute.
2. Restart nodes, do an upgrade, or disconnect the nodes in any way.
3. Kibana alert triggers, the history shows Succeeded, but no info about why it triggered. It does not show up as timeout or failed/unknown status.

Expected behavior: I expected the alert not to fire because there were no hits, or at least to give context that it fired because it could not get results.
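To illustrate what "context about it firing because it could not get results" could look like, here is a minimal sketch that inspects a search response for signs that it was incomplete. It is not how Kibana's rule executor works; `SearchResponseSummary` and `checkCompleteness` are hypothetical names, and the sketch assumes the response carries the `timed_out`, `_shards`, and `_clusters` sections that Elasticsearch returns for cross-cluster searches.

```typescript
// Hypothetical, minimal view of the response fields this sketch inspects.
// All sections are optional because not every response includes them.
interface SearchResponseSummary {
  timed_out?: boolean;
  _shards?: { total: number; successful: number; failed: number };
  _clusters?: { total: number; successful: number; skipped: number };
}

interface Completeness {
  complete: boolean;
  reasons: string[];
}

// Collects human-readable reasons why a response may not reflect all data.
function checkCompleteness(res: SearchResponseSummary): Completeness {
  const reasons: string[] = [];
  if (res.timed_out) {
    reasons.push("search timed out");
  }
  if (res._shards && res._shards.failed > 0) {
    reasons.push(`${res._shards.failed} shard(s) failed`);
  }
  if (res._clusters && res._clusters.skipped > 0) {
    reasons.push(
      `${res._clusters.skipped} of ${res._clusters.total} cluster(s) skipped`
    );
  }
  return { complete: reasons.length === 0, reasons };
}
```

A rule run could attach `reasons` to its execution history so that a "Succeeded" run with partial data is distinguishable from one that saw every cluster.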

The ideal scenario would be to not trigger on a transient issue, but to trigger if the issue is sustained (for a configurable time). For instance, this seems to trigger when we do an upgrade, but then resolves itself.
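The "sustained for a configurable time" idea can be sketched as a small gate that only reports true once a condition has held continuously for a minimum duration. This is an illustration of the requested behavior under assumptions, not an existing Kibana feature; `SustainedConditionGate` and its parameters are hypothetical names.

```typescript
// Hypothetical gate: passes only when the condition has been met on every
// check for at least `minDurationMs` milliseconds.
class SustainedConditionGate {
  private readonly minDurationMs: number;
  private firstSeenMs: number | null = null;

  constructor(minDurationMs: number) {
    this.minDurationMs = minDurationMs;
  }

  // Call once per rule run. `nowMs` is the time of the run in milliseconds.
  update(conditionMet: boolean, nowMs: number): boolean {
    if (!conditionMet) {
      // Any run where the condition is not met resets the streak.
      this.firstSeenMs = null;
      return false;
    }
    if (this.firstSeenMs === null) {
      this.firstSeenMs = nowMs;
    }
    return nowMs - this.firstSeenMs >= this.minDurationMs;
  }
}

// Usage: with a 2-minute threshold and a 1-minute check interval, a blip
// that lasts a single run never passes the gate.
const gate = new SustainedConditionGate(2 * 60 * 1000);
gate.update(true, 0);       // false: streak just started
gate.update(false, 60_000); // false: streak reset
```

The trade-off is added latency: a real problem is reported only after it has persisted for the configured duration.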

elasticmachine commented 11 months ago

Pinging @elastic/response-ops (Team:ResponseOps)

pmuellr commented 7 months ago

Can you provide the rule type, and parameters used in the rule?

elkargig commented 7 months ago

Another case where we had this problem was with the "Elasticsearch query" rule.

Rule check: every 5 minutes

pmuellr commented 7 months ago

Potentially related to https://github.com/elastic/kibana/issues/168293

pmuellr commented 7 months ago

The action being used was iterating over context.hits to print a field from the doc hits. We advised also printing {{_source._id}} from the hits, so that if this happens again we will see the actual document IDs that the search returned. Hopefully this will provide more background on what is happening.

XavierM commented 7 months ago

@henrikno I talked to @ymao1 and @pmuellr about this issue. We have other SDHs related to this problem, but for those we do not have access to the data as we do here. To find a solution we need to investigate, and to do that we need to log a bit more information in the message, such as the alertId (_id of the document) and the timestamp of the alert.

Do you think that's possible? And will we be able to access this Kibana instance?

ymao1 commented 7 months ago

Created a dedicated investigation issue for this, https://github.com/elastic/kibana/issues/175980, and linked this issue from it for the rule definition.
