henrikno opened 11 months ago
Pinging @elastic/response-ops (Team:ResponseOps)
Can you provide the rule type and the parameters used in the rule?
Another case where we had this problem was with the "Elasticsearch query" rule
rule check: every 5 minutes
potentially related to https://github.com/elastic/kibana/issues/168293
The action being used was iterating over context.hits
to print a field from the document hits. We advised also printing {{_source._id}}
from the hits, so that if this happens again we will see the actual document IDs that the search returned. Hopefully this will provide more background on what is happening.
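A minimal sketch of such an action message, assuming the "Elasticsearch query" rule's mustache context variables (`context.value`, `context.hits`); the `host.name` field is hypothetical, and the exact path of the document id within each hit may differ from what the rule returns:

```mustache
Rule {{rule.name}} matched {{context.value}} document(s):
{{#context.hits}}
- id: {{_id}} (index: {{_index}}), host: {{_source.host.name}}
{{/context.hits}}
```

With the ids included in the notification, a firing-with-zero-hits run would at least show which documents (if any) the search actually returned.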
@henrikno I talked to @ymao1 and @pmuellr about this issue. We have other SDHs related to this problem, but in those we do not have access to the data as we do here. To find a solution we need to investigate, and to do that we need to log a little more information in the message, such as the alertId (the _id
of the document) and the timestamp of the alert.
Do you think that's possible? And will we be able to access this Kibana instance?
Created a dedicated investigation issue for this (https://github.com/elastic/kibana/issues/175980) and am linking this issue for the rule definition.
Kibana version: 8.10.2
Elasticsearch version: 8.10.2
Server OS version: Elastic Cloud
Original install method (e.g. download page, yum, from source, etc.): Elastic Cloud
Describe the bug: We have an alert that queries for a specific document appearing at least 8 times within 10 minutes over a remote CCS (cross-cluster search) connection. The alert triggered, but when we checked there were zero documents matching the query, and we had not deleted any documents. The rule's execution history does not say the query failed; it shows up as "Succeeded", yet gives no information about what triggered the alert. The only hint that something iffy happened is that the query took 15 seconds instead of the usual 1-2 seconds.
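For reference, a rule like the one described would be backed by a query body roughly like the following sketch; the index pattern `remote_cluster:logs-*` and the field `event.code` are hypothetical placeholders, and the "at least 8 matches" threshold is configured on the rule itself, not inside the query:

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.code": "the-specific-document-marker" } },
        { "range": { "@timestamp": { "gte": "now-10m" } } }
      ]
    }
  }
}
```

If a transient CCS failure caused this query to return a partial or empty result that the rule still treated as meeting the threshold, that could explain an alert firing with zero visible hits.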
Steps to reproduce:
Expected behavior: I expected the alert not to fire because there were no hits, or at least to include context explaining that it fired because it could not get results.
Ideally it would not trigger on a transient issue, but would trigger if the issue is sustained (for a configurable time). For instance, this seems to trigger when we do an upgrade, but then resolves itself.
Screenshots (if relevant):
Provide logs and/or server output (if relevant):
Any additional context: