ymao1 opened 7 months ago
Pinging @elastic/response-ops (Team:ResponseOps)
I think we should start with a code review. One possibility is that the date we start the search from is somehow set further back than intended (or not set at all), so we pick up older documents. I already did this once and didn't see any obvious way this could happen, but it would be good to have another set of eyes. Note that the date to start querying from is derived from the dates of the documents returned, and is stored in the rule task state.
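For context, here is a minimal sketch of how such a cursor could work, assuming the state carries an ISO timestamp of the newest document seen; the names (`EsQueryRuleState`, `latestTimestamp`, etc.) are illustrative and not the actual rule implementation:

```ts
// Hypothetical sketch of a "search from" cursor carried in rule task state.
// Field and function names are made up for illustration.
interface EsQueryRuleState {
  // ISO timestamp of the newest document seen in the last run; the next
  // run searches from here forward.
  latestTimestamp?: string;
}

function nextSearchStart(state: EsQueryRuleState, fallback: Date): Date {
  // If the stored cursor is missing or unparseable, fall back to a default
  // window -- one way the search could silently reach further back than intended.
  const parsed = state.latestTimestamp ? Date.parse(state.latestTimestamp) : NaN;
  return Number.isNaN(parsed) ? fallback : new Date(parsed);
}

function updateState(
  state: EsQueryRuleState,
  hits: Array<{ timestamp: string }>
): EsQueryRuleState {
  // Advance the cursor to the max timestamp among the returned documents.
  // Lexicographic comparison is safe for same-precision ISO 8601 UTC strings.
  const max = hits.reduce(
    (acc, h) => (h.timestamp > acc ? h.timestamp : acc),
    state.latestTimestamp ?? ''
  );
  return { latestTimestamp: max || undefined };
}
```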
I think we could also add some kind of diagnostic: have the rule peek at the documents returned and determine whether they are in the search range. If not, log an error message with plenty of info (search JSON, search options, results) and a unique tag we can search on, then remove those documents from the "active documents found".
I tried out an internal search facility to see if Elasticsearch is known to return documents not matching the specified filter:
question:
Does elasticsearch ever return documents not matching the search filter?
We are currently placing the time window for the search in a range filter, so I don't think we're misusing query / filter here (1).
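For reference, here is a hedged sketch of the query shape in question, assuming the time field is `@timestamp` (the real rule uses whatever time field the rule is configured with, and these dates are placeholders):

```ts
// Illustrative shape of the search body -- not the rule's exact DSL.
const searchBody = {
  query: {
    bool: {
      filter: [
        // Range clauses in filter context are not scored, but they still
        // constrain which documents can match.
        {
          range: {
            '@timestamp': {
              gt: '2024-05-01T00:00:00.000Z', // start of window (exclusive)
              lte: '2024-05-01T00:01:00.000Z', // end of window (inclusive)
            },
          },
        },
      ],
    },
  },
  sort: [{ '@timestamp': { order: 'desc' } }],
};
```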
It seems unlikely to be (4) as well, since we only see this transiently; if there were a mapping issue, we'd presumably see it more consistently, e.g. the same old doc showing up in the search hits on subsequent rule runs. We don't.
We'd have to check whether there is an analyzer on the time fields, but it's hard to imagine that's the problem (5); again, we'd expect to see the same old docs in subsequent runs, and we haven't in practice.
AFAIK we do not have multiple versions of ES in the mix here, ruling out (6).
These aren't nested documents, ruling out (7).
That leaves (2), (3), and (8). I have NO IDEA how accurate this answer is; it's somewhat AI-generated :-). The simplest and perhaps easiest to check is a shard failure (3). I guess look in the ES logs around the time of the rule run?
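One cheap way to check for (3) from the rule side would be to inspect the `_shards` section of the search response; a sketch, with the logging callback standing in for the rule's real logger:

```ts
// A partial result (failed > 0) means some shards did not contribute,
// and the hit set may be incomplete or inconsistent.
interface ShardStats {
  total: number;
  successful: number;
  skipped: number;
  failed: number;
  failures?: unknown[];
}

function warnOnShardFailures(shards: ShardStats, log: (msg: string) => void): void {
  if (shards.failed > 0) {
    log(
      `search returned partial results: ${shards.failed}/${shards.total} shards failed; ` +
        `failures: ${JSON.stringify(shards.failures ?? [])}`
    );
  }
}
```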
I've looked into this a few times, and come up with nothing.
One thing from https://github.com/elastic/sdh-kibana/issues/4177 is that old documents, outside the range we were searching through, appear to have been returned as hits. That could explain the other referenced issues as well.
So, here's one thing we can do to try to "catch" that issue: add some code after the query is run to check that the documents' timestamps fall within the range we searched. If we find documents out of range (basically, old documents that should not have been in the search), dump a bunch of info to the logger: the query, the document IDs / timestamps, maybe some of the search result metadata to see if there are any clues there. Perhaps we should even return an error from the rule run, so that we do NOT create alerts, and make the failure more obvious (than just being logged).
These appear to be transient issues, so the customer very likely wouldn't notice the error in the rule run, since the next run is likely to succeed; I'm a little torn on throwing the error. A sketch of the check follows below.
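A minimal sketch of that diagnostic, assuming each hit exposes the configured time field as an ISO 8601 UTC string; the tag name and helper are made up for illustration, not existing code:

```ts
const ZOMBIE_ALERT_TAG = 'esQueryRuleOutOfRangeHits'; // unique, searchable tag

interface Hit {
  _id: string;
  timestamp: string; // value of the rule's configured time field
}

function checkHitsInRange(
  hits: Hit[],
  rangeStart: string, // exclusive, matching the `gt` clause
  rangeEnd: string, // inclusive, matching the `lte` clause
  query: unknown,
  log: (msg: string) => void
): Hit[] {
  // Lexicographic comparison is safe for same-precision ISO 8601 UTC strings.
  const outOfRange = hits.filter((h) => h.timestamp <= rangeStart || h.timestamp > rangeEnd);
  if (outOfRange.length > 0) {
    // Dump everything we might need to debug after the fact, under a
    // unique tag we can search the logs for.
    log(
      `[${ZOMBIE_ALERT_TAG}] ${outOfRange.length} hit(s) outside (${rangeStart}, ${rangeEnd}]: ` +
        JSON.stringify({
          query,
          docs: outOfRange.map(({ _id, timestamp }) => ({ _id, timestamp })),
        })
    );
    // Alternative: throw here so the rule run fails and no alerts are created.
  }
  // Keep only in-range hits for alert creation.
  return hits.filter((h) => h.timestamp > rangeStart && h.timestamp <= rangeEnd);
}
```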
Whoops, didn't really mean to close this. We still haven't figured out the problem, but we do have some new diagnostics from #186332 if we see this happen again ...
We have gotten several reports from users of receiving alert notifications from the ES query rule that they were unable to trace back to the underlying documents that may have generated the alert. We need to investigate how and why that may be happening.
There seem to be several commonalities between the rule definitions that spawn the zombie alerts:

- `excludeMatchesFromPreviousRuns` set to `true`