elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Response Ops][Alerting] Investigate ES query rule firing unexpectedly #175980

Open ymao1 opened 7 months ago

ymao1 commented 7 months ago

We have gotten several reports from users of receiving alert notifications from the ES query rule where they were unable to trace back to the underlying documents that may have generated the alert. We need to investigate how and why that may be happening.

There seem to be several commonalities between the rule definitions that spawn the zombie alerts:

elasticmachine commented 7 months ago

Pinging @elastic/response-ops (Team:ResponseOps)

pmuellr commented 7 months ago

I think we should start with a code review. One possibility here is that for some reason the date we are starting the search from is somehow set further back than we want (or not set at all), and so we find older documents. I already did this once and didn't see any obvious way this could happen, but it would be good to have another set of eyes. Note that the date to start querying from is derived from the dates of the documents returned, and is stored in the rule task state.
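For illustration, a minimal sketch of that mechanism, with hypothetical names (`EsQueryRuleState`, `latestTimestamp`, `getSearchStart`) rather than the actual rule executor code:

```ts
// Hypothetical sketch only; the names here do not correspond to the actual
// ES query rule implementation.
interface EsQueryRuleState {
  // ISO timestamp of the newest document seen by the previous run
  latestTimestamp?: string;
}

function getSearchStart(
  previousState: EsQueryRuleState,
  nowMs: number,
  timeWindowMs: number
): string {
  // Fall back to "now - window" when there is no prior state. If the stored
  // date were missing or somehow older than intended, the range would widen
  // and older documents could match: the failure mode suspected above.
  const fallbackMs = nowMs - timeWindowMs;
  const fromStateMs = previousState.latestTimestamp
    ? Date.parse(previousState.latestTimestamp)
    : NaN;
  const startMs = Number.isFinite(fromStateMs) ? Math.max(fromStateMs, fallbackMs) : fallbackMs;
  return new Date(startMs).toISOString();
}
```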

I think we could perhaps add some kind of diagnostic as well. Have the rule peek at the documents returned, and determine if they are in the search range. If not, log an error message with tons of info (search JSON, search options, results), and a unique tag we can search on, then remove those from the "active documents found".
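A rough sketch of what such a diagnostic could look like; the helper name, the tag, the logger shape, and the assumption that the configured time field holds an ISO timestamp are all hypothetical, not the actual implementation:

```ts
// Sketch only: drop hits whose time field falls outside the searched range,
// and log enough context to investigate, tagged so it is easy to search for.
const OUT_OF_RANGE_TAG = 'esQueryRuleDocOutOfRange'; // hypothetical tag

interface Hit {
  _id: string;
  _source: Record<string, any>;
}

function filterOutOfRangeHits(
  hits: Hit[],
  timeField: string,
  fromIso: string,
  toIso: string,
  searchBody: unknown,
  logger: { error: (msg: string) => void }
): Hit[] {
  const from = Date.parse(fromIso);
  const to = Date.parse(toIso);
  const inRange: Hit[] = [];
  const outOfRange: Hit[] = [];

  for (const hit of hits) {
    const ts = Date.parse(hit._source[timeField]);
    (ts > from && ts <= to ? inRange : outOfRange).push(hit);
  }

  if (outOfRange.length > 0) {
    logger.error(
      `[${OUT_OF_RANGE_TAG}] ${outOfRange.length} hit(s) outside (${fromIso}, ${toIso}]: ` +
        JSON.stringify({
          docs: outOfRange.map((h) => ({ id: h._id, time: h._source[timeField] })),
          searchBody,
        })
    );
  }

  // Only documents actually inside the window go on to create alerts.
  return inRange;
}
```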

pmuellr commented 7 months ago

I tried out an internal search facility to see if Elasticsearch is known to return documents not matching the specified filter:

question:

Does elasticsearch ever return documents not matching the search filter?

answer:

> Yes, Elasticsearch can sometimes return documents that do not match the search filter due to various reasons, including:
>
> 1. Query and Filter Context Misuse: If a query is placed in a filter context or vice versa, it may not behave as expected. Filters are cached and do not contribute to scoring, while queries are scored. Using them incorrectly can lead to unexpected results.
>
> 2. Cache Issues: The results of filters are cached for faster performance. If there's an issue with the cache, it might return incorrect results.
>
> 3. Shard Failures: If there are issues with some of the shards and they fail to return the correct documents, the overall result set might include documents that do not match the filter.
>
> 4. Mapping Issues: If the field mappings are not correctly defined, Elasticsearch might not be able to apply filters correctly, leading to unexpected documents in the result set.
>
> 5. Analyzer Configuration: If the field being filtered on is analyzed, the filter will apply to the tokens produced by the analyzer, not the raw text. This can lead to unexpected behavior if the analyzer configuration is not aligned with the expectations of the filter.
>
> 6. Version Mismatches: In a scenario where there are mixed versions of Elasticsearch nodes in a cluster, there might be inconsistencies in how filters are applied.
>
> 7. Nested Documents: When dealing with nested documents, if the query is not correctly structured to handle the nested nature of the data, documents that do not match the filter criteria might be returned.
>
> 8. Specific Bugs or Edge Cases: There might be specific bugs or edge cases in Elasticsearch that could cause it to return documents that do not match the search filter. For example, as mentioned in the provided context, App Search filters had unexpected behaviors when the filtered field name matched specific keywords like to, from, distance, unit, center, etc. This issue was fixed in versions 7.16 and above (as per the context from "/knowledge/view/d672e2bb").
>
> In general, Elasticsearch is designed to return accurate results based on the provided filters. However, configuration issues, bugs, or other unforeseen circumstances can lead to unexpected results. It's important to ensure that the cluster is healthy, the mappings are correct, and the queries are well-formed to avoid such issues.
>
> For the specific case of App Search filters returning unexpected results due to field names matching specific keywords, this was identified as a bug and fixed in later versions of Elasticsearch (as per the context from "/knowledge/view/d672e2bb").

We are currently placing the time window for the search in a range filter, so I don't think we're misusing query / filter here (1).
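For reference, a time window placed in filter context via a range clause looks roughly like this (the field name and bounds are illustrative, not taken from an actual rule):

```ts
// Illustrative only: the time window expressed as a range clause in the
// bool query's filter context.
const searchBody = {
  query: {
    bool: {
      filter: [
        {
          range: {
            '@timestamp': {
              gt: '2024-01-31T12:00:00.000Z', // end of the previous window
              lte: '2024-01-31T12:05:00.000Z', // "now" for this rule run
              format: 'strict_date_optional_time',
            },
          },
        },
        // ...plus whatever query / filters the rule is configured with
      ],
    },
  },
};
```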

Seems unlikely to be 4 as well, since we only see this transiently. Presumably we'd see it more consistently if there were a mapping issue - for instance, the same old doc showing up in the search hits for subsequent rule runs - but we don't.

We'd have to check if there is an analyzer for the time fields, but it seems hard to imagine that's the problem (5) - again, we'd see the same old docs in subsequent runs, but we've not seen that in practice.

AFAIK we do not have multiple versions of ES in the mix here, ruling out 6.

These aren't nested documents, ruling out 7.

That leaves 2, 3, and 8. I have NO IDEA how accurate this answer is - it's somewhat AI-generated :-). The simplest answer and perhaps easiest to check is whether we had a shard failure (3). I guess look in the ES logs around the time of the rule run?
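Besides the ES server logs, partial shard failures also show up in the `_shards` section of the search response itself, so one low-effort check would be to log that section whenever `failed > 0`. A sketch (the logger shape is hypothetical):

```ts
// Sketch: surface partial shard failures reported in the search response.
// The `_shards` section is part of the standard search response body.
interface ShardsSection {
  total: number;
  successful: number;
  skipped?: number;
  failed: number;
  failures?: Array<{ shard: number; index?: string; reason?: { type?: string; reason?: string } }>;
}

function warnOnShardFailures(shards: ShardsSection, logger: { warn: (msg: string) => void }) {
  if (shards.failed > 0) {
    logger.warn(
      `ES query rule search reported ${shards.failed}/${shards.total} failed shards: ` +
        JSON.stringify(shards.failures ?? [])
    );
  }
}
```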

pmuellr commented 3 months ago

I've looked into this a few times, and come up with nothing.

One thing from https://github.com/elastic/sdh-kibana/issues/4177 is that old documents, outside the range we were searching through, appear to have been returned as hits. That could explain the other referenced issues as well.

So, here's one thing we can do to try to "catch" that issue: add some code after the query is run to check that the documents' timestamps fall within the range we were searching for. If we find documents out of the range (basically, old documents that should not have been in the search), dump a bunch of info to the logger: the query, the document IDs / times, and maybe some of the search result metadata, to see if there are any clues there. Perhaps we should even return an error from the rule run, so that we do NOT create alerts, and make the failure more obvious than just being logged.
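If we did decide to fail the run, a sketch of the "throw instead of alerting" variant could look like this (again, hypothetical names and time field handling, not the actual rule code):

```ts
// Sketch: fail the rule run outright when out-of-range documents come back,
// so no alerts are created from suspect hits and the failure is visible.
function assertHitsInRange(
  hits: Array<{ _id: string; _source: Record<string, any> }>,
  timeField: string,
  fromIso: string,
  toIso: string
): void {
  const from = Date.parse(fromIso);
  const to = Date.parse(toIso);
  const bad = hits.filter((h) => {
    const ts = Date.parse(h._source[timeField]);
    return !(ts > from && ts <= to);
  });
  if (bad.length > 0) {
    // Throwing marks the rule execution as failed instead of creating alerts.
    throw new Error(
      `Out-of-range documents for window (${fromIso}, ${toIso}]: ` +
        bad.map((h) => `${h._id} @ ${h._source[timeField]}`).join(', ')
    );
  }
}
```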

These appear to be transient issues, so very likely the customer wouldn't notice the error in the rule run since the next one is likely to succeed; I'm a little torn on throwing the error.

pmuellr commented 1 month ago

Whoops, didn't really mean to close this - we still haven't figured out the problem, but we do have some new diagnostics from #186332 if we see this happen again ...