Closed: fjl82 closed this issue 2 months ago
Greetings! Can you expand a bit on this?
alerts started coming in with a high message count but no messages listed in the email (normally max 3 are included)
Can you clarify what "high message count" means? Does this mean the search query for the event definition returned an unusually high number of messages? Or is this specifically about the event notification? Can you provide a screenshot to help clarify?
Using the search replay shows no messages either
Can you provide a screenshot? To clarify, is this when you click the replay search URL in the notification, or via the alerts screen in Graylog?
Trying to downgrade Opensearch fails
This is expected unfortunately. Typically new OpenSearch versions introduce a new Lucene version. OpenSearch 2.16 updated lucene from 9.10 to 9.11.1. Unfortunately it is impossible to downgrade the lucene version of an OpenSearch cluster.
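If it helps for future upgrades, the Lucene version a node is running is reported by the root endpoint, so it can be checked before and after an upgrade. A rough sketch (Python with the requests package; the URL is a placeholder, add auth if your cluster needs it):

```python
import requests

# Placeholder URL - adjust for your cluster (and add authentication if needed).
OPENSEARCH_URL = "http://localhost:9200"

# The root endpoint reports both the OpenSearch version and the bundled
# Lucene version (e.g. OpenSearch 2.16.0 ships Lucene 9.11.1).
info = requests.get(OPENSEARCH_URL).json()
print("OpenSearch:", info["version"]["number"])
print("Lucene:    ", info["version"]["lucene_version"])
```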
Can you share your server.log, an excerpt, or any applicable error messages?
Thanks!
Hi! With "high message count" i mean this: The search query seemed to return nothing, but the count is high, even though it actually should be zero, and therefor triggers an alert (via email). The alert is configured to check every minute and alert every hour. Search replay: This is also reflected in the emails. Normally there are the first 3 messages included, but this is empty for these "false" emails.
The server.log does not seem to have much useful information regarding this, as far as I can tell. Just a lot of lines like:
2024-08-09T06:56:08.922+02:00 WARN [PivotAggregationSearch] Removing non-existing streams <[582b0b0561432f0403f4d8d0]> from event definition <5f4612866ece0c18ab09e6e4>/<...>
Not sure what they mean, but they sound like legacy events that no one cares about anymore and that are probably broken due to an invalid configuration. This only happens for a subset of the events, not all of them; the ones I care about do not produce such log messages.
[edit:] Oh yeah, forgot to mention: the streams themselves still work fine, and viewing them inside Graylog works as well. If I open them from the Streams overview I can view messages like usual. A few real events were logged yesterday. This problem only applies to alerting.
[edit2:] I deleted the old unused alerts and streams and the server.log is now silent.
the streams themselves still work fine, and viewing them inside Graylog works as well. If I open them from the Streams overview I can view messages like usual
Can you confirm the stream IDs match between the ID in server.log and the ID in the browser address bar for that stream?
I deleted the old unused alerts and streams and the server.log is now silent.
Good to hear. Are you confident OpenSearch 2.16.0 is working as expected and Graylog is working as expected?
I think we're seeing the same issue after an upgrade. I tried a quick grep for "Removing non-existing" in server.log, but didn't see anything.
For example, one alert has a condition of "count > 600", and a replay of the search shows a count of 14. But the alert is triggered due to the claimed count of "950013".
Some filter that's not set correctly anymore?
It seems we upgraded both OpenSearch from 2.14.0 to 2.16.0 and Graylog from 6.0.2 to 6.0.5 today.
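In case it helps others compare, a raw count query against OpenSearch (outside of Graylog) can be used to cross-check what the search replay shows. A rough sketch only; the index pattern, query, and time range below are placeholders, not the exact query Graylog generates:

```python
import requests

OPENSEARCH_URL = "http://localhost:9200"   # placeholder
INDEX_PATTERN = "graylog_*"                # placeholder index pattern

# Placeholder query: substitute the stream/event definition's actual
# search query and time range here.
body = {
    "query": {
        "bool": {
            "filter": [
                {"query_string": {"query": "source:app01 AND level:3"}},
                {"range": {"timestamp": {"gte": "now-1h"}}},
            ]
        }
    }
}

# _count applies the query as given, so it should agree with the search
# replay (14 in the example above), not with the count the event claims.
resp = requests.post(f"{OPENSEARCH_URL}/{INDEX_PATTERN}/_count", json=body)
print(resp.json()["count"])
```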
the streams themselves still work fine, and viewing them inside Graylog works as well. If I open them from the Streams overview I can view messages like usual

Can you confirm the stream IDs match between the ID in server.log and the ID in the browser address bar for that stream?

I deleted the old unused alerts and streams and the server.log is now silent.

Good to hear. Are you confident OpenSearch 2.16.0 is working as expected and Graylog is working as expected?
I think you misunderstood. The lines in the server.log were related to old streams I was no longer interested in. So I deleted those streams, and those specific lines in server.log have stopped. But they had no relation to the problem I was seeing; I posted them because you asked for things in server.log, not because they had a connection to the issue. Those "Removing non-existing" lines were caused by an alert referring to a stream that had already been deleted.
What @dhedberg is describing sounds like the same problem we're having. What I didn't mention before is that I also had this problem on Graylog 6.0.4; that version was running on the day I found the issue. I upgraded to 6.0.5 to see if it would resolve the problem, but it didn't.
@fjl82 thank you for clarifying.
Do you feel comfortable sharing the event definition that is causing this, if possible by exporting it to a content pack? My goal is to try to understand how to reproduce the issue.
If I may, a summary of the issue: your event criteria are not behaving as expected, and the resulting query returns much more data than you expect (you expect 0). The replay search (or running the search query directly) shows different results than the event.
Is this correct?
Without having made any effort to understand the code and queries involved, I took a quick look at the OpenSearch issue tracker.
Might https://github.com/opensearch-project/OpenSearch/issues/15169 be related? Just based on the fact that it apparently broke in 2.16.0 and involves a query being ignored.
@dhedberg potentially, yes. We were looking into that very issue to see if it could be the root cause here, but we were unable to recreate it in our test environments running OpenSearch 2.16.0. We were hoping to get more information about any event definitions that were causing the problem so we can reliably reproduce the issue and figure out whether it is on our end or due to that (or another) OpenSearch issue.
All our events are causing this. I don't know anything about content packs (never used or created them), but our event definition looks like this: It is using a stream with very simple rules:
I've read that OpenSearch issue report and it does sound like it could cause this problem. So it might not be a Graylog issue after all; there's likely nothing to fix on the Graylog side, and it may just be an OpenSearch bug that will be fixed in a minor update. I just wish they made it easy to roll back to a previous version; the current version lock is very problematic.
I can confirm that the OpenSearch queries we generate for alerting do not return proper results against 2.16.0. All filters are ignored due to the use of the date_range aggregation. This aggregation is not used in search, i.e. when the generated event is replayed, which explains why the results are as expected in that case.
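For anyone who wants to check whether their own cluster is affected, here is a rough sketch of the kind of query involved (Python with the requests package; the index name, field, and query are placeholders, not the exact query Graylog generates). The idea is to run a filtered search with a date_range aggregation and compare the bucket count with the filtered hit count:

```python
import requests

OPENSEARCH_URL = "http://localhost:9200"   # placeholder
INDEX = "graylog_0"                        # placeholder index

# A query that should match only a small subset of documents, combined
# with a date_range aggregation similar in shape to what the alerting
# search uses (an illustration, not the exact generated query).
body = {
    "size": 0,
    "query": {"query_string": {"query": "level:3"}},
    "aggs": {
        "timerange": {
            "date_range": {
                "field": "timestamp",
                "ranges": [{"from": "now-1h", "to": "now"}],
            }
        }
    },
}

resp = requests.post(f"{OPENSEARCH_URL}/{INDEX}/_search", json=body).json()
bucket = resp["aggregations"]["timerange"]["buckets"][0]

# On an unaffected cluster the bucket doc_count stays consistent with the
# filtered hit count; on an affected 2.16.0 node the bucket is counted as
# if the query filter were not there.
print("filtered hits:", resp["hits"]["total"]["value"])
print("bucket count: ", bucket["doc_count"])
```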
I can confirm that after upgrading graylog-server from 6.0.4 to 6.0.5 and OpenSearch from 2.15.0 to 2.16.0 (OS = AlmaLinux 9.4), alerts are no longer working: the count function matches numbers that aren't shown when replaying the search. Example: Event definition:
Event match:
Replaying search:
There seems to be a workaround: https://github.com/opensearch-project/OpenSearch/issues/15169#issuecomment-2289241726
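If I'm reading that comment correctly, the workaround is a dynamic cluster setting that disables the new filter-rewrite optimization for range aggregations. A rough sketch of applying it (the setting name below is my reading of the linked comment; please verify it there before applying it to your cluster):

```python
import requests

OPENSEARCH_URL = "http://localhost:9200"   # placeholder

# Setting name as I read it in the linked comment - please verify there.
# Setting it to 0 disables the range-aggregation filter-rewrite
# optimization that appears to cause the wrong counts; it is a dynamic
# setting, so no node restart should be needed.
settings = {"persistent": {"search.max_aggregation_rewrite_filters": 0}}

resp = requests.put(f"{OPENSEARCH_URL}/_cluster/settings", json=settings)
print(resp.json())
```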
Thanks @bernd, I applied this setting and alerts seem to work ok again now.
I can confirm that the workaround mentioned by @bernd is working for me, thanks!
We have published an advisory regarding this issue that includes the workaround: https://graylog.org/post/alert-notice-opensearch-v2-16/
Last night, OpenSearch was upgraded from 2.15.0 to 2.16.0. Nothing else was changed. After this, alerts started coming in with a high message count but no messages listed in the email (normally a maximum of 3 are included). Using the search replay shows no messages either. It seems to apply to all configured alerts. Normal message searches still work fine.

Trying to downgrade OpenSearch fails. On startup it stops with an error:

java.lang.IllegalStateException: cannot downgrade a node from version [2.16.0] to version [2.15.0]

If you need me to check anything, or need more info, let me know.
Expected Behavior
Alerts should behave as before on 2.15.0.
Current Behavior
Alerts keep triggering with an ever-rising message count.
Possible Solution
Support OpenSearch 2.16.0.
Steps to Reproduce (for bugs)
Context
Alerts are currently unusable. This is the feature in Graylog we use most (alerting us to application issues).
Your Environment