Open damianharouff opened 2 months ago
If this gets rolled into the Audit Log, it'd be nice to have the duration searchable (e.g. an audit log query like `execution_time_ms:>15000`), since the "slow" threshold may change when troubleshooting. I think we've brought this up before, but I can't find the issue/case.
The audit log is an idea, although until https://github.com/Graylog2/graylog-plugin-enterprise/issues/7098 (Audit log entries into configurable stream) is implemented, there would be no way to create events, and therefore alerts, from that data.
Further to the new ability to cancel long-running user searches (https://github.com/Graylog2/graylog2-server/pull/18308), a toggle-able option should be available to log slow searches, along with the user who triggered each one.
This will be helpful for a busy Graylog installation where many users are running many searches, and the Graylog admin may want to ensure that users are not creating unreasonable searches that impact search cluster performance, e.g. log searches that take longer than 30 seconds to complete, before canceling at 60 seconds.
This stems from a situation encountered by a strategic customer where they have awareness that a user search is impacting their search cluster, but have no ability to understand which specific query is causing this without asking users currently logged into their system, and this may be hundreds of searches at a time. They have to assess each and every one of them manually, which is very time consuming. ZD 940 has more details too.
It's understood that OpenSearch has functionality to log (or take further action on) slow queries, but what gets logged is the Graylog-computed query sent to OpenSearch, which carries no information about who executed it. Graylog, on the other hand, can provide this information to the Graylog admin.
Understanding that we may not want to log the query itself due to concerns like the size of the query, or that it may contain sensitive information, a log entry as simple as "Search job 66e9b58b6a00143e28d8bbde started by user damian took more than `search_slow_seconds` to complete" would be very impactful for a Graylog administrator to investigate further.
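
To make the requested behavior concrete, here is a rough sketch of the threshold check and the resulting log line. It is illustrative Python only, not Graylog code: the function names, the logger name, and the convention that `None` means the option is toggled off are all assumptions, and `search_slow_seconds` is just the placeholder name from the example message above.

```python
import logging
from typing import Optional

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("slow_search")


def slow_search_message(
    job_id: str,
    username: str,
    elapsed_seconds: float,
    search_slow_seconds: Optional[float],
) -> Optional[str]:
    """Build the slow-search log line, or return None if nothing should be logged.

    Assumption: search_slow_seconds=None means the option is toggled off.
    The query text itself is deliberately left out of the message.
    """
    if search_slow_seconds is None:
        return None
    if elapsed_seconds <= search_slow_seconds:
        return None
    return (
        f"Search job {job_id} started by user {username} "
        f"took more than {search_slow_seconds:g} seconds to complete"
    )


def report_search(
    job_id: str,
    username: str,
    elapsed_seconds: float,
    search_slow_seconds: Optional[float],
) -> Optional[str]:
    """Log a warning for a slow search and return the message that was logged."""
    message = slow_search_message(job_id, username, elapsed_seconds, search_slow_seconds)
    if message is not None:
        logger.warning(message)
    return message
```

With a threshold of 30, a search by `damian` that ran for 42 seconds would produce a warning, while a 12 second search would produce nothing. Only the job ID and the user are logged, so the admin can look up the search afterwards without the query ending up in the log.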