Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
We should take another look at the APM-ization we did a while back and see if it still provides useful information. I believe we wrapped tasks as APM transactions, so we get pretty good data from that: counts/durations per task and, indirectly, per alert/action (since each alert and action has its own task type).
Even though APM is not officially "on" today in Kibana, some day it will be, and presumably before it is "official" there may be some sneaky way to enable it. And on-prem customers could use it today. So we should make sure what we have in there today is going to be useful.
I'd like to advocate for integrating into Stack Monitoring in two ways:
1) Leverage the collection and shipping mechanisms that currently exist in Stack Monitoring to allow users to ship monitoring metrics and logs surrounding task manager/rules/connectors to different cluster(s)
2) Leverage the existing Stack Monitoring UI, specifically the Kibana monitoring UI, to visualize performance metrics for task manager/rules/connectors. The current hierarchy is very broad, but we could add grouping by rule/connector type and show specific UIs for that data, similarly to how the Stack Monitoring UI handles things like ML and CCR monitoring.
I know the Stack Monitoring UI is in flux and the future is unclear, but it feels like the most straightforward path, as the UI and shipping/collection mechanisms are proven to work for users and reinventing that wheel will take time.
The event log can provide some pretty good data in terms of counts/durations of actions and alerts, but doesn't currently track any other task manager-related tasks. The index is read-only and considered a Kibana system index because of its `.kibana-` prefix, but assuming it's straightforward to manually grant read privileges on this index to users, it's then easy to create an index pattern for it and use it in Discover and Lens.
I just played with it again, and one of the problems with the current event log structure is that the saved object references are not usable within Lens, presumably because they are a nested field. I'm wondering if we can "runtime" them into fields that Lens could see? Otherwise, we can't get any rule-specific breakdowns.
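To make that idea concrete, here's a rough sketch of a search-time runtime field that flattens the rule id out of the nested `kibana.saved_objects` references so it can be aggregated on. The field name `rule_id` and the script are purely illustrative, not an agreed design; the same script could also be registered as a runtime field on the index pattern so Lens can see it:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Illustrative sketch using the 7.x @elastic/elasticsearch client: expose the
// rule id from the nested saved object references as a flat keyword via a
// search-time runtime field, then count events per rule.
async function countEventsPerRule() {
  const { body } = await client.search({
    index: '.kibana-event-log-*',
    body: {
      size: 0,
      runtime_mappings: {
        rule_id: {
          type: 'keyword',
          script: {
            source: `
              def kibana = params._source.kibana;
              if (kibana != null && kibana.saved_objects != null) {
                for (def ref : kibana.saved_objects) {
                  if (ref.type == 'alert') { emit(ref.id); }
                }
              }
            `,
          },
        },
      },
      aggs: {
        per_rule: { terms: { field: 'rule_id', size: 100 } },
      },
    },
  });
  return body.aggregations.per_rule.buckets;
}
```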
The last set of changes for the event_log introduced the top-level `rule` property, which is a great place where we could put the rule-specific information: https://github.com/elastic/kibana/blob/8810e8484c46fd63b05a16950016aa1992a1509b/x-pack/plugins/event_log/generated/mappings.json#L173-L219
We'll need to figure out how to map all the goodies though - we'll want the rule type id and the rule id available, at a minimum. There are some related issues already open for this: https://github.com/elastic/kibana/issues/94137 and https://github.com/elastic/kibana/issues/95411
This won't help for actions though. I'd say if we can solve accessing the nested fields in Lens with runtime fields, let's go with that; otherwise, we can provide a property `connector` or such under the existing custom `kibana` field to add the connector id, space, etc.
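As a strawman for the shape being discussed here (all field names and values below are placeholders, not a settled mapping), an action execution event might end up carrying something like:

```ts
// Hypothetical event shape: the populated top-level `rule` and the
// `kibana.connector` property are the proposal above, not what the event log
// writes today.
const exampleActionExecuteEvent = {
  '@timestamp': '2021-05-20T12:00:00.000Z',
  event: { provider: 'actions', action: 'execute', outcome: 'success' },
  rule: {
    id: 'b1c2d3d4-...',           // placeholder rule id
    category: '.index-threshold', // placeholder rule type id
  },
  kibana: {
    connector: {
      id: 'a1b2c3d4-...', // placeholder connector id
      space_id: 'default', // placeholder space
    },
  },
};
```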
Existing known issues regarding improving event log for diagnosis (just added to the list at the top of this issue):
To ensure users have the right tools and visibility to understand and react to the health and performance of the alerting framework (including task manager), we need to integrate into existing infrastructure and tooling provided to users of the stack (specifically thinking Stack Monitoring and/or APM). This will ensure we do not need to reinvent the wheel when it comes to how to collect/store/visualize metrics and logs, but rather help tell a holistic story for users on where they go and what they do to diagnose/detect/fix performance issues with their stack. We want users to go where they currently go to solve performance issues with Kibana
(Will the `nested` fields mean we aren't able to use Lens to visualize the data? It looks like we wouldn't be able to.) https://github.com/elastic/kibana/issues/100676

I'm working on proving how this might integrate into Stack Monitoring in https://github.com/elastic/kibana/pull/99877
Hey @arisonl ,
From your perspective, what would be some helpful end-user outputs from this effort, most likely in terms of specific charts?
I solicited some high-level input from @gmmorris and @pmuellr about which general metrics would be helpful, and they said:
Drift for sure, ideally with granularity at rule type level. Execution Duration, as we find the long running ones are often the ones causing trouble. Failure rate by rule type.
Drift, for an overall "are there a lot of tasks queued" signal; execution duration, for how fast things are actually executing.
I'm hoping to translate these into visualizations which will help in shaping and mapping the data properly. Do you have any thoughts on how to best represent these? Or perhaps you have additional thoughts on which metrics to collect too?
I imagine most of these will leverage line charts over time, but we could and should think about what each line chart could contain (like a drift line chart that has four lines representing p50, p90, p95, p99 drift, etc)
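For the drift chart specifically, something like the following aggregation could back a p50/p90/p95/p99-over-time visualization. This is only a sketch: the `task-metrics-*` index and the `task.drift_ms` field are assumptions, since task manager doesn't currently ship such a metric, and producing it is part of the work being discussed here:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Sketch using the 7.x @elastic/elasticsearch client: percentile drift over
// time, bucketed per hour. `task-metrics-*` and `task.drift_ms` are
// hypothetical names.
async function driftPercentilesOverTime() {
  const { body } = await client.search({
    index: 'task-metrics-*',
    body: {
      size: 0,
      query: { range: { '@timestamp': { gte: 'now-24h' } } },
      aggs: {
        over_time: {
          date_histogram: { field: '@timestamp', fixed_interval: '1h' },
          aggs: {
            drift: {
              percentiles: { field: 'task.drift_ms', percents: [50, 90, 95, 99] },
            },
          },
        },
      },
    },
  });
  return body.aggregations.over_time.buckets;
}
```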
After spending more time working on this, I think we need to change the proposal somewhat drastically.
In the short term, we need to focus on delivering value in terms of helping to support users through various issues with alerting and/or task manager. I think this starts with reusing existing tooling. The support team has an existing tool, support diagnostics, that takes a snapshot of the state of the cluster (in the form of calling a bunch of APIs). The support team uses this tool with nearly every case that comes in, and its usage can be slightly adapted to include Kibana metrics as well (it might already do this by default).
We can deliver value by enhancing the data the tool already collects and by adding more data points for collection: specifically, enhancing the event log and then adding queries for the tool to run against the event log to capture important data points.
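As an example of the kind of query the diagnostics tool could run against the event log, here's a sketch that summarizes rule execution outcomes and average duration over the last day. The `event.provider`/`event.action`/`event.outcome`/`event.duration` fields exist in the event log today (with `event.duration` recorded in nanoseconds); finer-grained breakdowns per rule depend on the mapping work discussed above:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Sketch using the 7.x @elastic/elasticsearch client: rule execution outcome
// counts and average duration (nanoseconds) over the last 24 hours.
async function ruleExecutionSummary() {
  const { body } = await client.search({
    index: '.kibana-event-log-*',
    body: {
      size: 0,
      query: {
        bool: {
          filter: [
            { term: { 'event.provider': 'alerting' } },
            { term: { 'event.action': 'execute' } },
            { range: { '@timestamp': { gte: 'now-24h' } } },
          ],
        },
      },
      aggs: {
        by_outcome: {
          terms: { field: 'event.outcome' },
          aggs: {
            avg_duration_ns: { avg: { field: 'event.duration' } },
          },
        },
      },
    },
  });
  return body.aggregations.by_outcome.buckets;
}
```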
In the long term, we will use the short-term gains to help shape what our long-term solution looks like. I still feel confident an integration with Stack Monitoring is the best route, but we need more time to flesh out exactly what we should collect before attempting it.
After even MORE time thinking this through, the goal is to identify data points that are currently impossible to know when supporting users. We had a meeting where we identified and discussed these, and they should be the focus of the first release of this effort. I'm going to list problem statements with a brief description and suggested remedial steps at the end.
However, the current thinking is that the initial delivery on this effort will only involve action items that solve the below problems. The assumption is that once these problems are resolved, we will be in a much better state to help users.
Once we feel confident that enough (as many as possible) of these "impossibles" are solved, it makes sense to pivot this thinking into how to deliver a better experience for us and for our users, giving them the necessary insight into the health and performance of task manager and the alerting system. For now, I will not spend time thinking through the next steps here, to ensure we focus on the value we need to ship in the initial delivery.
We have the ability to see health metrics, but not necessarily when we need to see them (when the problem occurs). This is most apparent when the issue is intermittent and isn't noticed at first.
To combat this, we have a couple options:
We do not currently write to the event log when a rule starts execution (only when it finishes), so it's not possible for us to stitch together the timeline of a rule execution and tell whether a rule is starting and not finishing, or something else is going on.
To combat this, we should write to the event log when a rule begins execution
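Once such a start event exists (calling it `execute-start` here purely as a placeholder name), a query along these lines could flag rules that started but never logged a completion; note the per-rule breakdown also assumes `rule.id` gets populated per the issues linked earlier:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Sketch using the 7.x @elastic/elasticsearch client: compare start events
// against completion events per rule over the last hour and report rules with
// more starts than completions.
async function findRulesThatNeverFinished() {
  const { body } = await client.search({
    index: '.kibana-event-log-*',
    body: {
      size: 0,
      query: {
        bool: {
          filter: [
            { term: { 'event.provider': 'alerting' } },
            { range: { '@timestamp': { gte: 'now-1h' } } },
          ],
        },
      },
      aggs: {
        by_rule: {
          terms: { field: 'rule.id', size: 1000 },
          aggs: {
            starts: { filter: { term: { 'event.action': 'execute-start' } } },
            completions: { filter: { term: { 'event.action': 'execute' } } },
          },
        },
      },
    },
  });
  for (const bucket of body.aggregations.by_rule.buckets) {
    if (bucket.starts.doc_count > bucket.completions.doc_count) {
      console.log(
        `rule ${bucket.key}: ${bucket.starts.doc_count} starts, ${bucket.completions.doc_count} completions`
      );
    }
  }
}
```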
We've run into numerous issues with misconfiguration of a Kibana instance, and sometimes that instance is missed when looking at the infrastructure. This is primarily because we have no reliable way to know how many Kibana instances are talking to a single Elasticsearch cluster.
To combat this, we need to learn more about our available tools. I think the best way to handle this is to rely on Stack Monitoring which should tell us how many Kibanas are reporting to a single Elasticsearch cluster. Kibana monitoring is on by default, as long as monitoring is enabled on the cluster, which should give us valuable insight into the infrastructure. Once we have the full list, we should be able to quickly identify misconfigurations, such as different encryption keys used on Kibanas that talk to the same Elasticsearch cluster.
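Assuming legacy collection is writing `kibana_stats` documents into `.monitoring-kibana-*` (field names differ somewhat under Metricbeat collection), a quick way to enumerate the Kibana instances reporting to a cluster might look like this:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Sketch using the 7.x @elastic/elasticsearch client: list the distinct Kibana
// instances that reported monitoring data in the last hour.
async function listKibanaInstances() {
  const { body } = await client.search({
    index: '.monitoring-kibana-*',
    body: {
      size: 0,
      query: { range: { timestamp: { gte: 'now-1h' } } },
      aggs: {
        instances: {
          terms: { field: 'kibana_stats.kibana.uuid', size: 100 },
          aggs: {
            latest: {
              top_hits: {
                size: 1,
                _source: ['kibana_stats.kibana.name', 'kibana_stats.kibana.host'],
              },
            },
          },
        },
      },
    },
  });
  return body.aggregations.instances.buckets;
}
```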
cc @elastic/kibana-alerting-services to verify this list feels complete based on the conversations the past two days
For 7.14, we are aiming to deliver:
We are confident these two items (in addition to internal training around existing tools/solutions) will help us answer impossible questions with customer issues, such as "why was my action delayed by 5 minutes at this time yesterday?" and "why didn't my rule run at this time?"
Thanks @chrisronline
As this Epic is being worked on across multiple streams, I feel it's worth taking stock of what we have decided to prioritise and why.
As Chris noted above, in order to deliver on the success criteria stated for this Epic we decided to focus on problems that are currently either "impossible" to resolve in customer deployments, or at least extremely difficult.
With that in mind we took stock of these sorts of problems, as identified by our root cause analysis of past support cases, and prioritised the following issues:
These issues have been merged and, barring any unexpected defects, are aimed at inclusion in the nearest possible minor release.
Issue | Title | Why have we prioritised this |
---|---|---|
#98625 | [Task Manager] Health Metrics capacity estimation enhancements | Should dramatically reduce the time spent diagnosing scalability issues in the Alerting Framework |
#95411 | [event log] add rule type id in custom kibana.alerting field | Should dramatically reduce the time spent correlating rule events with specific rule types |
#94137 | [event log] populate rule.* ECS fields for alert events | Aligns our events with those produced by the Security and Observability solutions, improving our ability to correlate issues across products in the stack |
#93704 | [discuss] extending event log for faster/easier access to active instance date information | Enables us to correlate active alerts to actions, by including the activity duration in the Event Log |
#96683 | [alerts] http 500 response when creating an alert using an API key has the http authorization | Adds more context around API failures that are related to the use of API keys (This is more about automation than Observability, but it should help our support efforts, so it feels related and worth noting) |
#98729 | Alerting docs are missing an example to list the top rules that are executing slowly | Docs changes enable customers to help themselves, freeing us up to respond to other customers with more complex issues |
#99160 | Improve Task Manager instrumentation | (Experimental ⚠️ ) Enables us to use Elastic APM to trace issues across Rules and Actions 🎉 [for the record, this was an initiative by the awesome @dgieselaar ] |
#98624 | [Task Manager] Workload aggregation caps out at 10k tasks | Adds more detailed information into our health monitoring, giving us a full picture of the workload system |
#98796 | Status page returns 503 when alerting cannot find its health task in Task Manager | This was a bug which caused our health monitoring to be unreliable at certain points, felt like we had to address this asap |
#99930 | Alerting health API only considers rules in the default space | Same as above - a bug that impacted our health monitoring and priority wasn't even up for debate :) |
#101227 | [alerting] log warning when alert tasks are disabled due to saved object not found | Adds more detailed information into why a rule task might have failed. At the moment we don't actually know when a missing SO has caused a task to fail. |
#87055 | Issues with alert statuses UI | UX improvements that we hope will enable customers to help themselves, freeing us up to respond to other customers with more complex issues |
https://github.com/elastic/kibana/issues/101505 | Gain insight into task manager health apis when a problem occurs | Enables us to correlate between Task Manager health stats and issues identified by customers, as long as they have debug logging enabled at the time |
https://github.com/elastic/kibana/issues/101507 | Improve event log data to include when the rule execution starts | Enables us to correlate between rule execution and issues identified by customers |
#99225 | [event log] add rule to event log shared object for provider: actions action: execute log event | Should dramatically reduce the time spent identifying the root cause of action failures |
These issues are aimed for inclusion in the nearest possible minor release, but this depends on progress made by feature freeze.
Issue | Title | Why have we prioritised this |
---|---|---|
Per https://github.com/elastic/kibana/issues/98902#issuecomment-860851209, we shipped everything we aimed to ship for the first phase of this effort, so I'm closing this ticket.
Epic Name
Observability of the alerting framework phase 1
Background
The RAC initiative will drastically increase the adoption of alerting. With an increase in adoption, there will also be an increase in rules the alerting framework will handle. This increase can cause the overall alerting framework to behave in unexpected ways, and it currently takes a lot of steps to identify the root cause.
User Story / Problem Statement(s)
As a Kibana administrator, I can quickly identify root causes when the alerting framework isn't behaving properly. As a Kibana developer, I can gain insight into the performance impact my rule type has.
Success Criteria
An initial step at reduced times for Kibana administrators to find root causes of framework misbehaviour. An initial step at providing insights to developers about their rule types.
Proposal
See https://github.com/elastic/kibana/issues/98902#issuecomment-855003936
The agreed upon proposal from the above comment yielded these two tickets:
These issues should be considered part of this effort, as they will help tell a better performance story from an event log perspective:
Related issues