elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.69k stars 8.12k forks source link

[APM] Change rule creation for anomaly detection detectors (latency, throughput, and failed transaction rate) #126580

Closed formgeist closed 8 months ago

formgeist commented 2 years ago

Summary

The current experience around creating anomaly detection rules for alerts is targeted only for the latency detector, but we have recently added new detectors for the anomaly detection jobs that include throughput and failed transaction rate.

We should decide whether we should include a single anomaly detection rule that will contain all three options as conditions or have separate rule types for each detector.

CleanShot 2022-03-01 at 13 46 43@2x

Solution

Convert the existing latency anomaly rule to a global anomaly rule for all or a single detector(s)

CleanShot 2022-03-02 at 11 03 18@2x CleanShot 2022-03-02 at 11 03 49@2x CleanShot 2022-03-02 at 11 44 24@2x
elasticmachine commented 2 years ago

Pinging @elastic/apm-ui (Team:apm)

alex-fedotyev commented 2 years ago

@formgeist - could you please create a quick mock how alert creation would look like in both of these flows, i.e. a single anomaly detection rule that will contain all three options as conditions or have separate rule types for each detector?

formgeist commented 2 years ago

@alex-fedotyev I've updated the issue description with the two options. I'm heavily leaning towards Solution B because it means the most flexibility in its use and being able to set a global rule for all anomalies that are detected.

@dgieselaar are there any limitations or challenges with either approach?

cc @sqren @chrisdistasio

sorenlouv commented 2 years ago

Solution B sound good to me. Thanks for getting around to this so quickly @formgeist !

chrisdistasio commented 2 years ago

@vinaychandrasekhar do you have any thoughts on this?

vinaychandrasekhar commented 2 years ago

@chrisdistasio thanks for including me in the discussion. @formgeist couple of questions -

chrisdistasio commented 2 years ago

+1 @vinaychandrasekhar -- we should use this as an opty to drive consistency in behavior/pattern across the o11y apps.

dgieselaar commented 2 years ago

@formgeist agree w/ @sqren that B sounds like our best option. It'd be great if we can get that in for 8.2, as it fixes the bug (or rather makes that behaviour explicit) where it fires for all detector types.

formgeist commented 2 years ago

Would it make sense to also combine the multiple threshold rules under a single "Create threshold rule" with a second level menu for failed transaction rule, error count rule? Otherwise, it appears confusing (to me) that anomaly rule creation shows no notion of signal or metric (i.e., latency, transaction rate, errors) where as the threshold ones do

We originally intended the rule grouping to reflect the fact that Latency had anomaly and threshold rules types, while the others (throughput and failed transaction rate) only supported threshold based rules. Now that this is no longer the issue, I agree a re-organization makes sense.

Other o11y apps where the alerts and rules drop down is shown seem to follow a pattern more similar to solution A above. Would like to check if there's a top-down pattern consistency already considered across o11y apps?

Agreed, there's an opportunity to review this across Observability too. I've created a separate issue for this.

I would like us to focus on supporting the new anomaly detectors in the upcoming release, so let's narrow the scope down to change the anomaly rule creation. I'll create the necessary ticket(s) so we can include this in our plans.

cc @dannycroft

formgeist commented 2 years ago

Created a related issue to change the structure of rules in the Alerts and rules option for APM https://github.com/elastic/kibana/issues/126757

MiriamAparicio commented 9 months ago

Hi, I'm going to start working on this. For me to understand and to confirm as the issue is quite all We want to add the 'metric' detector to the creation of the rule, keeping as default ALL, and display suggestions as what we have for the type with the 3 different connectors (latency, throughput, or failed transaction rate)

image

Do we change also this copy? Alert when either the latency, throughput, or failed transaction rate of a service is anomalous. Learn more

cc @formgeist @boriskirov @sqren

boriskirov commented 9 months ago

Yes, that sounds good, adding a detector for the different available metrics and selecting the all by default.

sorenlouv commented 9 months ago

We want to add the 'metric' detector to the creation of the rule, keeping as default ALL, and display suggestions as what we have for the type with the 3 different connectors (latency, throughput, or failed transaction rate)

@MiriamAparicio I suggest making it a checkbox (multiselect). Let's avoid the "All" option and instead default to having all three options pre-selected.

MiriamAparicio commented 9 months ago

@sqren I already have a PR up, and what you suggested would be very different to what we have for all other rules