formgeist commented 2 years ago

Summary

The current experience around creating anomaly detection rules for alerts is targeted only for the latency detector, but we have recently added new detectors for the anomaly detection jobs that include throughput and failed transaction rate.

We should decide whether we should include a single anomaly detection rule that will contain all three options as conditions or have separate rule types for each detector.

Solution

Convert the existing latency anomaly rule to a global anomaly rule for all or a single detector(s)

Remove/replace the existing latency anomaly rule with a global anomaly rule
The new anomaly rule takes a condition of all or a single detector, so the user doesn't have to choose individual rule types for the various anomalies that might occur.
By enabling the user to selected a single detector we keep the existing functionality of being able to select e.g. latency as the detector and determining how to alert on any severity level.
By enabling the user to choose actions on all detector anomalies with the same severity level enables them to have fewer rules that serve as a catch-all
Add detector ("metric") option to the conditions of the rule and replace wording around latency anomaly.
Update the default rule name to "Anomaly alert | $service.name"

elasticmachine commented 2 years ago

Pinging @elastic/apm-ui (Team:apm)

alex-fedotyev commented 2 years ago

@formgeist - could you please create a quick mock how alert creation would look like in both of these flows, i.e. a single anomaly detection rule that will contain all three options as conditions or have separate rule types for each detector?

formgeist commented 2 years ago

@alex-fedotyev I've updated the issue description with the two options. I'm heavily leaning towards Solution B because it means the most flexibility in its use and being able to set a global rule for all anomalies that are detected.

@dgieselaar are there any limitations or challenges with either approach?

cc @sqren @chrisdistasio

sorenlouv commented 2 years ago

Solution B sound good to me. Thanks for getting around to this so quickly @formgeist !

chrisdistasio commented 2 years ago

@vinaychandrasekhar do you have any thoughts on this?

vinaychandrasekhar commented 2 years ago

@chrisdistasio thanks for including me in the discussion. @formgeist couple of questions -

Would it make sense to also combine the multiple threshold rules under a single "Create threshold rule" with a second level menu for failed transaction rule, error count rule? Otherwise, it appears confusing (to me) that anomaly rule creation shows no notion of signal or metric (i.e., latency, transaction rate, errors) where as the threshold ones do
Other o11y apps where the alerts and rules drop down is shown seem to follow a pattern more similar to solution A above. Would like to check if there's a top-down pattern consistency already considered across o11y apps?

chrisdistasio commented 2 years ago

+1 @vinaychandrasekhar -- we should use this as an opty to drive consistency in behavior/pattern across the o11y apps.

dgieselaar commented 2 years ago

@formgeist agree w/ @sqren that B sounds like our best option. It'd be great if we can get that in for 8.2, as it fixes the bug (or rather makes that behaviour explicit) where it fires for all detector types.

formgeist commented 2 years ago

Would it make sense to also combine the multiple threshold rules under a single "Create threshold rule" with a second level menu for failed transaction rule, error count rule? Otherwise, it appears confusing (to me) that anomaly rule creation shows no notion of signal or metric (i.e., latency, transaction rate, errors) where as the threshold ones do

We originally intended the rule grouping to reflect the fact that Latency had anomaly and threshold rules types, while the others (throughput and failed transaction rate) only supported threshold based rules. Now that this is no longer the issue, I agree a re-organization makes sense.

Other o11y apps where the alerts and rules drop down is shown seem to follow a pattern more similar to solution A above. Would like to check if there's a top-down pattern consistency already considered across o11y apps?

Agreed, there's an opportunity to review this across Observability too. I've created a separate issue for this.

I would like us to focus on supporting the new anomaly detectors in the upcoming release, so let's narrow the scope down to change the anomaly rule creation. I'll create the necessary ticket(s) so we can include this in our plans.

cc @dannycroft

formgeist commented 2 years ago

Created a related issue to change the structure of rules in the Alerts and rules option for APM https://github.com/elastic/kibana/issues/126757

MiriamAparicio commented 9 months ago

Hi, I'm going to start working on this. For me to understand and to confirm as the issue is quite all We want to add the 'metric' detector to the creation of the rule, keeping as default ALL, and display suggestions as what we have for the type with the 3 different connectors (latency, throughput, or failed transaction rate)

Do we change also this copy? Alert when either the latency, throughput, or failed transaction rate of a service is anomalous. Learn more

cc @formgeist @boriskirov @sqren

boriskirov commented 9 months ago

Yes, that sounds good, adding a detector for the different available metrics and selecting the all by default.

sorenlouv commented 9 months ago

We want to add the 'metric' detector to the creation of the rule, keeping as default ALL, and display suggestions as what we have for the type with the 3 different connectors (latency, throughput, or failed transaction rate)

@MiriamAparicio I suggest making it a checkbox (multiselect). Let's avoid the "All" option and instead default to having all three options pre-selected.

MiriamAparicio commented 9 months ago

@sqren I already have a PR up, and what you suggested would be very different to what we have for all other rules

elastic / kibana

[APM] Change rule creation for anomaly detection detectors (latency, throughput, and failed transaction rate) #126580

Summary

Solution

Convert the existing latency anomaly rule to a global anomaly rule for all or a single detector(s)