[APM] Alerts for throughput and failure rate anomalies

elastic / kibana

[APM] Alerts for throughput and failure rate anomalies #159288

Open sorenlouv opened 1 year ago

sorenlouv commented 1 year ago

Today we allow users to create anomaly detection jobs (ML jobs) that produce anomaly results for latency, throughput, and failure rate. Users can create rules and be alerted on latency anomalies, but they have no way of doing the same for throughput and failure rate anomalies.

There is a rule called ApmRuleType.Anomaly, and the user-facing description for this rule is:

Alert when either the latency, throughput, or failed transaction rate of a service is anomalous.

This is quite misleading because the rule does not in fact produce alerts for throughput or failed transaction rate anomalies. It only covers latency, as can be seen in the terms filter below:

https://github.com/elastic/kibana/blob/7890be623cd26dd10b4b42e1f9ed7b1f112e97da/x-pack/plugins/apm/server/routes/alerts/rule_types/anomaly/register_anomaly_rule_type.ts#L172-L175
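To illustrate the kind of change this implies, here is a minimal sketch, not the actual Kibana code. It assumes the rule selects detectors with an Elasticsearch `terms` filter on the `detector_index` field of the ML results. The type `DetectorType`, the mapping `DETECTOR_INDEX` (including its index values), and the function `detectorFilter` are hypothetical names introduced only for this example.

```typescript
// Hypothetical names throughout; none of these are taken from the Kibana source.
type DetectorType = 'latency' | 'throughput' | 'failureRate';

// Assumed mapping from detector type to its detector_index in the ML job.
const DETECTOR_INDEX: Record<DetectorType, number> = {
  latency: 0,
  throughput: 1,
  failureRate: 2,
};

// Build an Elasticsearch `terms` filter that matches any of the given detectors.
function detectorFilter(types: DetectorType[]) {
  return { terms: { detector_index: types.map((t) => DETECTOR_INDEX[t]) } };
}

// Roughly today's behavior: only latency anomalies pass the filter.
const latencyOnly = detectorFilter(['latency']);

// Proposed behavior: all three anomaly types pass the filter.
const allDetectors = detectorFilter(['latency', 'throughput', 'failureRate']);
```

Under these assumptions, widening the list of detector types passed to the filter is the only change needed at the query level.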

Solution

It should be possible to receive alerts for throughput and failure rate anomalies. Instead of creating new rules the existing ApmRuleType.Anomaly rule should be updated to also produce alerts for other types of anomalies than latency.

Related enhancement request: https://github.com/elastic/enhancements/issues/12409 (internal)

elasticmachine commented 1 year ago

Pinging @elastic/apm-ui (Team:APM)

gbamparop commented 1 year ago

@elastic/apm-pm Do you think these should be separate rule types, or should the one rule type we currently have alert on latency, throughput and failed transaction rate?

katrin-freihofner commented 1 year ago

@sqren I agree, the name and description are misleading. We have plans to add the Anomaly rule (currently only available in Stack management) to Observability.

Do you know why there is a separate "APM anomaly" rule and how it is different from the one in Stack management? I think that, since the ML job covers latency, throughput, and failure rates, the Anomaly detection rule in Stack management would be able to alert on all three.

sorenlouv commented 1 year ago

"Do you know why there is a separate "APM anomaly" rule and how it is different from the one in Stack management?"

I'm not 100% sure, but APM has a rule called "Anomaly" and I see another one under Stack management called "Anomaly detection alert". I assume that's the one you are referring to.

APM: Anomaly rule

The APM Anomaly rule will break down alerts by service.name, service.environment and transaction.type, similar to how all APM rules work. This means that a user will know exactly which service was anomalous and caused an alert. Furthermore, they can choose to only receive alerts for specific services, environments, etc.
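As a rough illustration of that breakdown (a sketch under assumptions, not the rule's real implementation), anomalies can be grouped by the three fields so that each group yields its own alert. The `AnomalyRecord` shape and the helpers `groupKey` and `maxScoreByGroup` are hypothetical.

```typescript
// Hypothetical record shape; the real ML result documents have more fields.
interface AnomalyRecord {
  serviceName: string;
  environment: string;
  transactionType: string;
  score: number;
}

// One alert group per service.name / service.environment / transaction.type.
function groupKey(r: AnomalyRecord): string {
  return [r.serviceName, r.environment, r.transactionType].join('|');
}

// Keep the highest anomaly score seen for each group.
function maxScoreByGroup(records: AnomalyRecord[]): Map<string, number> {
  const result = new Map<string, number>();
  for (const r of records) {
    const key = groupKey(r);
    result.set(key, Math.max(result.get(key) ?? 0, r.score));
  }
  return result;
}
```

Because every alert carries its group key, the user can see which service, environment and transaction type triggered it.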

Machine Learning: "Anomaly detection alert"

This rule asks the user to select an existing ML job, and then specify a result type. This is much more generic, but it is also quite a bit harder to understand how to use. Furthermore, it's not possible to group or filter by service.name / service.environment / transaction.type afaict.

elasticmachine commented 1 year ago

Pinging @elastic/actionable-observability (Team: Actionable Observability)

akhileshpok commented 1 year ago

@gbamparop - I would suggest that we re-use and extend the capabilities of the existing APM anomaly rule. We should make sure that the threshold settings/ranges are appropriate for the new metrics.

sorenlouv commented 1 year ago

"I would suggest that we re-use and extend the capabilities of the existing APM anomaly rule."

Agree, that's also what I've suggested in the issue description:

"Instead of creating new rules the existing ApmRuleType.Anomaly rule should be updated to also produce alerts for other types of anomalies than latency"

"We should make sure that the threshold settings/ranges are appropriate for the new metrics."

Actually, we don't even need to think about this. The only metric the rule cares about is severity, meaning a severity like "critical" can apply equally to latency anomalies, throughput anomalies and failure rate anomalies.
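A small sketch of why the threshold is independent of the metric: the check compares an anomaly score against the minimum score for the selected severity and never looks at the detector type. The score cut-offs below (75, 50, 25, 3) are assumed values for illustration, and `SEVERITY_MIN_SCORE` and `shouldAlert` are hypothetical names.

```typescript
type Severity = 'critical' | 'major' | 'minor' | 'warning';

// Assumed minimum anomaly score per severity level.
const SEVERITY_MIN_SCORE: Record<Severity, number> = {
  critical: 75,
  major: 50,
  minor: 25,
  warning: 3,
};

// The anomaly type (latency, throughput, failure rate) is not an input here:
// only the score and the severity selected on the rule matter.
function shouldAlert(score: number, selected: Severity): boolean {
  return score >= SEVERITY_MIN_SCORE[selected];
}
```

With this shape, a rule configured for "critical" treats a score of 80 the same way whether it came from a latency, throughput or failure rate detector.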