Open sorenlouv opened 1 year ago
Pinging @elastic/apm-ui (Team:APM)
@elastic/apm-pm do you think that these should be separate rule types or just the one we currently have that will alert on latency, throughput and failed transaction rate?
@sqren I agree, the name and description are misleading. We have plans to add the Anomaly rule (currently only available in Stack management) to Observability.
Do you know why there is a separate "APM anomaly" rule and how it is different from the one in Stack management? I think as the ML job covers latency, throughput, and failure rates the Anomaly detection rule in Stack management would be able to alert on all three.
> Do you know why there is a separate "APM anomaly" rule and how it is different from the one in Stack management?
I'm not 100% sure, but APM has a rule called "Anomaly" and I see another one under Stack management called "Anomaly detection alert". I assume that's the one you are referring to.
The APM Anomaly rule will break down alerts by service.name, service.environment and transaction.type, similar to how all APM rules work. This means that a user will know exactly which service was anomalous and caused an alert. Furthermore, they can choose to only receive alerts for specific services, environments, etc.
The Stack management "Anomaly detection alert" rule asks the user to select an existing ML job, and then specify a result type. This is much more generic, but it is also quite a bit harder to understand how to use. Furthermore, it's not possible to group or filter by service.name / service.environment / transaction.type afaict.
Pinging @elastic/actionable-observability (Team: Actionable Observability)
@gbamparop - I would suggest that we re-use and extend the capabilities of the existing APM anomaly rule. We should make sure that the threshold settings/ranges are appropriate for the new metrics.
> I would suggest that we re-use and extend the capabilities of the existing APM anomaly rule.
Agree, that's also what I've suggested in the issue description:
"Instead of creating new rules the existing ApmRuleType.Anomaly rule should be updated to also produce alerts for other types of anomalies than latency"
> We should make sure that the threshold settings/ranges are appropriate for the new metrics.
Actually, we don't even need to think about this. The only metric the rule cares about is severity, meaning a severity like "critical" can apply equally to latency anomalies, throughput anomalies and failure rate anomalies.
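To illustrate the point about severity, here is a minimal sketch. All names (`AnomalyKind`, `SEVERITY_THRESHOLD`, `shouldAlert`) are hypothetical and not taken from the Kibana code base, and the score thresholds are assumptions used only for illustration. The check never looks at which kind of anomaly it is evaluating, which is why no per-metric threshold tuning would be needed:

```typescript
// Hypothetical sketch, not the Kibana implementation.
type AnomalyKind = 'latency' | 'throughput' | 'failureRate';
type Severity = 'warning' | 'minor' | 'major' | 'critical';

// Assumed minimum anomaly score for each severity (illustrative values).
const SEVERITY_THRESHOLD: Record<Severity, number> = {
  warning: 3,
  minor: 25,
  major: 50,
  critical: 75,
};

interface AnomalyRecord {
  kind: AnomalyKind;
  serviceName: string;
  score: number; // anomaly score in the range 0-100
}

// The decision depends only on the score, never on `record.kind`.
function shouldAlert(record: AnomalyRecord, minSeverity: Severity): boolean {
  return record.score >= SEVERITY_THRESHOLD[minSeverity];
}

const throughputAnomaly: AnomalyRecord = {
  kind: 'throughput',
  serviceName: 'checkout',
  score: 80,
};
console.log(shouldAlert(throughputAnomaly, 'critical')); // true
```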
Today we allow users to create anomaly detection jobs (ML Jobs) which will produce anomaly results for latency, throughput and failure rates. Users can create rules and be alerted when there are anomalies for latency but they have no way of doing the same for throughput and failure rate anomalies.
There is a rule called ApmRuleType.Anomaly. Its user-facing description is quite misleading because the rule does not in fact produce alerts for throughput or failed transaction rate, only for latency, as can be seen in the terms filter below:
https://github.com/elastic/kibana/blob/7890be623cd26dd10b4b42e1f9ed7b1f112e97da/x-pack/plugins/apm/server/routes/alerts/rule_types/anomaly/register_anomaly_rule_type.ts#L172-L175
Solution
It should be possible to receive alerts for throughput and failure rate anomalies. Instead of creating new rules, the existing ApmRuleType.Anomaly rule should be updated to also produce alerts for other types of anomalies than latency.
Related enhancement request: https://github.com/elastic/enhancements/issues/12409 (internal)
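As a rough sketch of what the proposed solution could look like, the filter that today only matches latency could be built from a list of selected anomaly types instead. Everything here is an assumption for illustration: the `AnomalyDetectorKind` type, the `DETECTOR_INDEX` mapping and the `buildDetectorFilter` helper are hypothetical, and the actual field and values used in `register_anomaly_rule_type.ts` may differ.

```typescript
// Hypothetical sketch, not the code in register_anomaly_rule_type.ts.
type AnomalyDetectorKind = 'latency' | 'throughput' | 'failureRate';

// Assumed mapping from anomaly type to the detector position in the ML job.
const DETECTOR_INDEX: Record<AnomalyDetectorKind, number> = {
  latency: 0,
  throughput: 1,
  failureRate: 2,
};

// Build an Elasticsearch `terms` filter that matches every selected type
// rather than latency only.
function buildDetectorFilter(kinds: AnomalyDetectorKind[]) {
  return {
    terms: {
      detector_index: kinds.map((kind) => DETECTOR_INDEX[kind]),
    },
  };
}

// Alert on all three anomaly types.
const filter = buildDetectorFilter(['latency', 'throughput', 'failureRate']);
console.log(JSON.stringify(filter));
// {"terms":{"detector_index":[0,1,2]}}
```

Passing only `['latency']` would reproduce the current behavior, so existing rules could keep working unchanged under this approach.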