Summary
We've had issues for a while with the CPU Usage rule causing false negatives. An attempt was made to correct this, but it did three things:
a) Introduced new error cases that generate noise
b) Highlighted areas where the data model makes it hard to evaluate the rule state
c) Showed that the domain is more complex than the rule currently accounts for (such as autoscaling, or non-container cgroup environments)
The formula itself is simple, but to be precise it relies on the limit being fixed, which isn't always the case. If we apply a single limit across two segments where the limit has actually changed, we will either underreport or overreport the usage; for example, if the limit doubles halfway through the lookback window but we keep dividing by the old limit, we overstate usage for the second half. Depending on the range the rule is looking at, that could be very bad. This means we need to be able to query for all the "limit segments" within the lookback window, calculate the average per segment, and then take the average across segments, as in the sketch below. A more sophisticated follow-up would be to have the app mark in the UI where these limits change.
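A minimal sketch of that segmented calculation, assuming samples arrive sorted by time and each sample carries the limit that was in effect when it was taken; the `CpuSample` shape and field names are hypothetical, not the actual data model:

```ts
interface CpuSample {
  timestamp: number;  // samples assumed sorted ascending by time
  usageNanos: number; // CPU used in the sample interval (hypothetical field)
  limitNanos: number; // CPU limit in effect for the sample (hypothetical field)
}

// Partition the lookback window into runs of samples that share a limit,
// average usage/limit within each run, then average across the runs.
function averageAcrossLimitSegments(samples: CpuSample[]): number {
  if (samples.length === 0) return NaN;

  const segments: CpuSample[][] = [];
  for (const sample of samples) {
    const current = segments[segments.length - 1];
    if (current && current[0].limitNanos === sample.limitNanos) {
      current.push(sample);
    } else {
      // Limit changed: start a new segment.
      segments.push([sample]);
    }
  }

  const segmentAverages = segments.map((segment) => {
    const totalUsage = segment.reduce((sum, s) => sum + s.usageNanos, 0);
    const totalLimit = segment.reduce((sum, s) => sum + s.limitNanos, 0);
    return totalUsage / totalLimit;
  });

  return (
    segmentAverages.reduce((sum, avg) => sum + avg, 0) / segmentAverages.length
  );
}
```

Averaging per segment first means each stretch of time is evaluated against the limit that was actually in force, instead of smearing one limit across the whole window.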
As for the noise, we made some assumptions about where cgroups are used, namely only in containerized environments (that's how the code is currently worded), but this isn't true: cgroups are widely used in other setups, so the rule needs to accommodate that. Ideally we would also be able to alert on both cgroup and non-cgroup setups with the same rule, so people can run mixed environments.
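One possible shape for that, sketched with placeholder field names (these are illustrative, not the real Metricbeat/Elastic Agent schema): decide per host which metric set to evaluate, rather than assuming the presence of cgroup data implies a container.

```ts
interface HostMetrics {
  hasCgroupMetrics: boolean; // whether cgroup CPU accounting fields are present
  cgroupUsagePct?: number;   // usage relative to the cgroup limit (placeholder)
  systemUsagePct?: number;   // normalized whole-system CPU usage (placeholder)
}

// Prefer cgroup-relative usage when the host reports it, and fall back to
// system-level CPU otherwise, so one rule can cover a mixed environment.
function resolveCpuUsage(host: HostMetrics): number | undefined {
  if (host.hasCgroupMetrics && host.cgroupUsagePct !== undefined) {
    return host.cgroupUsagePct;
  }
  return host.systemUsagePct;
}
```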
In addition to changing how we resolve the results of the query, we also need better tools for when the rule executor faces issues it cannot work around and that leave the rule "broken". Ideally these would not trigger on the first failure but would kick in after a few repeated failures. Further, these actions need to be separate from the normal rule actions, so that users can decide whether such a failure should ping an on-call SRE or not.
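A rough sketch of that escalation idea, with hypothetical names: count consecutive executor failures and fire a dedicated "broken rule" notification only once the threshold is crossed, through a channel separate from the rule's normal actions.

```ts
interface BrokenRuleNotifier {
  // Separate channel from normal rule actions, so users choose who it pages.
  notify(reason: string): void;
}

class ExecutorFailureTracker {
  private consecutiveFailures = 0;

  constructor(
    private readonly threshold: number, // e.g. 3 repeated failures
    private readonly notifier: BrokenRuleNotifier
  ) {}

  recordSuccess(): void {
    this.consecutiveFailures = 0; // any success resets the streak
  }

  recordFailure(reason: string): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures === this.threshold) {
      // Fires once when the threshold is crossed, not on every failure,
      // so a single transient error doesn't generate noise.
      this.notifier.notify(reason);
    }
  }
}
```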
Links
Attempted fixes: https://github.com/elastic/kibana/pull/159351 https://github.com/elastic/kibana/pull/167244
Revert: https://github.com/elastic/kibana/pull/172913
Original issues: https://github.com/elastic/kibana/issues/116128 https://github.com/elastic/kibana/issues/160905
Internal issues: https://github.com/elastic/sdh-beats/issues/4082 https://github.com/elastic/sdh-kibana/issues/4299 https://github.com/elastic/sdh-kibana/issues/4122 https://github.com/elastic/sdh-kibana/issues/4158 https://github.com/elastic/sdh-kibana/issues/3329 https://github.com/elastic/sdh-kibana/issues/2759 https://github.com/elastic/sdh-kibana/issues/2860 https://github.com/elastic/sdh-kibana/issues/3069 https://github.com/elastic/sdh-kibana/issues/3117 https://github.com/elastic/sdh-kibana/issues/3768 https://github.com/elastic/sdh-kibana/issues/2436 https://github.com/elastic/sdh-kibana/issues/4695 https://github.com/elastic/sdh-kibana/issues/4875