Closed fkanout closed 2 years ago
The interval for the Metric Threshold Rule is the combination of timeSize and timeUnit, which is passed to buildFiredAlertReason as part of the alertResult object.
The problem with the alertResult in this method is that not all the values being passed in are represented by the type, only the values used in the method. If you look at buildNoDataAlertReason, you can see the timeSize and timeUnit values are represented there. The build reason methods all receive an alertResult object, which looks like this in "real life":
{
aggType: 'rate',
comparator: '>',
threshold: [1],
timeSize: 2,
timeUnit: 'm',
metric: 'system.network.in.bytes',
currentValue: 1333.3333333333333,
timestamp: '2022-01-24T18:41:07.868Z',
shouldFire: [true],
shouldWarn: [false],
isNoData: [false],
isError: false
}
@jasonrhodes At some point we need to refactor the Metric Threshold Rules to use concrete types instead of passing around these objects and typing just the parts we use. As it stands right now, I find myself always having to console.log() objects in these rules because there isn't a clear type, which kind of defeats the benefits of TypeScript.
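As a rough illustration of the "concrete types" idea, here is a minimal sketch of a single type that covers every field of the sample object logged above. The names MetricThresholdAlertResult, TimeUnit, and formatInterval are hypothetical and are not taken from the Kibana codebase; the set of allowed time units is also an assumption.

```typescript
// Hypothetical: the allowed units are an assumption; the sample only shows 'm'.
type TimeUnit = 's' | 'm' | 'h' | 'd';

// Hypothetical type name; the fields mirror the sample alertResult object.
interface MetricThresholdAlertResult {
  aggType: string;
  comparator: string;
  threshold: number[];
  timeSize: number;
  timeUnit: TimeUnit;
  metric: string;
  currentValue: number;
  timestamp: string;
  shouldFire: boolean[];
  shouldWarn: boolean[];
  isNoData: boolean[];
  isError: boolean;
}

// Combine timeSize and timeUnit into the rule interval, e.g. "2m".
function formatInterval(
  result: Pick<MetricThresholdAlertResult, 'timeSize' | 'timeUnit'>
): string {
  return `${result.timeSize}${result.timeUnit}`;
}

// The sample object from the comment above, now checked against the type.
const sample: MetricThresholdAlertResult = {
  aggType: 'rate',
  comparator: '>',
  threshold: [1],
  timeSize: 2,
  timeUnit: 'm',
  metric: 'system.network.in.bytes',
  currentValue: 1333.3333333333333,
  timestamp: '2022-01-24T18:41:07.868Z',
  shouldFire: [true],
  shouldWarn: [false],
  isNoData: [false],
  isError: false,
};

const interval = formatInterval(sample);
```

With one shared type, each build reason method could accept the full object (or a Pick of it) and the compiler would report any field that is missing or misspelled.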
Metric anomaly is not used at the moment. It was disabled a long time ago, which is why, when registering infra rule types to RAC, we didn't register the metric anomaly rule type at all. @simianhacker Do you know why metric anomaly is currently disabled?
@vinaychandrasekhar If we want to bring back the Metric anomaly rule type, we should prioritize it accordingly and create a ticket to register it with RAC.
@vinaychandrasekhar @katrin-freihofner What about the recovery message? Here's the current format:
{metric} is now {comparator} a threshold of {threshold} (current value is {currentValue}) for {group}
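To make the placeholder format concrete, the following sketch fills a template of this shape with values. The helper fillTemplate is hypothetical and is not how Kibana renders reason messages; the values passed in are illustrative only.

```typescript
// Hypothetical helper: replaces each {name} placeholder with values[name].
// Placeholders without a matching value are left untouched.
function fillTemplate(
  template: string,
  values: Record<string, string | number>
): string {
  return template.replace(/\{(\w+)\}/g, (match: string, key: string) =>
    Object.prototype.hasOwnProperty.call(values, key) ? String(values[key]) : match
  );
}

const recoveryTemplate =
  '{metric} is now {comparator} a threshold of {threshold} (current value is {currentValue}) for {group}';

// Illustrative values, not taken from a real alert.
const recoveryMessage = fillTemplate(recoveryTemplate, {
  metric: 'system.network.in.bytes',
  comparator: 'below',
  threshold: 1,
  currentValue: 0.5,
  group: 'host-1',
});
```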
@mgiota good catch, thanks! @hbharding could you please help with the format?
@vinaychandrasekhar Before moving on with this change from "ARRAY OF system.process.cpu.total.pct is greater than a threshold of 50 (current value is 70.8% for * ..." to "Multiple metrics match the condition in the last 5 min for {“all hosts” OR groupName}.", I would like to note here that currently the threshold and current value appear only in the reason message and nowhere else in the flyout details (see screenshot below).
Do we agree to pause only this specific change (the case of multiple conditions) and keep it as it is for now, until we implement "Store arbitrary expected and actual value results in the alerts-as-data indices"? I will prioritize this ticket and come up with a suggestion so that we can start working on storing the threshold and current value.
@mgiota Thanks for checking, I agree with your assessment.
cc @hbharding, @katrin-freihofner in case they have a different opinion.
@mgiota thank you, I agree.
Summary
Related to #117697
After reviewing our current alert reason messages, we want to improve and standardize them across Observability. The messages should be limited in length while conveying the same bits of important information. When applicable, reason messages should convey the following parts:
For consistency, all messages should use capitalization and end with periods.
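The capitalization and period rule could be enforced with a small normalizer. This is only a sketch; normalizeReason is a hypothetical helper, not an existing function in the codebase.

```typescript
// Hypothetical helper: capitalizes the first character and ensures the
// message ends with a period. Empty input is returned unchanged.
function normalizeReason(message: string): string {
  const trimmed = message.trim();
  if (trimmed.length === 0) {
    return trimmed;
  }
  const capitalized = trimmed.charAt(0).toUpperCase() + trimmed.slice(1);
  return capitalized.endsWith('.') ? capitalized : `${capitalized}.`;
}

const normalized = normalizeReason('multiple metrics match the condition in the last 5 min for all hosts');
```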
Changes
We've documented our current messages and what we'd like to change them to in the tables below for each app.
Metrics
*
...Notes: