elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.6k stars 8.21k forks source link

[RAC][Metrics] Review readability of Alert Reason Messages #121272

Closed fkanout closed 2 years ago

fkanout commented 2 years ago

Summary

Related to #117697

After reviewing our current alert reason mesages, we want to improve and standardize our alert reason messages across Observability. The messages should be limited in length while conveying the same bits of important information. When applicable, reason messages should convey the following parts:

  1. Metric - The name of the metric. E.g. “CPU usage”, “Log entries”, “Latency”, etc.
  2. Violation - Why the condition triggered. e.g. X > Y or X = true
  3. Current value - The value of the metric when the rule was run.
  4. Expected value - The expected value of the metric. E.g. the threshold.
  5. Interval - Over what period of time? e.g. “in the last 5 min”
  6. Affected entity - The name or group of the entity where the violation occurred.

For consistency, all messages should use capitalization and end with periods.

Changes

We've documented our current messages and what we'd like to change them to in the tables below for each app.

Metrics

Current Change to
1 CPU usage is greater than a threshold of 50 (current value is 70.8%) for * CPU usage is 70.8% in the last 5 min for all hosts. Alert when > 50%.
2 system.process.cpu.total.pct is greater than a threshold of 50 (current value is 70.8% for gke-edge-oblt-default-pool-350b44de-34vp. Metric is 70.8% in the last 5 min for gke-edge-oblt-default-pool-350b44de-34vp. Alert when > 50%.
3 ARRAY OF system.process.cpu.total.pct is greater than a threshold of 50 (current value is 70.8% for * ... 3 metrics match the condition in the last 5 min for {“all hosts” OR groupName}.
4 CPU usage has reported no data over the past 5m for * CPU usage reported no data in the last 5 min for {“all hosts” OR groupName}.
5 system.process.cpu.total.pct has reported no data over the past 5m for * Metric reported no data in the last 5 min for {“all hosts” OR groupName}.
6 Memory usage 5x lower A critical memory usage anomaly with a score of 96 was detected in the last 5 min for {“all hosts” OR groupName}.

Notes:

  1. The threshold is missing a % sign. Add this when applicable.
  2. Field names are hard to read. When user selects a metric which can only be represented by it's field name, use the word "Metric" instead.
  3. In the case of multiple metrics, we currently repeat the same sentence structure for each metric which leads to very long alert messages. Instead, just say "multiple metrics"
  4. Change "over the past" to "in the last"
  5. Again, don't show field names. Just say "Metric" if a name is not available
  6. Our Metric anomaly alerts should look similar to our APM anomaly type alerts.
simianhacker commented 2 years ago

The interval for the Metric Threshold Rule is the combination of timeSize and timeUnit which is passed to the buildFiredAlertReason as part of the alertResult object.

https://github.com/elastic/kibana/blob/d6af6b947cb8141170373143b143f9844bd9a963/x-pack/plugins/infra/server/lib/alerting/common/messages.ts#L111-L128

The problem with the alertResult in this method is that not all the values being passed in are represented by the type, only the values they are using in the method. If you look at buildNoDataAlertReason you can see the timeSize and timeUnit values are represented there. The build reason methods are all getting an alertResult object which looks like this in "real life":

{
  aggType: 'rate',
  comparator: '>',
  threshold: [1],
  timeSize: 2,
  timeUnit: 'm',
  metric: 'system.network.in.bytes',
  currentValue: 1333.3333333333333,
  timestamp: '2022-01-24T18:41:07.868Z',
  shouldFire: [true],
  shouldWarn: [false],
  isNoData: [false],
  isError: false
}

@jasonrhodes At some point we need to refactor the Metric Threshold Rules to use concrete types instead of passing around these objects and typing just the parts we use. As it stands right now, I found myself always having to console.log() objects in these rules because there isn't a clear type, which kind of defeats the benefits of Typescript.

mgiota commented 2 years ago

Metric anomaly is not used at the moment, it was disabled a long time ago and that's why when registering infra rule types to RAC, we didn't RAC register metric anomaly rule type at all. @simianhacker Are you maybe aware why metric anomaly is disabled at the moment?

@vinaychandrasekhar If we want to bring back the Metric anomaly rule type, we should prioritize it accordingly and create a ticket to RAC register it.

mgiota commented 2 years ago

@vinaychandrasekhar @katrin-freihofner What about the recovery message? Here's current format:

{metric} is now {comparator} a threshold of {threshold} (current value is {currentValue}) for {group}

vinaychandrasekhar commented 2 years ago

@mgiota good catch, thanks! @hbharding could you please help with the format?

mgiota commented 2 years ago

@vinaychandrasekhar Before moving on with this change from ARRAY OF system.process.cpu.total.pct is greater than a threshold of 50 (current value is 70.8% for * ... to Multiple metrics match the condition in the last 5 min for {“all hosts” OR groupName}., I would like to note here that currently threshold and current value appear only in the reason message and nowhere else in the flyout details (see screenshot below).

Do we agree to pause only this specific change (case of multiple conditions) and keep it as it is at the moment, until we implement Store arbitrary expected and actual value results in the alerts-as-data indices ? I will prioritize this ticket and come up with a suggestion so that we can start working on storing threshold and current value.

Screenshot 2022-01-27 at 13 34 53
vinaychandrasekhar commented 2 years ago

@mgiota Thanks for checking, I agree with your assessment.

cc @hbharding, @katrin-freihofner in case they have a different opinion.

katrin-freihofner commented 2 years ago

@mgiota thank you, I agree.