elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Obs Alerting] [Discuss] Show all "observability" and "stack" rules and alerts in the Observability UI #186083

Open jasonrhodes opened 3 weeks ago

jasonrhodes commented 3 weeks ago

NOTE: THIS ISSUE IS A DRAFT IN PROGRESS.

In the interest of breaking down the virtual "barrier" between so-called "observability" rules and alerts and "stack" rules and alerts, I'd like to suggest we start showing all of these rules and alerts in the Observability UI, encouraging users to use rule tagging to control which alerts are shown in the resulting alerts table.

Context

The alerts table on the Observability > Alerts page uses the <AlertsStateTable> component to pull in alerts that match a passed-in list of pre-selected "feature IDs" (which, behind the scenes, are mapped to values known as "producer" and "consumer").

Screenshot 2024-06-11 at 3 53 17 PM

Note: It's unclear why kibana.alert.rule.type, which I added here to demonstrate the rule type for the given alert, does not produce any value.

The list of feature IDs that we use to filter this set of alerts is this:

AlertConsumers.APM,
AlertConsumers.INFRASTRUCTURE,
AlertConsumers.LOGS,
AlertConsumers.UPTIME,
AlertConsumers.SLO,
AlertConsumers.OBSERVABILITY,

(Source)
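
To make the filtering concrete, here is a minimal sketch of how a feature ID list like the one above could narrow a set of alerts. It is not the actual <AlertsStateTable> implementation: the `AlertSummary` shape, the `filterAlertsByFeatureIds` helper, the string values of the constants, and the "match on producer or consumer" rule are all assumptions made for illustration.

```typescript
// Hypothetical stand-in for Kibana's AlertConsumers. The string values are
// illustrative assumptions and may not match the real constants.
const AlertConsumers = {
  APM: 'apm',
  INFRASTRUCTURE: 'infrastructure',
  LOGS: 'logs',
  UPTIME: 'uptime',
  SLO: 'slo',
  OBSERVABILITY: 'observability',
  MONITORING: 'monitoring',
  STACK_ALERTS: 'stack_alerts',
} as const;

type FeatureId = (typeof AlertConsumers)[keyof typeof AlertConsumers];

// Hypothetical, simplified alert shape (not the real alert document).
interface AlertSummary {
  id: string;
  ruleTypeName: string;
  producer: FeatureId;
  consumer: FeatureId;
}

// The list from the issue: MONITORING and STACK_ALERTS are left out.
const observabilityFeatureIds: readonly FeatureId[] = [
  AlertConsumers.APM,
  AlertConsumers.INFRASTRUCTURE,
  AlertConsumers.LOGS,
  AlertConsumers.UPTIME,
  AlertConsumers.SLO,
  AlertConsumers.OBSERVABILITY,
];

// Assumption: an alert is shown when its producer or its consumer is in the list.
function filterAlertsByFeatureIds(
  alerts: readonly AlertSummary[],
  featureIds: readonly FeatureId[]
): AlertSummary[] {
  const allowed = new Set<FeatureId>(featureIds);
  return alerts.filter(
    (alert) => allowed.has(alert.producer) || allowed.has(alert.consumer)
  );
}

const sampleAlerts: AlertSummary[] = [
  {
    id: 'a1',
    ruleTypeName: 'Metric Threshold',
    producer: AlertConsumers.INFRASTRUCTURE,
    consumer: AlertConsumers.INFRASTRUCTURE,
  },
  {
    id: 'a2',
    ruleTypeName: 'Elasticsearch Query',
    producer: AlertConsumers.STACK_ALERTS,
    consumer: AlertConsumers.STACK_ALERTS,
  },
];

// With the current list, the Elasticsearch Query alert (a2) is dropped.
const visible = filterAlertsByFeatureIds(sampleAlerts, observabilityFeatureIds);
console.log(visible.map((alert) => alert.id));
```

Under these assumptions, adding `AlertConsumers.STACK_ALERTS` to the list would be enough to make the second alert appear, which is roughly the change this issue proposes.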

The main feature IDs we currently leave out of this list are "MONITORING" (for explicit Stack Monitoring rule types) and "STACK_ALERTS", the latter of which would bring in alerts with a "stack_alerts" producer/consumer pair. At the moment, this pair applies to rule types registered by the Response Ops team's code, e.g. the ES Query rule, whenever a rule of that type is created in the Stack Management section of the Kibana app.

Screenshot 2024-06-11 at 4 11 15 PM Screenshot 2024-06-11 at 4 04 14 PM

Problem

The problem with the current MONITORING and STACK_ALERTS feature IDs and producer/consumer values is that they mix up two different concepts:

  1. Elastic Stack Monitoring rule types - these are rule types that are meant to explicitly monitor the Elastic stack. This would include all of the rule types within the MONITORING feature ID, as well as the "Transform Health" rule type contained within the STACK_ALERTS feature ID.
  2. STACK_ALERTS rule types - these are rule types that happen to be registered by the "stack", i.e. the Response Ops Kibana plugins. This category includes the above-mentioned "Transform Health" rule type, but it also includes the "Elasticsearch Query", "Index Threshold", and "Tracking Containment" rule types.

In reality, I think we have three different categories of rule types that are available for our customers to use.

  1. Elastic Stack Monitoring rule types (see above number 1)
  2. Generic rule types - these are rule types that allow customers to build extremely flexible rules that use Elasticsearch queries to produce complicated rule scenarios. This includes "Elasticsearch Query", "Index Threshold", and "Tracking Containment" from the STACK_ALERTS feature ID as well as "Custom Threshold" from the OBSERVABILITY feature ID and, to some degree, the "Metric Threshold" and "Log Threshold" rule types from the INFRASTRUCTURE and LOGS feature IDs, respectively.
  3. Specialized observability rule types - these are the rules that have been carefully set up to query observability data for a customer to use to monitor the applications and infrastructure that they are observing with the Elastic observability toolset, e.g. all of the APM rule types, all of the synthetics and uptime rule types, etc.
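
The three categories above can be sketched as a small classification function. This is only an illustration of the proposed grouping: the `RuleCategory` and `RuleTypeInfo` types and the `categorizeRuleType` helper are hypothetical, and rule types are keyed by display name because the issue does not give their registered IDs. "Metric Threshold" and "Log Threshold" are placed in the generic group even though the issue says they fit there only "to some degree".

```typescript
// Hypothetical category labels for the three groups described in the issue.
type RuleCategory =
  | 'stack-monitoring'
  | 'generic'
  | 'specialized-observability';

// Hypothetical shape: a rule type's display name plus the feature ID it belongs to.
interface RuleTypeInfo {
  name: string;
  featureId: string;
}

// Rule types the issue lists as "generic" (display names, not real rule type IDs).
const GENERIC_RULE_TYPE_NAMES: ReadonlySet<string> = new Set([
  'Elasticsearch Query',
  'Index Threshold',
  'Tracking Containment',
  'Custom Threshold',
  'Metric Threshold',
  'Log Threshold',
]);

function categorizeRuleType(ruleType: RuleTypeInfo): RuleCategory {
  // Category 1: everything under MONITORING, plus "Transform Health",
  // which is registered under STACK_ALERTS.
  if (ruleType.featureId === 'MONITORING' || ruleType.name === 'Transform Health') {
    return 'stack-monitoring';
  }
  // Category 2: flexible, query-driven rule types.
  if (GENERIC_RULE_TYPE_NAMES.has(ruleType.name)) {
    return 'generic';
  }
  // Category 3: everything else (APM, synthetics, uptime, and so on).
  return 'specialized-observability';
}

// Two STACK_ALERTS rule types land in different categories.
console.log(categorizeRuleType({ name: 'Transform Health', featureId: 'STACK_ALERTS' }));
console.log(categorizeRuleType({ name: 'Elasticsearch Query', featureId: 'STACK_ALERTS' }));
```

The sketch shows that the feature ID alone cannot decide the category: two rule types under STACK_ALERTS end up in different groups.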

Because the STACK_ALERTS feature ID mixes together two of these categories (Transform Health from the first category and Elasticsearch Query, Index Threshold, and Tracking Containment from the second category), omitting all alerts created by STACK_ALERTS rule types leads to a confusing situation where alerts are missing from the observability alerts table for no discernible reason.

The "consumer" value as a fix

There is a partial fix for this problem today: every rule that is instantiated has both a "producer" value and a "consumer" value. The "producer" value is static per rule type and represents where this rule type is registered in Kibana code. For example, the infra app registers the "Metric Threshold" rule and the "Log Threshold" rule, and each is given a static "producer" value (INFRASTRUCTURE and LOGS, respectively).
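
As a rough sketch of the producer side of this, the snippet below models a rule type registry in which the producer is fixed at registration time and copied onto every rule instance, while the consumer is supplied when the rule is created. The `registerRuleType` and `createRule` helpers, the rule type IDs, and the lowercase producer strings are hypothetical and are not Kibana's actual alerting API.

```typescript
// Hypothetical rule type definition: the producer is fixed when the type is registered.
interface RuleTypeDefinition {
  id: string;
  name: string;
  producer: string;
}

// Hypothetical rule instance: carries both a producer and a consumer.
interface RuleInstance {
  ruleTypeId: string;
  producer: string;
  consumer: string;
}

const ruleTypeRegistry = new Map<string, RuleTypeDefinition>();

function registerRuleType(definition: RuleTypeDefinition): void {
  ruleTypeRegistry.set(definition.id, definition);
}

// The producer always comes from the registered rule type;
// the consumer is whatever the caller passes in at creation time.
function createRule(ruleTypeId: string, consumer: string): RuleInstance {
  const definition = ruleTypeRegistry.get(ruleTypeId);
  if (definition === undefined) {
    throw new Error(`Unknown rule type: ${ruleTypeId}`);
  }
  return { ruleTypeId, producer: definition.producer, consumer };
}

// Illustrative IDs and values; the real identifiers are not given in the issue.
registerRuleType({ id: 'metric-threshold', name: 'Metric Threshold', producer: 'infrastructure' });
registerRuleType({ id: 'log-threshold', name: 'Log Threshold', producer: 'logs' });

const sameConsumer = createRule('metric-threshold', 'infrastructure');
const otherConsumer = createRule('metric-threshold', 'some-other-consumer');
console.log(sameConsumer.producer === otherConsumer.producer);
```

In this model, two rules of the same type always share a producer but may differ in consumer, which is what makes the consumer usable as a filtering key.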

The "consumer" value for a given rule, by contrast, is set based on where the rule was created. In other words, if you create a metric threshold rule from the Observability rules page (/app/observability/alerts/rules), its consumer will be set to "infrastructure" (copied from its producer). However, if you create that same rule from the Stack Management rules page (/app/management/insightsAndAlerting/triggersActions/rules)

NOTE: stopped here to confirm the above and ran into some issues, will clarify

elasticmachine commented 3 weeks ago

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)