elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Observability] Inventory rule doesn't alert on "Alert me if there's no data" #165421

Open TheRiffRafi opened 1 year ago

TheRiffRafi commented 1 year ago

Kibana version: 8.9.1

Elasticsearch version: 8.9.1

Describe the bug:

When an inventory rule in Observability is configured with the option "Alert me if there's no data" enabled, the rule doesn't generate an alert when there is no data. Choosing to alert on "Status Change" or "Checks interval" has no effect on whether "no data" is reported.

Steps to reproduce:

  1. Set up an environment (Metricbeat with the system module).
  2. Go to Observability - Alerts - Create rule - Inventory Type.
  3. Configure any threshold.
  4. Configure "Alert me if there's no data"
  5. Enable email notification.
  6. Observe alerting on set threshold value.
  7. Stop Metricbeat.
  8. Observe that there is no alert on "no data".
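To confirm that data has really stopped arriving after step 7, one option is to count recent documents directly in Elasticsearch. The snippet below is a minimal sketch that only builds the request for the `_count` API; the index pattern `metricbeat-*`, the 5-minute window, and the helper name `recent_docs_count_request` are assumptions for illustration and are not taken from the report.

```python
import json


def recent_docs_count_request(index_pattern="metricbeat-*", window="5m"):
    """Build the path and body of an Elasticsearch _count request.

    The request counts documents whose @timestamp is newer than
    `now - window`. Both defaults are assumptions; adjust them to the
    indices and check interval used by the rule under test.
    """
    path = f"/{index_pattern}/_count"
    body = {"query": {"range": {"@timestamp": {"gte": f"now-{window}"}}}}
    return path, body


path, body = recent_docs_count_request()
print("GET", path)
print(json.dumps(body, indent=2))
```

If the returned count is 0 for the rule's time window and no alert fires, the "no data" condition is met and the behavior described in this issue is reproduced.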

Expected behavior:

A notification should be sent when no data is received.

Any additional context: I tested the same steps in version 7.17.9 and could not reproduce the issue; there, alerting notifies on "no data" when Metricbeat is stopped. For version 8.x, I've only tested the latest (8.9.1) and 8.8.2.

elasticmachine commented 12 months ago

Pinging @elastic/unified-observability (Team:Observability)

elasticmachine commented 6 months ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

elasticmachine commented 6 months ago

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

mgiota commented 6 months ago

@TheRiffRafi I am gonna take a look and come back with more answers.

jasonrhodes commented 6 months ago

For reference, the inventory rule communicates this option like this:

Screenshot 2024-02-27 at 3 32 47 PM

Whereas the metric threshold and custom threshold rules communicate their similar option differently:

Screenshot 2024-02-27 at 3 38 49 PM

Based on the language used, would it be okay for the Inventory Rule to trigger this alert only when there are zero documents returned overall (I'm not sure what the "or if the alert fails to query Elasticsearch" is meant to get at)? We probably want @vinaychandrasekhar to weigh in from the product perspective on this, and on whether we need to continue offering this option in the Inventory Rule in the first place.

renangenova commented 6 months ago

Thank you for the continuation of this bug - I've created a KB article for visibility: https://support.elastic.dev/knowledge/view/f7e0ba8d

maryam-saeidi commented 6 months ago

@jasonrhodes For the metric threshold, we have 2 no data settings, one for overall, and one for missing group. I think the one that you shared for the inventory rule is similar to the overall setting in the metric threshold one:

It is important to note that

  1. We removed the overall setting in the custom threshold UI to simplify the UI.
  2. If I remember correctly, in the metric and custom threshold rules, the overall setting does not work if we add a group; in that case, the missing group setting is the one that applies.
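
The interaction between the two settings described above can be sketched roughly as follows. This is an illustration of the described behavior only, not the Kibana implementation; the function name and every parameter name are hypothetical.

```python
def no_data_targets(group_by, doc_count, previous_groups, current_groups,
                    alert_on_no_data, alert_on_missing_group):
    """Return the targets that should get a "no data" alert.

    Hypothetical model of the two settings:
    - without a group-by, only the overall setting matters and it fires
      when the query returns zero documents;
    - with a group-by, the overall setting is ignored and only the
      missing group setting applies, firing for groups that were seen
      before but are absent now.
    """
    if not group_by:
        return ["*"] if alert_on_no_data and doc_count == 0 else []
    if alert_on_missing_group:
        return sorted(set(previous_groups) - set(current_groups))
    return []


# Ungrouped rule, no documents at all: the overall setting applies.
print(no_data_targets(False, 0, set(), set(), True, False))
# Grouped rule: the overall setting alone reports nothing.
print(no_data_targets(True, 0, {"host-a", "host-b"}, {"host-a"}, True, False))
# Grouped rule with the missing group setting: the vanished group is reported.
print(no_data_targets(True, 5, {"host-a", "host-b"}, {"host-a"}, False, True))
```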

jasonrhodes commented 6 months ago

Makes sense, @maryam-saeidi, thanks for those explanations.

jasonrhodes commented 6 months ago

What's the level of effort involved in making this work roughly as expected for the inventory rule?

@vinaychandrasekhar we should talk about options re: this no data scenario for the inventory rule (and possibly for the other rules).

maryam-saeidi commented 6 months ago

@jasonrhodes I think this functionality is not the best way to solve the underlying issue (the availability of a service or its related data), and we need to solve it at a different level (meaning the rule level). We previously had a discussion with ResponseOps about having similar functionality for all rules, not only the infra-related ones (inventory/metric threshold/custom threshold). Here is the outcome of the previous discussion. This will also cause an issue when we send notifications, as we don't have a separate recovery notification for each group of triggering alerts (alert/no data/warning).

My suggestion is to focus on introducing this functionality for all rules and to remove/deprecate the per-rule logic. When there is no data, the rule would be in a warning state: nothing about the condition related to this alert is wrong, we simply don't have any data to draw that conclusion. This is relevant for any rule, not only infra-related ones.
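As a rough sketch of that suggestion, assuming a simple "greater than" threshold condition: an empty result puts the rule in a warning state instead of evaluating the condition. The `RuleState` enum and `evaluate_rule` function are hypothetical names for illustration and do not correspond to existing Kibana code.

```python
from enum import Enum


class RuleState(Enum):
    OK = "ok"
    ALERT = "alert"
    WARNING = "warning"  # no data: the condition cannot be evaluated


def evaluate_rule(values, threshold):
    """Evaluate a hypothetical "value > threshold" rule.

    With no values there is nothing to compare against the threshold,
    so the rule reports WARNING rather than ALERT or OK.
    """
    if not values:
        return RuleState.WARNING
    return RuleState.ALERT if max(values) > threshold else RuleState.OK


print(evaluate_rule([], 80))        # no data
print(evaluate_rule([42, 91], 80))  # threshold breached
print(evaluate_rule([42, 60], 80))  # within threshold
```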

Also, I can imagine 2 different teams being responsible for handling this issue:

  1. a team responsible for monitoring data ingestion and infrastructure.
  2. an app-level team that knows about the services and how to monitor them.

jasonrhodes commented 6 months ago

That sounds very reasonable as a general way forward, but I'd like to hear from product about how comfortable they are with just removing the checkbox from existing rules. If we can't remove it and can only deprecate it, I think we should fix it so that it at least "works" as best as it can in the current context.

vinaychandrasekhar commented 6 months ago

I discussed this with @jasonrhodes. Do we know what the level of effort is for fixing this on the Inventory threshold rule? If small(ish), we should discuss and get this on our team backlog to fix. If it's a large effort, let's chat live.

I agree with Maryam above that the longer-term fix is to treat the need to "alert on no data" as a use case that is related to, but separate from, the day-to-day monitoring and alerting needs around thresholds, inventory monitoring, and such. The (separate) alert will also help SREs plan for and manage the lack of data with things like automated baselining, analytics, and visualizations, beyond "just" alerting.