elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[AO][SERVERLESS][BUG] Custom Threshold rule preview chart is not working after infra plugin disabled #166851

Closed fkanout closed 11 months ago

fkanout commented 1 year ago

📝 Summary

Update: Related to https://github.com/elastic/kibana/issues/167390

In the rule creation flyout, we use the endpoint /api/infra/metrics_explorer to render the chart.

However, the Infra plugin is no longer available in the serverless offering, so that endpoint is unavailable as well. As a result, the chart renders with no data even though data exists.

Screenshot 2023-09-20 at 16 52 26

✅ AC

elasticmachine commented 1 year ago

Pinging @elastic/actionable-observability (Team: Actionable Observability)

maryam-saeidi commented 1 year ago

@fk Since you picked this ticket, I will share some information based on the discussion with @simianhacker. He proposed four options:

  1. Explore using a Lens embeddable, but it might not work.
  2. Move Infra's Metrics Explorer API endpoint to a shared data plugin (based on Jason's Tiered Plugin architecture)
  3. Copy the Infra's Metrics Explorer API into our plugin (which means we own it)
  4. Just create a new API that takes the rule params and returns the timeseries data. This option would allow us to re-use all the code that generates the aggregation that we currently maintain for the executor.

He also pointed out that the first option does not work, mainly because of the boolean logic we expose.

Imagine you had this:

  • A - average of 'system.cpu.user.pct'
  • B - average of 'system.cpu.system.pct'
  • C - max of 'system.cpu.cores'

The equation in the rule might be C > 1 ? (A + B) / C : A + B (ignore the fact that there is no difference in dividing A + B by 1 or a number greater than 1). The equivalent Lens equation would be:

ifelse(max(system.cpu.cores) > 1, (average(system.cpu.user.pct) + average(system.cpu.system.pct)) / max(system.cpu.cores), average(system.cpu.user.pct) + average(system.cpu.system.pct))

Lens also doesn't support && or ||, so option 1 is a no-go.
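To make the example above concrete, here is a minimal TypeScript sketch of the same equation and of the Lens formula string quoted in the comment. The `MetricValues` interface and both function names are hypothetical and exist only for illustration; they are not taken from the Kibana codebase.

```typescript
// Hypothetical shape: one aggregated value per letter used in the rule equation.
interface MetricValues {
  A: number; // average of system.cpu.user.pct
  B: number; // average of system.cpu.system.pct
  C: number; // max of system.cpu.cores
}

// The rule equation from the example: C > 1 ? (A + B) / C : A + B
function evaluateEquation({ A, B, C }: MetricValues): number {
  return C > 1 ? (A + B) / C : A + B;
}

// Assembles the Lens formula quoted in the comment from the same three parts.
function toLensFormula(): string {
  const a = 'average(system.cpu.user.pct)';
  const b = 'average(system.cpu.system.pct)';
  const c = 'max(system.cpu.cores)';
  return `ifelse(${c} > 1, (${a} + ${b}) / ${c}, ${a} + ${b})`;
}
```

A single ternary like this maps onto ifelse; the objection in the comment is about conditions that combine several comparisons with && or ||.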

The second option is not a good fit either: we want to deprecate the Infra rules in favor of the Custom threshold rule, so there is no need to share this logic. I briefly tried the third option, and it was not a clean or straightforward approach. It makes more sense to reimplement the logic in the observability plugin. Chris is on the same page; he said:

We already have code that generates the aggregations and it's really just taking those aggregations and putting them under a date_histogram and formatting the response.

So, the best option in this case is option #4.
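As a rough sketch of what option 4 could look like, the snippet below nests a set of existing metric aggregations under a date_histogram and flattens the resulting buckets into chart rows. Everything here is an assumption made for illustration: the names `PreviewParams`, `buildPreviewRequest`, and `formatBuckets`, the `timeseries` aggregation key, and the response shape are hypothetical and not the actual Kibana implementation. Only the Elasticsearch request structure (`date_histogram` with `fixed_interval`, a `range` filter, nested `aggs`) follows the standard query DSL.

```typescript
// Hypothetical: metric aggregations as the rule executor might already build them,
// e.g. { A: { avg: { field: 'system.cpu.user.pct' } } }.
type MetricAggs = Record<string, Record<string, { field: string }>>;

interface PreviewParams {
  metricAggs: MetricAggs;
  timeField: string; // e.g. '@timestamp'
  interval: string;  // e.g. '1m'
  from: string;      // ISO date or date math
  to: string;
}

// Wrap the existing aggregations in a date_histogram to get one value per time bucket.
function buildPreviewRequest({ metricAggs, timeField, interval, from, to }: PreviewParams) {
  return {
    size: 0, // only aggregations are needed, not hits
    query: { bool: { filter: [{ range: { [timeField]: { gte: from, lte: to } } }] } },
    aggs: {
      timeseries: {
        date_histogram: { field: timeField, fixed_interval: interval },
        aggs: metricAggs,
      },
    },
  };
}

// Assumed bucket shape: a key, a doc_count, and one { value } entry per metric name.
interface Bucket {
  key: number;
  doc_count: number;
  [aggName: string]: number | { value: number | null };
}

interface Row {
  timestamp: number;
  values: Record<string, number | null>;
}

// Flatten buckets into rows; a metric missing from a bucket becomes null.
function formatBuckets(buckets: Bucket[], metricNames: string[]): Row[] {
  return buckets.map((bucket) => {
    const values: Record<string, number | null> = {};
    for (const name of metricNames) {
      const agg = bucket[name];
      values[name] = typeof agg === 'object' ? agg.value : null;
    }
    return { timestamp: bucket.key, values };
  });
}
```

The point of this shape is reuse: the aggregation-building code stays in one place, and the preview endpoint only adds the histogram wrapper and the response formatting.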

fkanout commented 1 year ago

@maryam-saeidi, thank you for sharing and putting these options together. I already discussed these options with Chris (you were faster and posted them here 💪🏻). Indeed, the 4th option seems the way to go (I'm still investigating it).

However, I'm still leaning toward using Lens, and I found a way to produce boolean logic using math and confirmed it with the Lens team. And they were kind enough to open an issue to add the boolean logic to the Lens formula.
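The comment does not say which approach was confirmed with the Lens team. Purely as a general illustration, and not as a description of that approach, one common way to express boolean logic with arithmetic is to treat each condition as a 0/1 indicator and combine indicators with multiplication and addition. The helper names below are hypothetical.

```typescript
// Assumption: each condition has already been reduced to an indicator, 1 for true, 0 for false.
type Indicator = 0 | 1;

// AND: the product is 1 only when both indicators are 1.
function and(a: Indicator, b: Indicator): Indicator {
  return (a * b) as Indicator;
}

// OR: a + b - a*b is 0 only when both indicators are 0.
function or(a: Indicator, b: Indicator): Indicator {
  return (a + b - a * b) as Indicator;
}

// NOT: flips 0 and 1.
function not(a: Indicator): Indicator {
  return (1 - a) as Indicator;
}
```

With indicators in this form, a condition such as "X and (Y or Z)" becomes a plain arithmetic expression, which is why this kind of encoding can stand in for && and || where only math is available.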

chrisdistasio commented 1 year ago

@emma-raffenne The Infra plugin is soon to be enabled for serverless. With that being the case, does this work still need to be done?

cc: @smith @roshan-elastic @neptunian

roshan-elastic commented 1 year ago

Hey @emma-raffenne @chrisdistasio - FYI we're not enabling any alerting in the Infra UI in serverless until we figure out how to get threshold rules working nicely for Infra:

I'm picking up with Vinay on how we can prioritise this but just sharing...

elasticmachine commented 11 months ago

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

maryam-saeidi commented 11 months ago

Findings related to a null value in equations:

I've added the following mapping and values to the high cardinality cluster:

[images: Mapping, Values]

  1. When I search for a field in average or any other aggregation, I only see the field with the long type, not the boolean type. Also, from the number type fields, I only see the one that actually has a value:

  2. When checking for null: after I changed the value of numberNullField from 1 to null, I got a no data alert.

    [images: Rule, Alert]

  3. I also checked the query including Elvis operator, ?:, and it didn't work as expected when the field was explicitly set to null:

    It also didn't work when the field didn't exist: