elastic / kibana


[Inventory] Alerts not matching to K8s entities #202355

Closed: roshan-elastic closed this 3 days ago

roshan-elastic commented 2 weeks ago

Description

Alerts grouped by the k8s entity ID are not showing against the k8s entities in the Inventory (or are showing incorrectly):

Steps to Replicate

  1. Incorrect alerts showing

https://github.com/user-attachments/assets/689858b3-d85b-49b5-9dd7-d4fc98450378

  2. Alerts missing

https://github.com/user-attachments/assets/90bce184-e706-4dad-a804-553694fc7c71

elasticmachine commented 2 weeks ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

roshan-elastic commented 2 weeks ago

cc @kpatticha - You can test this on edge-logs if you want to quickly replicate

jennypavlova commented 1 week ago

Hi @roshan-elastic, I am trying to understand the issue better and tried it on the edge-logs cluster.

First issue:

To summarize: in some cases (like k8s.cluster.ecs) we see fewer alerts in Inventory, and when we click on the alerts link we see more alerts, because of the filter (in the filter field we only have "edge-log", while on the rule page you showed the filter is orchestrator.cluster.name: "edge-log"). So when we get the count we use the filter, but when we navigate to the alerts we only have the value ("edge-log") set in the search bar (sketched below):

Inventory -> Alerts page after count click. Is that the expected filter?
Image Image Image

I am asking this because I saw that we removed the field on purpose as part of https://github.com/elastic/kibana/pull/202188 (last table entry in the PR description), so do we want that back to fix this, or is there something else I am missing here 🤔?
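To make the mismatch concrete, here is a rough sketch (my reading of the screenshots above, so treat it as an assumption rather than exact UI output). On one side there is the structured filter shown on the rule page:

```
orchestrator.cluster.name : "edge-log"
```

On the other side, after clicking the count, the Alerts page search bar only contains the bare value edge-log, which is a free-text query and can match a different (typically broader) set of alerts, hence the differing counts.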


Second issue:

I checked the data and compared the rules, and the only difference is that this rule uses the metrics data view instead of the logs one we have for the cluster rule. I am not super familiar with the custom threshold rules, but in theory it should work with different data views (we have the correct mapping 'k8s.deployment.ecs' => 'kubernetes.deployment.name' in Inventory), so I assume something is wrong with the filtering. I tried to execute both queries against the alerts indices we use in Inventory to get the alerts count; the host one, for example, returns results but the deployment one doesn't:

Image

But on the alerts page I see the alerts:

Image

⚠️ Not with the field filter (kubernetes.deployment.name : *), though:

Image

So maybe this is something to check with the @elastic/response-ops team 🤔. This is the alert rule: (logs cluster link); it's also shown in the second video in the description.

The queries from the screenshot:

```
# no results
GET .alerts-*/_search
{
  "size": 0,
  "track_total_hits": false,
  "query": {
    "bool": {
      "filter": [
        { "term": { "kibana.alert.status": "active" } }
      ]
    }
  },
  "aggs": {
    "k8s.deployment.ecs": {
      "composite": {
        "size": 500,
        "sources": [
          { "kubernetes.deployment.name": { "terms": { "field": "kubernetes.deployment.name" } } }
        ]
      }
    }
  }
}

# returns results
GET .alerts-*/_search
{
  "size": 0,
  "track_total_hits": false,
  "query": {
    "bool": {
      "filter": [
        { "term": { "kibana.alert.status": "active" } }
      ]
    }
  },
  "aggs": {
    "host": {
      "composite": {
        "size": 500,
        "sources": [
          { "host.name": { "terms": { "field": "host.name" } } }
        ]
      }
    }
  }
}
```
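One way to narrow this down (just a suggestion, not something verified here) would be to check whether any active alert documents contain the kubernetes.deployment.name field at all, since a composite aggregation simply returns no buckets when the source field is missing from the documents:

```
# sketch: do any active alerts carry kubernetes.deployment.name at all?
GET .alerts-*/_search
{
  "size": 1,
  "track_total_hits": true,
  "query": {
    "bool": {
      "filter": [
        { "term": { "kibana.alert.status": "active" } },
        { "exists": { "field": "kubernetes.deployment.name" } }
      ]
    }
  }
}
```

If this returns zero hits, the alerts created by the deployment rule don't carry that field, which would line up with the missing counts in Inventory and with the alerts not showing up under the kubernetes.deployment.name : * filter.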
roshan-elastic commented 1 week ago

Hey @jennypavlova,

cc @simianhacker

Thanks for picking this up.

First issue:

Ah - good catch here. This predates the simplification of the alerts filtering to filter by just the string of the entity name. Either way, I have no idea why the 'host' rules are matching the cluster as I'm unable to replicate this with other rules.

Here's a vid to show you:

Video

I have a hypothesis that filtering to only show alerts using kibana.alert.group.field : {ID FIELD} would do the trick:

Alerts for a host Image

Alerts for a kubernetes pod Image

I'm not 100% sure this is the right solution, but I think it would ensure that alerts are explicitly grouped by the entity we're showing.
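To illustrate the idea (just a sketch, assuming the alert documents expose kibana.alert.group.field as a queryable keyword field, which I haven't verified), the Inventory count query from above could add the group field as an extra filter, e.g. for pods:

```
# sketch: only count active alerts explicitly grouped by the entity's ID field
GET .alerts-*/_search
{
  "size": 0,
  "track_total_hits": false,
  "query": {
    "bool": {
      "filter": [
        { "term": { "kibana.alert.status": "active" } },
        { "term": { "kibana.alert.group.field": "kubernetes.pod.name" } }
      ]
    }
  },
  "aggs": {
    "k8s.pod.ecs": {
      "composite": {
        "size": 500,
        "sources": [
          { "kubernetes.pod.name": { "terms": { "field": "kubernetes.pod.name" } } }
        ]
      }
    }
  }
}
```

That way an alert would only count against an entity type when the rule actually grouped by that entity's ID field, rather than merely mentioning the value somewhere in the document.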

I like the elegance of just filtering by the entity you're looking for, but I personally don't understand why those host alerts are showing against the cluster (and why the pod alerts in the same cluster don't show up in the Alerts app if I simply filter by the cluster name, edge-logs).

I figure if I can't figure it out, our users aren't going to either.

Any thoughts/ideas?