elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Stack Monitoring] Highlight when stats data set is missing for monitored components #155980

Open miltonhultgren opened 1 year ago

miltonhultgren commented 1 year ago

Background

A common source of support escalations and (informal) bug reports is when the UI doesn't show a monitored component even though it seems data collection is working as expected.

The reporter typically concludes that data collection is working because the right indices are present (e.g. .ds-.monitoring-es-8-mb-2023.04.24-000888 for Elasticsearch), but this doesn't tell the whole story because the indices contain multiple data sets. If the stats documents (see list below) are missing from the index for the inspected time range, the Stack Monitoring UI will not display the component (usually because we pull the ID and cluster UUID from those documents).
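As a rough illustration of how to check which data sets are actually present in a time range, here is a minimal sketch of a request body that counts documents per data set. The function name `buildDatasetCountQuery` is made up for this example, and the field names (`metricset.name`, `@timestamp`) are assumptions about the monitoring documents that should be adjusted to the collection mode in use.

```typescript
// Sketch only: builds a search request body that counts documents per data set.
// Assumption: documents carry the data set name in `metricset.name` and a
// `@timestamp` field. Neither field name is confirmed by this issue.
function buildDatasetCountQuery(
  from: string,
  to: string,
  datasetField = "metricset.name"
) {
  return {
    size: 0, // we only want the aggregation, not the hits
    query: {
      bool: {
        filter: [{ range: { "@timestamp": { gte: from, lte: to } } }],
      },
    },
    aggs: {
      // one bucket per data set found in the time range
      datasets: { terms: { field: datasetField, size: 50 } },
    },
  };
}

// Example: inspect the last hour of data.
const lastHourQuery = buildDatasetCountQuery("now-1h", "now");
console.log(JSON.stringify(lastHourQuery, null, 2));
```

If the stats data set has no bucket in the response while other data sets do, that matches the situation described above: the index exists, but the UI has nothing to resolve the component from.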

Failures to collect this particular data set are not uncommon, since those endpoints are often more sensitive to resource pressure or configuration issues. However, such failures are often visible in the collection logs, so investigation should start there.

Note: For Agent-based collection, each data set gets its own data stream, but users might still assume too quickly that the presence of any data set means collection is working. They may not be aware of the importance of the stats stream and so never verify that this data stream exists.

Relevant data sets:

Goals

While we might not be able to improve the success rate of collection, we can do a better job at highlighting this situation in the UI, which hopefully will lead to

which should speed up time to resolution.

Ideally, when trying to resolve which components exist, we would run wider queries to find components that have other data sets but lack the stats data set. While we may not be able to say how many or exactly which components we find (or which cluster they belong to), we can still highlight, for example, that "we found Kibana metrics for this cluster but no stats" (the wording here needs to be made more user-friendly) and hint that users should look into why that might be.
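The wider check described above could be sketched roughly as follows. This is not code from Kibana: the `DatasetBucket` shape, the function `findProductsMissingStats`, and the product-to-data-set mapping are all hypothetical and only illustrate the idea of "data found for a product, but no stats".

```typescript
// Hypothetical shape: one entry per (product, data set) pair found by a wide query.
interface DatasetBucket {
  product: string; // e.g. "elasticsearch", "kibana"
  dataset: string; // e.g. "cluster_stats", "stats"
  docCount: number;
}

// Returns the products that have some data in the time range but lack the
// data set the UI needs to resolve them. `required` is supplied by the caller
// and maps a product to its stats data set name (illustrative values below).
function findProductsMissingStats(
  buckets: DatasetBucket[],
  required: Record<string, string>
): string[] {
  const seen: Record<string, string[]> = {};
  for (const b of buckets) {
    if (b.docCount <= 0) continue; // empty buckets do not count as data
    if (!seen[b.product]) seen[b.product] = [];
    seen[b.product].push(b.dataset);
  }
  return Object.keys(seen)
    .filter((product) => {
      const needed = required[product];
      return needed !== undefined && seen[product].indexOf(needed) === -1;
    })
    .sort();
}

// Example: Kibana has metrics but no stats documents.
const exampleRequired = { elasticsearch: "cluster_stats", kibana: "stats" };
const exampleBuckets: DatasetBucket[] = [
  { product: "elasticsearch", dataset: "cluster_stats", docCount: 12 },
  { product: "kibana", dataset: "status", docCount: 7 },
];
for (const product of findProductsMissingStats(exampleBuckets, exampleRequired)) {
  console.log(`Found ${product} metrics for this cluster but no stats`);
}
```

Keeping the check as a pure function over aggregated buckets means it could sit either inside the existing cluster-resolution flow or in a separate flow, whichever turns out to be easier to wire up.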

Possible challenges

The UI tries to resolve the Elasticsearch clusters first, so we should likely start there. This happens in a few different ways in the UI code (depending on which route you navigate to), and we'll need to decide whether to call this out on the No Data page only, or whether it also makes sense to show it when there are multiple clusters in the data but one is missing stats. As usual, the standalone cluster is something we have to manage.

Once the Elasticsearch cluster is resolved, we go to the Overview page, where we try to resolve the related components. Here we'll also need to highlight whether we found no stats at all or are missing stats for one instance only (for Kibana, for example).

To make things a bit more complicated, on the API side this is all one large endpoint that resolves these and simply returns a list of clusters with their related components. We may need or want to restructure this flow, which would make the task significantly larger, but it may be hard to add new data to that response otherwise. Another option is to run these "scout" queries as a totally isolated flow, but that might make it harder to decide where to plug them in.

Related: https://github.com/elastic/kibana/issues/130577 (which talks about the reverse, supporting the case where any other metricset is missing)

Acceptance criteria

TBD

elasticmachine commented 1 year ago

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

klacabane commented 1 year ago

I wonder if the health endpoint could help, as it returns all monitoring data found, down to the entities/datasets (and other niceties like collection errors). The endpoint mainly targets support, but we could go a step further and expose it to users. I'm thinking the UI could consume this API to surface potential issues (e.g. no cluster_stats documents found).

miltonhultgren commented 1 year ago

Yes, I had thought of that, but I wasn't certain whether we wanted to hook the "support" tool into the UI. I think APM is doing this for their new health endpoints, though. It would be a smaller scope to work with.