elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Infrastructure UI] Add metric trends (KPIs) to filtered Hosts list #143535

Closed miltonhultgren closed 1 year ago

miltonhultgren commented 1 year ago

✔️ Task List

Open Questions = Needs facilitation by PM
Current Tasks = Can be owned individually (reach out to whoever for feedback/clarification)

Open Questions

Current Tasks

📖 Description

In order to show the user a summary of information about the search result for hosts, we would like to show some summary metrics across the top of the page, below the search bar.

⚙️ Implementation details

The look, spacing, and layout should be identical to what's specified in the screenshots below:

Note: Important notes/limitations to current scoped design

The current scoped design should not include any concept of a limited query size. We hope to add this within the scope of 8.7, but at the time of writing this functionality is still in refinement. You can:

  • Remove any mention of '(of X hosts)'
  • Where the hosts tile shows X,XXX total, you can remove this
  • Where the hosts tile shows 'returned' - you can simply show the number below.

Default state - query size

image

Filtered state - number of included hosts vs. total number of hosts

image

✔️ Acceptance criteria

In the Hosts View, the following metrics and trends are displayed:

| Title | Icon | Tooltip | Metric | Format |
| --- | --- | --- | --- | --- |
| Count of hosts | node | The number of hosts returned by your current search criteria. | Total unique count of host.name considering only the time range filter | integer |
| CPU usage (normalized, average) | compute | Average of percentage of CPU time spent in states other than Idle and IOWait, normalized by the number of CPU cores. Includes both time spent on user space and kernel space. 100% means all CPUs of the host are busy. | avg(system.cpu.total.norm.pct) | percent |
| Memory usage (normalized, average) | memory | Average of percentage of main memory usage excluding page cache. This includes resident memory for all processes plus memory used by the kernel structures and code apart from the page cache. A high level indicates a situation of memory saturation for a host. 100% means the main memory is entirely filled with memory that can't be reclaimed, except by swapping out. | avg(system.memory.actual.used.bytes) / max(system.memory.total) | percent |
| Network inbound throughput (average) | sortLeft | Number of bytes which have been received per second on the public interfaces of the hosts | counter_rate(system.network.in.bytes) | bytes/s |
| Network output throughput (average) | sortLeft | Number of bytes which have been sent per second on the public interfaces of the hosts | counter_rate(system.network.out.bytes) | bytes/s |

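The Metric column can be read as small formulas. The sketch below is only an illustration of the memory and throughput rows, assuming the samples are already available as plain arrays. The interfaces and the function names `memoryUsageRatio` and `counterRatePerSecond` are hypothetical, and the counter rate here is a rough approximation (sum of positive deltas over elapsed seconds), not necessarily what `counter_rate` does internally.

```typescript
// Hypothetical sample shapes; the comments name the fields from the table above.
interface MemorySample {
  usedBytes: number; // system.memory.actual.used.bytes
  totalBytes: number; // system.memory.total
}

interface CounterSample {
  timestampMs: number; // sample time in milliseconds
  value: number; // ever-increasing counter, e.g. system.network.in.bytes
}

// avg(system.memory.actual.used.bytes) / max(system.memory.total)
function memoryUsageRatio(samples: MemorySample[]): number {
  if (samples.length === 0) return 0;
  const avgUsed = samples.reduce((sum, s) => sum + s.usedBytes, 0) / samples.length;
  const maxTotal = samples.reduce((max, s) => Math.max(max, s.totalBytes), 0);
  return maxTotal === 0 ? 0 : avgUsed / maxTotal;
}

// Approximate per-second rate of a counter. Negative deltas are treated as
// counter resets and skipped (an assumption made for this sketch).
function counterRatePerSecond(samples: CounterSample[]): number {
  if (samples.length < 2) return 0;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    if (delta > 0) increase += delta;
  }
  const elapsedSeconds =
    (samples[samples.length - 1].timestampMs - samples[0].timestampMs) / 1000;
  return elapsedSeconds > 0 ? increase / elapsedSeconds : 0;
}
```

The ratio would then be formatted as a percent and the rate as bytes/s, per the Format column.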
elasticmachine commented 1 year ago

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

miltonhultgren commented 1 year ago

@pmeresanu85 We had a team time session with Yoann from the Cloud team, and I got a chance to ask about the summary graphs he's placed on their dashboards.

They show a graph with one line per host for 3 metrics (CPU, memory and disk throughput). I asked why not show a single metric value, or a trendline for the group as an average, and he said that it wouldn't tell you very much because it's just an average of averages of averages.

So I think we should consider in our view to show the same kind of graphs, since it shows trends for the system as well as outliers which might hint at issues to drill into with filters.
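To make the "average of averages" concern concrete, here is a small illustration with made-up numbers (the host names, values, and the 0.9 threshold are all hypothetical). One saturated host barely moves the group average, while a per-host view surfaces it immediately.

```typescript
interface HostCpu {
  host: string;
  cpu: number; // CPU usage as a ratio between 0 and 1
}

// Made-up data: three quiet hosts and one that is nearly saturated.
const sampleHosts: HostCpu[] = [
  { host: "host-a", cpu: 0.1 },
  { host: "host-b", cpu: 0.12 },
  { host: "host-c", cpu: 0.08 },
  { host: "host-d", cpu: 0.98 },
];

// Single KPI value for the whole group.
function groupAverage(hosts: HostCpu[]): number {
  if (hosts.length === 0) return 0;
  return hosts.reduce((sum, h) => sum + h.cpu, 0) / hosts.length;
}

// Per-host view: names of hosts above a threshold.
function findOutliers(hosts: HostCpu[], threshold: number): string[] {
  return hosts.filter((h) => h.cpu > threshold).map((h) => h.host);
}

const average = groupAverage(sampleHosts); // about 0.32, which looks healthy
const outliers = findOutliers(sampleHosts, 0.9); // only "host-d"
```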

pmeresanu85 commented 1 year ago

@miltonhultgren very good comment, let's follow up on Yoann's feedback.

The ones above are averages (KPI trends). Also, we need to be mindful that this is a single feedback point, and in my opinion we shouldn't change course based on it alone. I think we shouldn't aim to produce a perfect product in v1; an iterative approach is better.

I wonder if we talked to Yoann about our metric dashboards in context, which are the exact ones they have.

pmeresanu85 commented 1 year ago

Discussed with @miltonhultgren

  1. Listing view

Goal: SREs want to find the groups of hosts relevant to them.

  2. Analysis view

Goal: SREs want to troubleshoot / remediate infrastructure problems in the context of metrics & alerts

Note: Host Map/ Host Graph should become part of the analysis view as a minimal functional set, showing only group context & KPIs without individual host selection.

MVP : For the MVP of the feature we would only require the 2 step workflow (listing/filtering hosts & analysing host group in context of metrics and alerts). If we can fit in the Host Group object from an effort perspective this can be part of the MVP as well.

miltonhultgren commented 1 year ago

Some more thoughts: I think for right now we should build the MVP as a single page.

This means there will be some things on the page that make sense for the "find group of hosts" goal and some that make sense for the "analyze problems within a group" goal. Likewise, there will be things that are missing for each goal and that don't make sense in the context of the other goal.

We should accept that for now and address such problems when we have the time to split this into two views (with the persistence of groups and navigation flows to reach each type of page).

On the Analysis page: We likely will get to a point where the user wants to customize which metrics to show (in the graphs, KPI and table) based on the hosts in that group so we should consider how to offer such flexibility and bake it into the persistence of the group.

On the Find hosts page: We should use the filtering controls to teach users about how the filter bar works, when they select "AWS" we should put that as a query into the filter bar. We might also take a page out of the e-commerce world for this view, which is all about finding the right things. When showing options to filter on we could for example show a count of how many hosts fit that filter (AWS: 10 hosts), this might give the users some insight into what exists in their fleet.
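A rough sketch of both ideas for the Find hosts page, assuming the host list is already on the client: turning a selected filter into a KQL clause for the filter bar, and counting hosts per filter value. The `HostSummary` shape and the function names are hypothetical; in practice the counts would more likely come from an aggregation on the server.

```typescript
// Hypothetical minimal host shape for this sketch.
interface HostSummary {
  name: string;
  cloudProvider: string; // e.g. the value of cloud.provider
}

// Build a KQL clause such as: cloud.provider: "aws"
function toKqlFilter(field: string, value: string): string {
  // Escape backslashes and double quotes so the value stays one quoted phrase.
  const escaped = value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return `${field}: "${escaped}"`;
}

// Count hosts per provider, e.g. { aws: 2, gcp: 1 }, for labels like "AWS: 10 hosts".
function countHostsByProvider(hosts: HostSummary[]): { [provider: string]: number } {
  const counts: { [provider: string]: number } = {};
  for (const host of hosts) {
    counts[host.cloudProvider] = (counts[host.cloudProvider] || 0) + 1;
  }
  return counts;
}

const exampleQuery = toKqlFilter("cloud.provider", "aws");
const exampleCounts = countHostsByProvider([
  { name: "host-a", cloudProvider: "aws" },
  { name: "host-b", cloudProvider: "gcp" },
  { name: "host-c", cloudProvider: "aws" },
]);
```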

For the Host groups: Once we arrive at a point where we have a landing page which shows your different Host groups (like in APM) we could also use those tiles to quickly surface some KPIs for each group, making it easier for an analyst to know which group to check up on first.

Some meta thoughts: While this kind of design adds more pages/steps for the user, that might be an important part of what makes an experience/workflow versus what is "just" a dashboard.

pmeresanu85 commented 1 year ago

@miltonhultgren let's proceed based on these 2 comments above

miltonhultgren commented 1 year ago

We spoke a bit more about this in our team time: One idea that came out of that is that maybe we should borrow a bit more from what we do in Fleet (list of agents, agents grouped into policies).

One thing that we all acknowledge is the need for the landing page to be appealing. One idea would be to show metrics about the "fleet" of hosts instead, like how many hosts there are, and a pie breakdown of their OS type, or some other "group" type data. Rather than showing CPU trends which might be more relevant once you filter down the list.

Another thing we could put in to brighten the page up is callouts for collecting data from more hosts. If for example we see that we have 100 hosts in the system module data, but APM data reports 160 hosts, we can put a call out saying "hey, you have another 60 hosts in your system that you could instrument and put into a host group". Perhaps there are other such guiding steps we can share.

We also arrived at the question that it might be useful to compare two host groups to each other, but we're not clear on how that would look. If we have a global list with a "group by" sort of query, then we could do this in one page by reducing a group into a single item in the lists/graphs but it's not clear how that would fit into a design with two views.

neptunian commented 1 year ago

We were never able to come up with an appropriate design to handle 2-step views and concluded to go with one view. Also in part because there is some favor from the Design team that a single view is better UX. Unless we get input from the Design team otherwise or they have the bandwidth to finalize a 2 step design, I think we need to go with the single one for now. CC @formgeist

I still think these summary metrics of the "group" could be useful. Given the Host Map demo would show each individual host on the chart, this could be a "Host Group" summary metric. The Host Map demo showed the summary metrics based on the metrics being shown in that visualization, which are CPU, Memory Usage, RX, and TX. So it would probably make sense to show these metrics and not Disk Usage and Disk Latency. @pmeresanu85 Would you agree?

pmeresanu85 commented 1 year ago

We were never able to come up with an appropriate design to handle 2-step views and concluded to go with one view. Also in part because there is some favor from the Design team that a single view is better UX. Unless we get input from the Design team otherwise or they have the bandwidth to finalize a 2 step design, I think we need to go with the single one for now.

I agree with the above paragraph. Based on the guidance we got from UX, my suggestion would be to go with a single view (vs. a 2-step design).

I still think these summary metrics of the "group" could be useful, given the Host Map demo would show each individual host on the chart, this could be an "Host Group" summary metric. The Host Map demo showed the summary metrics based on the metrics being shown in that visualization which are the metrics: CPU, Memory Usage, RX, and TX. So it would probably make sense to show these metrics and not Disk Usage and Disk Latency.

Agree. Let's stick to showing summary metrics of the "group". Additionally let's stick to showing CPU, Memory usage, RX and TX

smith commented 1 year ago

Updated issue with specific metrics to be used.

It says to use the snapshot API but I'm not sure that provides the data we need here or averages across all results. If we have to calculate averages on the client I'm not sure if that will work with the results (though these might not be paginated on the server). If the API doesn't do what we want we may want to use lens visualizations instead. If this is the case let's update the description.

smith commented 1 year ago

@formgeist will update this issue once there is a clear design for this.

formgeist commented 1 year ago

@smith Not sure if intentional, but the description mentions "CPU utilization" and "Memory utilization" instead of the more common names "CPU usage" and "Memory usage". Is there a specific reason to use a different naming here? I would assume we'd try to keep them consistent with the Hosts table?

neptunian commented 1 year ago

@formgeist We should continue to use "usage" over utilization for consistency.

smith commented 1 year ago

Updated text to use "usage" instead of "utilization".

neptunian commented 1 year ago

I think we can move this out of refining. At present it can go on the single view. I understand we might be moving it into tabs later based on @formgeist designs but that shouldn't be too difficult to move.

The snapshot API doesn't support this kind of "list of hosts" summary metric. If we went with the Elastic Charts metric visualization, we'd likely create a new API (which could be reused for charts, or for a summary of groups of hosts if we ever had a landing page for that). I think we decided to go with our own API / Elastic Charts metric visualization to make things more consistent, instead of using Lens for some metrics and not others, and to have more customizable options and avoid other issues we had with the table, such as erroring out when the field doesn't exist in the Data View.

crespocarlos commented 1 year ago

Any objection to using either the Snapshot API or the Metric Explorer API for this (both internally run the same code to build the query)? They would provide us with all the metrics except hosts count out of the box:

Image

Hosts count could be a new inventory model using a cardinality aggregation
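For reference, a minimal sketch of what a cardinality-based host count could look like as an Elasticsearch search body. The function name `buildHostCountRequest` and the aggregation name `hostCount` are hypothetical, and this is not the inventory model itself, just the shape of the query it might wrap. Note that the cardinality aggregation returns an approximate count.

```typescript
// Build a search body that counts unique host.name values in a time range.
// Only the time range filter is applied, matching the "Count of hosts" tile.
function buildHostCountRequest(fromIso: string, toIso: string) {
  return {
    size: 0, // no documents needed, only the aggregation result
    query: {
      bool: {
        filter: [{ range: { "@timestamp": { gte: fromIso, lte: toIso } } }],
      },
    },
    aggs: {
      hostCount: {
        cardinality: { field: "host.name" },
      },
    },
  };
}

const hostCountBody = buildHostCountRequest("now-15m", "now");
```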

neptunian commented 1 year ago

Yea I was thinking the same. Sounds good to me!

roshan-elastic commented 1 year ago

Update

Great interview with Lucas Moore (one of our SREs). He really liked these charts so this is good news!

Search 'KPI Charts'

roshan-elastic commented 1 year ago

@crespocarlos and @roshan-elastic spoke about query sizes. Given we want to prioritise performance/responsiveness/response times, we discussed some suggestions:

Suggestions

KPI Tiles

Table

Other

Next Steps

roshan-elastic commented 1 year ago

@roshan-elastic @neptunian @formgeist meeting notes:

Decisions

I'm making some decisions on requirements now (for the sake of time) but feel free to discuss/challenge these. I want to make sure I'm not asking for crazy :)

When we return the default result view:

Either result view or one which is filtered/queried:

Open Questions

Thought on the above : I can imagine that in future iterations, we may change the way the KPI tiles and metrics charts work so that they can show KPIs for the full population rather than the smaller result set we're returning (for a user, it's just weird that when you change the sorting of the table it would affect the KPI tiles - given the constraints we have now though I can't see a way around it). That way, sorting the table wouldn't affect the KPI tiles (and we wouldn't need some weird method of 'unsorting' the table so that we avoid skewing in the KPI tiles).

Action Points

roshan-elastic commented 1 year ago

Update from last meeting (@formgeist @crespocarlos + @roshan-elastic )

Next Design Requirements

1. KPI Tiles - Hosts - X out of X
We need some way of showing both the number of hosts returned in the segment and the total number of hosts in the query (so the user understands that we're showing data for a subset, e.g. 100, of the total population, e.g. 462).

2. KPI Tiles - Indicate that they are affected by the selected host limit (e.g. 100, 50, 10, whatever is selected)
We need to indicate this so the user understands the tiles aren't showing data for the whole population.

3. KPI Tiles - Affected by sorting
We need to make it clear and intuitive that the KPI tiles are affected by the user's sorting.

4. Sampling - Default sorting
By default, we can just return the top X ranked by timestamp of last data sent (this isn't perfect sampling, but it's good enough to start with and we can always iterate if we find users want to improve on this).
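One possible shape for this default sampling, sketched as an Elasticsearch search body: a terms aggregation on `host.name`, limited to X buckets and ordered by the most recent `@timestamp` per host. The function name `buildTopHostsRequest` and the aggregation names `hosts` and `lastSeen` are hypothetical, and ordering a terms aggregation by a sub-aggregation can be approximate on multi-shard indices.

```typescript
// Return the `limit` hosts that most recently sent data.
function buildTopHostsRequest(limit: number) {
  return {
    size: 0,
    aggs: {
      hosts: {
        terms: {
          field: "host.name",
          size: limit, // the "top X"
          order: { lastSeen: "desc" }, // newest data first
        },
        aggs: {
          // Timestamp of the last document seen for each host.
          lastSeen: { max: { field: "@timestamp" } },
        },
      },
    },
  };
}

const topHostsBody = buildTopHostsRequest(50);
```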

5. Sampling - Return to default sorting
If the user sorts by a metric/column, we need to make sure the user can return to the default sorting.

6. Query size controls
We need some way of allowing the user to set the number of results coming back (suggest we start with a default of 50, but we can hopefully change this if we find this is not ideal).

7. Long query loading prompt
We need some way to indicate that a query is taking a long time and perhaps suggest ways the user can speed up the query (e.g. reduce the time range or number of hosts returned). It would be awesome to tell a user in advance whether a query is going to be feasible or not (before they make it), but I'm not sure how technically feasible this is!

We did discuss looking at whether we can allow the user to 'cancel' the query to stop creating a queue of requests on the server but I don't think we need to prioritise this right now unless it's simple

Other Notes / Open Points

8. Click through on KPIs
We discussed that for now we can disable this, but when we work on the metrics tab, we likely want users to click on a KPI and be led to more of a 'drill-down'/'deeper' analysis of the data (most likely, coming through into the 'metrics' tab with the relevant data).

9. Limit time range
We need to investigate whether we can add some limits to the date range (I know we didn't want to limit potentially useful queries from a user, but after seeing that we could set the time range to several years, we probably want to limit this period somewhat if we can).
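If a limit is added, one simple option is to clamp the selected range on the client before querying. The sketch below keeps the end of the range and moves the start forward; the function name `clampTimeRange` and the 30-day limit are hypothetical placeholders, not a decided value.

```typescript
interface TimeRange {
  fromMs: number; // start of the range, epoch milliseconds
  toMs: number; // end of the range, epoch milliseconds
}

// Placeholder limit for illustration only.
const MAX_SPAN_MS = 30 * 24 * 60 * 60 * 1000; // 30 days

// Keep the end of the range and pull the start forward if the span is too long.
function clampTimeRange(range: TimeRange, maxSpanMs: number): TimeRange {
  const span = range.toMs - range.fromMs;
  if (span <= maxSpanMs) return range;
  return { fromMs: range.toMs - maxSpanMs, toMs: range.toMs };
}
```

Whether to clamp silently or to warn the user instead is a design question this sketch does not answer.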

10. Losing the displayed hosts when you change the time range
We realised that if you change the time range, you could potentially completely lose the hosts currently showing in the table/charts. We don't believe this is intuitive for the user, but it seems like a difficult problem to solve, so most likely we don't need to worry about this right now. I think this is something for post 8.7.

11. Responsive/mobile view
We saw that the table looks a bit crazy at the moment if you view it on a phone. Checking the telemetry in Fullstory, only 0.6% of inventory/metrics explorer/hosts cloud users are on mobile or tablet (about 2% of all Elastic cloud users), so we probably don't need to worry about this any time soon.

Action Points

crespocarlos commented 1 year ago

To narrow the scope of this ticket, I suggest we focus on:

Regarding the item below, the page loads the KPIs based on the default 15 minutes that is set in the unified search as soon as the page loads. Should we change this behaviour?

The items below could be in a separate ticket as they aim to address a different problem

We'll need to change the query to run terms aggregation instead of composite aggregations

Since the KPIs run the same base query, this should work without any additional change

It's currently possible to sort the table. The main change here is to sort the data on ES side as a direct impact of limiting the number of hosts returned by the API

This needs further investigation

roshan-elastic commented 1 year ago

Hey @crespocarlos, looking at these ones - I wasn't sure whether it would be best for you to administer this (e.g. create issues/tasks which make sense to you) or whether I should create them?

Let me know what you prefer!

BTW I've added in the top comment, a list of items for the current ticket. Feel free to add items to this as you need:

Link to tasks

The items below could be in a separate ticket as they aim to address a different problem

  • Default result set to 50 but allow controls to reduce/increase to any size - This will control the number of results used in the KPI tiles and the table itself

We'll need to change the query to run terms aggregation instead of composite aggregations

  • KPI Tile population indication - Make it clear to the user that when you change the result set (e.g. to 1,000), it affects the KPI tile numbers shown. Additionally, sorting will have the same effect so make this clear.

Since the KPIs run the same base query, this should work without any additional change

  • Allow sorting - Allow the user to sort the host table by any of the metrics

It's currently possible to sort the table. The main change here is to sort the data on ES side as a direct impact of limiting the number of hosts returned by the API

  • Ensure result set is not skewed by sorting method - I don't know if the API allows you to return the result by top X without it skewing. For example, if you returned the result set ranked by 'CPU util' then that would cherry pick the busiest CPUs and not be a good way of returning a result set you wanted to run averages on.

This needs further investigation
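A small illustration of the skew concern, with made-up CPU values: averaging only the top X hosts ranked by CPU gives a very different KPI than averaging the whole population. The function names and numbers are hypothetical.

```typescript
// Made-up CPU usage ratios for a population of six hosts.
const populationCpu: number[] = [0.9, 0.8, 0.2, 0.1, 0.1, 0.3];

function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Average over the top `limit` values, as if the result set were ranked by CPU.
function meanOfTopByValue(values: number[], limit: number): number {
  const top = values
    .slice() // copy so the input is not reordered
    .sort((a, b) => b - a)
    .slice(0, limit);
  return mean(top);
}

const populationMean = mean(populationCpu); // about 0.40
const topTwoMean = meanOfTopByValue(populationCpu, 2); // about 0.85
```

With these numbers the tile would show roughly 85% instead of 40%, which is why the ranking used to pick the result set matters.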

crespocarlos commented 1 year ago

I wasn't sure whether it would be best for you to administer this (e.g. create issues/tasks which make sense to you) or whether I should create them?

I can create them.