Closed: miltonhultgren closed this issue 1 year ago
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)
@pmeresanu85 We had a team time session with Yoann from the Cloud team, and I got a chance to ask about the summary graphs he's placed on their dashboards.
They show a graph with a line per host for 3 metrics (CPU, memory, and disk throughput). I asked why not show a single metric value or a trendline for the group as an average, and he said that it wouldn't tell you very much because it's just an average of averages of averages.
So I think we should consider showing the same kind of graphs in our view, since they show trends for the system as well as outliers that might hint at issues to drill into with filters.
@miltonhultgren very good comment, let's follow up on Yoann's feedback.
The ones above are averages (KPI trends). Also, we need to be mindful that this is 1 feedback point; we shouldn't change course based on a single feedback point, in my opinion. I think we shouldn't be aiming to produce a perfect product in v1; better to take an iterative approach.
I wonder if we talked to Yoann about our metric dashboards in context, which are the exact ones they have.
Discussed with @miltonhultgren
Goal: SREs want to find the groups of hosts relevant to them.
Goal: SREs want to troubleshoot/remediate infrastructure problems in the context of metrics & alerts.
Note: Host Map/ Host Graph should become part of the analysis view as a minimal functional set, showing only group context & KPIs without individual host selection.
MVP: For the MVP of the feature we would only require the 2-step workflow (listing/filtering hosts & analysing a host group in the context of metrics and alerts). If we can fit in the Host Group object from an effort perspective, this can be part of the MVP as well.
Some more thoughts: I think for right now we should build the MVP as a single page.
This means there will be some things on the page that make sense for the "find group of hosts" goal and some that make sense for the "analyze problems within a group" goal. Likewise, there will be things that are missing for each goal and that don't make sense in the context of the other goal.
We should accept that for now and address such problems when we have the time to split this into two views (with the persistence of groups and navigation flows to reach each type of page).
On the Analysis page: We will likely get to a point where the user wants to customize which metrics to show (in the graphs, KPIs, and table) based on the hosts in that group, so we should consider how to offer such flexibility and bake it into the persistence of the group.
On the Find hosts page: We should use the filtering controls to teach users how the filter bar works: when they select "AWS", we should put that as a query into the filter bar. We might also take a page out of the e-commerce world for this view, which is all about finding the right things. When showing options to filter on, we could for example show a count of how many hosts fit that filter ("AWS: 10 hosts"); this might give users some insight into what exists in their fleet.
For the Host groups: Once we arrive at a point where we have a landing page which shows your different Host groups (like in APM) we could also use those tiles to quickly surface some KPIs for each group, making it easier for an analyst to know which group to check up on first.
Some meta thoughts: While this kind of design adds more pages/steps for the user, that might be an important part of what makes an experience/workflow versus what is "just" a dashboard.
@miltonhultgren let's proceed based on these 2 comments above
We spoke a bit more about this in our team time. One idea that came out of that is that maybe we should borrow a bit more from what we do in Fleet (a list of agents, agents grouped into policies).
One thing that we all acknowledge is the need for the landing page to be appealing. One idea would be to show metrics about the "fleet" of hosts instead, like how many hosts there are and a pie breakdown of their OS type, or some other "group" type data, rather than showing CPU trends, which might be more relevant once you filter down the list.
Another thing we could put in to brighten the page up is callouts for collecting data from more hosts. If, for example, we see that we have 100 hosts in the system module data but APM data reports 160 hosts, we can put a callout saying "hey, you have another 60 hosts in your system that you could instrument and put into a host group". Perhaps there are other such guiding steps we can share.
We also arrived at the question of whether it might be useful to compare two host groups to each other, but we're not clear on how that would look. If we have a global list with a "group by" sort of query, we could do this in one page by reducing a group to a single item in the lists/graphs, but it's not clear how that would fit into a design with two views.
We were never able to come up with an appropriate design to handle the 2-step views and concluded to go with one view, in part because the Design team somewhat favors a single view as better UX. Unless we get input from the Design team otherwise, or they have the bandwidth to finalize a 2-step design, I think we need to go with the single view for now. CC @formgeist
I still think these summary metrics of the "group" could be useful: given that the Host Map demo would show each individual host on the chart, this could be a "Host Group" summary metric. The Host Map demo showed the summary metrics based on the metrics shown in that visualization: CPU, Memory Usage, RX, and TX. So it would probably make sense to show those metrics and not Disk Usage and Disk Latency. @pmeresanu85 Would you agree?
We were never able to come up with an appropriate design to handle the 2-step views and concluded to go with one view, in part because the Design team somewhat favors a single view as better UX. Unless we get input from the Design team otherwise, or they have the bandwidth to finalize a 2-step design, I think we need to go with the single view for now.
I agree with the above paragraph; based on the guidance we got from UX, my suggestion would be to go with a single view (vs. a 2-step design).
I still think these summary metrics of the "group" could be useful: given that the Host Map demo would show each individual host on the chart, this could be a "Host Group" summary metric. The Host Map demo showed the summary metrics based on the metrics shown in that visualization: CPU, Memory Usage, RX, and TX. So it would probably make sense to show those metrics and not Disk Usage and Disk Latency.
Agree. Let's stick to showing summary metrics of the "group". Additionally let's stick to showing CPU, Memory usage, RX and TX
Updated issue with specific metrics to be used.
It says to use the snapshot API, but I'm not sure that provides the data we need here, or whether it averages across all results. If we have to calculate averages on the client, I'm not sure that will work with the results (though these might not be paginated on the server). If the API doesn't do what we want, we may want to use Lens visualizations instead. If that's the case, let's update the description.
@formgeist will update this issue once there is a clear design for this.
@smith Not sure if intentional, but the description mentions "CPU utilization" and "Memory utilization" instead of the more common names "CPU usage" and "Memory usage" - is there a specific reason to use different naming here? I would assume we'd try to keep them consistent with the Hosts table?
@formgeist We should continue to use "usage" over utilization for consistency.
Updated text to use "usage" instead of "utilization".
I think we can move this out of refining. At present it can go on the single view. I understand we might be moving it into tabs later based on @formgeist designs but that shouldn't be too difficult to move.
The snapshot API doesn't support this kind of "list of hosts" summary metrics. We'd likely create a new API (which could be reused for charts, or for a summary of groups of hosts if we ever had a landing page for that) if we went with the Elastic Charts metric visualization. I think we decided to go with our own API / Elastic Charts metric visualization to make things more consistent, instead of using Lens for some metrics and not others, to get more customizable options, and to avoid issues we had with the table, such as erroring out when the field doesn't exist in the Data View.
Any objection to using either the Snapshot API or the Metrics Explorer API for this (both internally run the same code to build the query)? They would provide us with all the metrics except hosts count out of the box:
Hosts count could be a new inventory model using a cardinality aggregation
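As a rough sketch of that cardinality idea (field names follow the system module conventions; the builder function is hypothetical, not the actual inventory model code):

```typescript
// Hypothetical sketch of a hosts-count request using a cardinality
// aggregation on `host.name`, scoped only by the time range filter.
function buildHostCountRequest(from: string, to: string) {
  return {
    size: 0, // aggregation-only request, no documents needed
    query: {
      bool: {
        filter: [{ range: { '@timestamp': { gte: from, lte: to } } }],
      },
    },
    aggs: {
      // cardinality returns an approximate distinct count, which is
      // good enough for a KPI tile
      hosts_count: { cardinality: { field: 'host.name' } },
    },
  };
}
```

Because it is independent of the metric aggregations, this count could load separately from the rest of the tiles/table.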
Yea I was thinking the same. Sounds good to me!
Great interview with Lucas Moore (one of our SREs). He really liked these charts so this is good news!
@crespocarlos and @roshan-elastic spoke about query sizes. Given we want to prioritise performance/responsiveness/response times, we discussed some suggestions:
KPI Tiles
Table
Other
@roshan-elastic @neptunian @formgeist meeting notes:
I'm making some decisions on requirements now (for the sake of time) but feel free to discuss/challenge these. I want to make sure I'm not asking for crazy :)
When we return the default result view:
Either result view or one which is filtered/queried:
KPI Tile population indication - Make it clear to the user that when you change the result set (e.g. to 1,000), it affects the KPI tile numbers shown. Additionally, sorting will have the same effect so make this clear.
Allow sorting - Allow the user to sort the host table by any of the metrics
Ensure result set is not skewed by sorting method - I don't know if the API allows you to return the result by top X without it skewing. For example, if you returned the result set ranked by 'CPU util' then that would cherry pick the busiest CPUs and not be a good way of returning a result set you wanted to run averages on.
Thought on the above : I can imagine that in future iterations, we may change the way the KPI tiles and metrics charts work so that they can show KPIs for the full population rather than the smaller result set we're returning (for a user, it's just weird that when you change the sorting of the table it would affect the KPI tiles - given the constraints we have now though I can't see a way around it). That way, sorting the table wouldn't affect the KPI tiles (and we wouldn't need some weird method of 'unsorting' the table so that we avoid skewing in the KPI tiles).
Update from last meeting (@formgeist @crespocarlos + @roshan-elastic )
1. KPI Tiles - Hosts - X out of X: We need some way of showing both the number of hosts returned in the segment and the total number of hosts in the query (so the user understands that we're showing data for a subset, e.g. 100, of the total population, e.g. 462)
2. KPI Tiles - Indicate that they are affected by the selected host limit (e.g. 100, 50, 10 - whatever is selected): We need to indicate this so the user understands the tiles aren't showing data for the whole population
3. KPI tiles - affected by sorting We need to make it intuitive or clear that the KPI tiles are affected by a user sorting
4. Sampling - Default sorting By default, we can just return the top X ranked by timestamp of last data sent (this isn't perfect sampling but it's good enough to start with and we can always iterate if we find users want to improve on this)
5. Sampling - return to default sorting If the user sorts by a metric/column, we need to make sure the user can return to the default sorting.
6. Query size controls We need some way of allowing the user to set the number of results coming back (suggest we start with a default of 50 but we can hopefully change this if we find this is not ideal).
7. Long query loading prompt We need some way to indicate that a query is taking a long time and perhaps suggest ways the user can speed up the query (e.g. reduce the time range or number of hosts returned). It would be awesome to suggest to a user in advance whether a query is going to be feasible or not (before they make it) but I'm not sure how technically feasible this is!
We did discuss looking at whether we can allow the user to 'cancel' the query to stop creating a queue of requests on the server but I don't think we need to prioritise this right now unless it's simple
8. Click through on KPIs: We discussed that for now we can disable this, but when we work on the metrics tab we likely want users to click on a KPI and be led to more of a 'drill-down'/'deeper' analysis of the data (most likely coming through into the 'metrics' tab with the relevant data)
9. Limit time range We need to investigate whether we can add some limits to the date range (I know we didn't want to limit potentially useful queries from a user but after seeing that we could set the time-range to several years...we probably want to limit this period somewhat if we can...)
10. Losing the displayed hosts when you change the time range: We realised that if you change the time range, you could potentially completely lose the hosts currently showing in the table/charts. We don't believe this is naturally intuitive for the user, but it seems like a difficult problem to solve right now, so most likely we don't need to worry about it for now - I think this is something for post-8.7
11. Responsive/mobile view We saw that the table looks a bit crazy at the moment if you view on a phone. Checking the telemetry in Fullstory - only 0.6% of inventory/metrics explorer/hosts cloud users are on mobile or tablet (about 2% of all Elastic cloud users) so we probably don't need to worry about this any time soon.
Always return total number of hosts in hosts KPI tile (independently loading) - We should show the total number of hosts (without a trend line) as an indication of how many hosts we're tracking (I'm assuming it's not too expensive to return this independently of the rest of the tiles/table). It's also useful as a way to train the user if the query they're running is massive.
Buckets per trend line: We should agree on some sort of limit to the granularity of the trend lines (e.g. daily for anything larger than 14 days, hourly for a day, 60 for a minute, etc.).
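As an illustration of that kind of cap (the thresholds below are placeholders for discussion, not agreed values), the bucket interval could be derived from the range length:

```typescript
// Illustrative only: derive a trend-line bucket interval from the
// time-range length so the number of buckets stays bounded.
// All thresholds and return values here are assumptions.
function trendLineInterval(rangeMs: number): string {
  const HOUR = 3_600_000;
  const DAY = 24 * HOUR;
  if (rangeMs >= 14 * DAY) return '1d'; // daily for anything >= 14 days
  if (rangeMs >= DAY) return '1h';      // hourly for a day or more
  if (rangeMs >= HOUR) return '1m';     // per-minute within a day
  return '10s';                         // finer granularity for short ranges
}
```

The returned string could feed straight into a `date_histogram`'s `fixed_interval`.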
Regarding the item below, the page loads the KPIs based on the default 15 minutes set in the unified search as soon as the page loads. Should we change this behaviour?
Other KPI tiles - Indicate to the user that these will not populate until you add some kind of a filter (they're useless anyway unless you do a filter according to our SREs).
Default result set to 50 but allow controls to reduce/increase to any size - This will control the number of results used in the KPI tiles and the table itself
We'll need to change the query to run a `terms` aggregation instead of `composite` aggregations
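To make the contrast concrete, here is a sketch (field names and sizes are assumptions, not the actual query code): a `composite` aggregation has to page through every host, while a `terms` aggregation can return just the top N buckets in one request, e.g. ordered by the most recent data point per host, as discussed for default sampling.

```typescript
// Current shape (sketch): composite paginates over all hosts via after_key.
const compositeAggs = {
  hosts: {
    composite: {
      size: 1000, // still needs follow-up requests until after_key is exhausted
      sources: [{ name: { terms: { field: 'host.name' } } }],
    },
  },
};

// Proposed shape (sketch): terms returns only the top N hosts, here ordered
// by the most recent @timestamp per host (the suggested default sampling).
const termsAggs = {
  hosts: {
    terms: {
      field: 'host.name',
      size: 50, // the user-selected host limit
      order: { last_seen: 'desc' },
    },
    aggs: {
      last_seen: { max: { field: '@timestamp' } },
    },
  },
};
```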
KPI Tile population indication - Make it clear to the user that when you change the result set (e.g. to 1,000), it affects the KPI tile numbers shown. Additionally, sorting will have the same effect so make this clear.
Since the KPIs run the same base query, this should work without any additional change
Allow sorting - Allow the user to sort the host table by any of the metrics
It's currently possible to sort the table. The main change here is to sort the data on ES side as a direct impact of limiting the number of hosts returned by the API
Ensure result set is not skewed by sorting method - I don't know if the API allows you to return the result by top X without it skewing. For example, if you returned the result set ranked by 'CPU util' then that would cherry pick the busiest CPUs and not be a good way of returning a result set you wanted to run averages on.
This needs further investigation
Hey @crespocarlos, looking at these ones - I wasn't sure whether it would be best for you to administer this (e.g. create issues/tasks which make sense to you) or whether I should create them?
Let me know what you prefer!
BTW I've added in the top comment, a list of items for the current ticket. Feel free to add items to this as you need:
The items below could be in a separate ticket as they aim to address a different problem
Default result set to 50 but allow controls to reduce/increase to any size - This will control the number of results used in the KPI tiles and the table itself
We'll need to change the query to run a `terms` aggregation instead of `composite` aggregations
KPI Tile population indication - Make it clear to the user that when you change the result set (e.g. to 1,000), it affects the KPI tile numbers shown. Additionally, sorting will have the same effect so make this clear.
Since the KPIs run the same base query, this should work without any additional change
Allow sorting - Allow the user to sort the host table by any of the metrics
It's currently possible to sort the table. The main change here is to sort the data on ES side as a direct impact of limiting the number of hosts returned by the API
Ensure result set is not skewed by sorting method - I don't know if the API allows you to return the result by top X without it skewing. For example, if you returned the result set ranked by 'CPU util' then that would cherry pick the busiest CPUs and not be a good way of returning a result set you wanted to run averages on.
This needs further investigation
I wasn't sure whether it would be best for you to administer this (e.g. create issues/tasks which make sense to you) or whether I should create them?
I can create them.
Task List
Open Questions = Needs facilitation by PM
Current Tasks = Can be owned individually (reach out to whoever for feedback/clarification)
Open Questions
Current Tasks
Description
In order to show the user a summary of information about the search result for hosts, we would like to show some summary metrics across the top of the page, below the search bar.
Implementation details
The look, spacing, and layout should be identical to what's specified in the screenshots below:
Default state - query size
Filtered state - number of included hosts vs. total number of hosts
Acceptance criteria
In the Hosts View, the following metrics and trends are displayed:
- Hosts: unique count of `host.name` (considering only the time range filter)
- CPU usage: `avg(system.cpu.total.norm.pct)`
- Memory usage: `avg(system.memory.actual.used.bytes) / max(system.memory.total)`
- Network inbound (RX): `counter_rate(system.network.in.bytes)`
- Network outbound (TX): `counter_rate(system.network.out.bytes)`
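To make the acceptance-criteria formulas concrete, here is an illustrative computation in plain TypeScript (not Kibana code; the counter-reset handling mirrors what a `counter_rate`-style function does conceptually, not Lens's exact implementation):

```typescript
// Memory usage: avg(system.memory.actual.used.bytes) / max(system.memory.total)
function memoryUsagePct(avgUsedBytes: number, maxTotalBytes: number): number {
  return avgUsedBytes / maxTotalBytes;
}

// counter_rate over a monotonically increasing counter such as
// system.network.in.bytes: per-second rate between two samples,
// treating a decrease as a counter reset (count from zero again).
function counterRate(prevValue: number, currValue: number, intervalSeconds: number): number {
  const delta = currValue >= prevValue ? currValue - prevValue : currValue;
  return delta / intervalSeconds;
}

// e.g. memoryUsagePct(4e9, 16e9) -> 0.25 (25% memory usage)
// e.g. counterRate(1_000, 4_000, 60) -> 50 bytes/second
```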