jasonrhodes commented 3 years ago

Summary

A React component exists that allows a caller to request a given "node type" along with a time range and a KQL filter and receive a table containing a list of matching entities with associated (pre-defined per node type) metrics.

Example: something similar to this table (only what's below the graphs shown here, for now)

Tickets

MVP component for use in APM

Follow-up issues for Infra Monitoring UI

Possible later stage considerations:

elasticmachine commented 3 years ago

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

miltonhultgren commented 3 years ago

Do you imagine that this component will self manage the data fetching? (which should be fine, given this is not to be shared cross domain and is likely simpler/less configurable than t-grid)

weltenwort commented 3 years ago

Do you imagine that this component will self manage the data fetching? (which should be fine, given this is not to be shared cross domain and is likely simpler/less configurable than t-grid)

It might still pay off to write the table itself as a component that only takes data, loading states and loading callbacks and keep the data fetching separate in one or more hooks. This doesn't mean we can't also provide an all-in-one component for simple use, but makes the parts more extensible.
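A minimal sketch of that split, written as plain TypeScript without React so it stays self-contained. All names here (`NodeRow`, `MetricsTableProps`, `loadTableProps`, `renderTable`) are hypothetical and not taken from the Kibana codebase; `loadTableProps` stands in for a data-fetching hook and `renderTable` for the presentational table.

```typescript
// Hypothetical row shape; the real metrics per node type were still undecided.
interface NodeRow {
  id: string;
  name: string;
  metrics: Record<string, number | null>;
}

// Everything the presentational table needs. It never fetches on its own.
interface MetricsTableProps {
  rows: NodeRow[];
  isLoading: boolean;
  error?: string;
  onReload: () => void;
}

// Stand-in for the data-fetching hook: wraps any fetcher and reports each
// state change, so the table stays unaware of where the data comes from.
async function loadTableProps(
  fetchRows: () => Promise<NodeRow[]>,
  onChange: (props: MetricsTableProps) => void
): Promise<void> {
  const reload = () => {
    void loadTableProps(fetchRows, onChange);
  };
  onChange({ rows: [], isLoading: true, onReload: reload });
  try {
    const rows = await fetchRows();
    onChange({ rows, isLoading: false, onReload: reload });
  } catch (e) {
    onChange({ rows: [], isLoading: false, error: String(e), onReload: reload });
  }
}

// Stand-in for the table component: a pure function of its props.
function renderTable(props: MetricsTableProps): string {
  if (props.isLoading) return 'Loading...';
  if (props.error) return `Error: ${props.error}`;
  return props.rows.map((row) => row.name).join('\n');
}
```

An all-in-one component would then just wire `loadTableProps` (or the equivalent hook) to `renderTable`, while other consumers could supply their own data source.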

jasonrhodes commented 3 years ago

+1 on what @weltenwort describes -- if that warrants two tickets, please split them!

weltenwort commented 3 years ago

I see three parts to this:

- design and implement the components so they're the right style and size for embedding
- design and implement the APIs for fetching data suitable for being called from any part in Kibana (which means it's robust in regard to permissions, race conditions, and failure states)
- find a good place for the components and APIs so they can be imported by as many parts of Kibana as possible without circular dependencies

One way of breaking it down and parallelizing it could be to start solving the dependency problem and create mockups for the UI components in parallel. And once that is done go on to the implementation steps of both the components and the APIs.

weltenwort commented 3 years ago

I've created #117344 to track the circular dependency avoidance investigation. It's mainly a placeholder for now until I manage to write down some specifics.

Do we want to make the remainder of this smaller by narrowing it to the "metrics-in-apm" case for now? And can we clarify some details like

Where does the data come from? Is the user expected to ingest it in parallel into metrics-* indices or will the APM server put it somewhere into special APM indices?
Which node types do we want to support initially and eventually?
What metrics exactly should these columns display?
How to handle the case when only some metrics are available?
Are the node ids clickable? What happens if the user clicks one of them?

mukeshelastic commented 3 years ago

@formgeist @katefarrar @alex-fedotyev @danielkhan Very thoughtful questions about the scope of what we need to accomplish in this.

I will take a stab at responding to these questions here and y'all can review and provide input on the suggestions.

mukeshelastic commented 3 years ago

I've created #117344 to track the circular dependency avoidance investigation. It's mainly a placeholder for now until I manage to write down some specifics.

Do we want to make the remainder of this smaller by narrowing it to the "metrics-in-apm" case for now? And can we clarify some details like
* Where does the data come from? Is the user expected to ingest it in parallel into `metrics-*` indices or will the APM server put it somewhere into special APM indices?

This is ingested in metrics-* datastreams via agent and not by APM server.

* Which node types do we want to support initially and eventually?
pardon my ignorance, but what is a node type in this context? If by node_type you meant, VM host, K8s pod or docker/CRI container then it is all of them, right from the beginning. If node_type is something else, then please let me know what that means.
* What metrics exactly should these columns display?

My first instinct is that we can begin with just embedding the tabular view of hosts, pods, containers as is.. that is, APM will pass the list of hosts, containers and pods to the inventory view embeddable and the view will be filtered for those values in the APM->infrastructure tabs respectively. Once we get past that then we can add additional metrics.

* How to handle the case when only some metrics are available?

* Are the node ids clickable? What happens if the user clicks one of them?
The embeddable should bring the entire experience of inventory view to APM>infrastructure tab.. that is, we show the list of hosts and then users can click on one and it will show the enhanced host details..

Additional questions that are worth looking into are:

Are the filters exposed to users? If they are, what value would they add? if they aren't, what are users going to miss?
Are the metrics graphs shown in the mocks useful as is, i.e. avg, max, min for CPU and memory? Or should we instead show these metrics for each entity to identify the anomalous host/container/pod?
I am unclear why number of active instances over time is important.. I am probably missing some troubleshooting scenario that needs that info.. So it would be good for APM folks to help us understand that better.

weltenwort commented 3 years ago

Thanks for clarifying some of the points.

If by node_type you meant, VM host, K8s pod or docker/CRI container then it is all of them, right from the beginning.

yes, it's host, container or pod

My first instinct is that we can begin with just embedding the tabular view of hosts, pods, containers as is.. that is, APM will pass the list of hosts, containers and pods to the inventory view embeddable and the view will be filtered for those values in the APM->infrastructure tabs respectively.

so it would only contain the name column and no metrics initially?

The embeddable should bring the entire experience of inventory view to APM>infrastructure tab.. that is, we show the list of hosts and then users can click on one and it will show the enhanced host details..

That's good to keep in mind as a goal, but can you imagine an acceptable smaller step that we can take first? It would help if we could come up with a sequence of additions that can be tackled incrementally.

mukeshelastic commented 3 years ago

so it would only contain the name column and no metrics initially?

That is my take but I'd like to align with Alex, Casper, Daniel and Kate before we make a final call on whether the MVP should contain it or not. Alex and Casper have done a lot of prior thinking on it, so I'd like to make sure we make the right UX call here in alignment with APM.

That's good to keep in mind as a goal, but can you imagine an acceptable smaller step that we can take first? It would help if we could come up with a sequence of additions that can be tackled incrementally.

Definitely worth exploring the incremental and yet acceptable steps we could take to ship something sooner. I am operating under the assumption that the enhanced host details flyout would just work independent of which Kibana UI it is used in. If that isn't the case then we'd need a smaller incremental step, like showing the tabular inventory view but linking to the inventory page to see enhanced host details. Not a great UX but a step in the right direction.

Happy to hear if you have thoughts on additional incremental steps we could take.

weltenwort commented 3 years ago

I am operating under the assumption that the enhanced host details flyout would just work independent of which Kibana UI it is used in.

It could certainly be made to work, but not with zero effort. Here are some variations I could think of that have different complexities:

There is no interaction with a row. This is likely the least effort.
Clicking the row navigates to the node detail page on the metrics UI.
Clicking the row shows a popover of some sort. For this we might have to specify additional requirements, e.g. which time range does it show? Does it auto-update? How is it rendered so it doesn't cover anything important in the underlying page?

I'm not too familiar with the metrics UI code, so there might be additional options.

weltenwort commented 3 years ago

:memo: Clarifications extracted from conversation on 2021-11-08

The component uses the space's default source configuration (which means the user is responsible for setting up the metrics ui to show the ingested metrics).
The component takes as props:
- the time range
- the node type
- the ids of the nodes
The component shows an empty state with guidance when there are no metric indices (like the metrics ui).
The component shows a different empty state with guidance when there are metric indices, but no data in the selected time range.
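The clarifications above could translate into a props type and an empty-state decision roughly like the following. This is a sketch under assumptions: the names `InfraMetricsTableProps`, `TableState` and `resolveTableState` are hypothetical, and the `from`/`to` string shape of the time range is a guess rather than something settled in the conversation.

```typescript
type NodeType = 'host' | 'pod' | 'container';

// Props as listed in the clarifications: time range, node type, node ids.
interface InfraMetricsTableProps {
  timeRange: { from: string; to: string }; // assumed shape, e.g. 'now-15m' / 'now'
  nodeType: NodeType;
  nodeIds: string[];
}

// The two empty states are distinct because the guidance shown differs.
type TableState = 'no-indices' | 'no-data-in-range' | 'has-data';

function resolveTableState(hasMetricIndices: boolean, rowCount: number): TableState {
  if (!hasMetricIndices) return 'no-indices'; // guide the user to set up metrics
  if (rowCount === 0) return 'no-data-in-range'; // guide the user to widen the range
  return 'has-data';
}

// Example props an APM tab might pass (values are made up).
const exampleProps: InfraMetricsTableProps = {
  timeRange: { from: 'now-15m', to: 'now' },
  nodeType: 'host',
  nodeIds: ['host-a', 'host-b'],
};
```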

jasonrhodes commented 3 years ago

@formgeist do you have the recording from our meeting end of last week?

formgeist commented 3 years ago

@formgeist do you have the recording from our meeting end of last week?

@jasonrhodes I've sent it to you in a DM. It's also in the meeting invite description 👍

jasonrhodes commented 2 years ago

I just talked to @katefarrar and here is what we think the MVP requirements are:

A shared table component exists that can, for a given node type (of either host, pod, or container), show a pre-selected set of metrics in a tabular view.
- TODO: These per-node type metrics TBD from @alex-fedotyev and @mukeshelastic.
- If these metrics are meant to be point in time, they should use the "last 1 minute" semantics from Metrics UI. If, however, they are meant to be aggregations over a range (more likely), we should pass in the currently selected time range from the APM view.
- Sparklines are likely out of scope for this MVP (we will investigate whether we can get data to power these using the existing APIs, but I don't think it's currently possible)
- Each value in the "name" column should link to the Metrics UI node detail page (e.g. /metrics/detail/host/{id}) [1]
This component is included in a new "infrastructure" tab on the APM service page, one for each relevant node type (tabs will be implemented on this page directly, using the shared component within the tab's content)

Note: It's okay to split these two bullets into 2 separate tasks, (1) create the component with a storybook UI (like we have for the log stream), and (2) embed in APM.
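The "name" column link from the requirements above could be built with a small helper like this one. Only the `/metrics/detail/host/{id}` pattern appears in the discussion; applying the same pattern to pods and containers, and the helper name `nodeDetailPath`, are assumptions.

```typescript
type NodeType = 'host' | 'pod' | 'container';

// Builds the relative path to the Metrics UI node detail page.
// The id is URI-encoded so that ids containing '/' or spaces stay in one segment.
function nodeDetailPath(nodeType: NodeType, id: string): string {
  return `/metrics/detail/${nodeType}/${encodeURIComponent(id)}`;
}

const hostLink = nodeDetailPath('host', 'web-01');
```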

Stretch goal:

View all in Metrics Explorer [2]. We already have all of the following information for a given instance of this shared component:
- Time range
- List of metrics for that node type
- A "group by" value determined by the node type (e.g. host.name for hosts)
- The name of the service to use as a filter
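As a rough illustration, the four pieces of information listed above could be collected into a single object before being turned into a Metrics Explorer link. This sketch does not reproduce the real Metrics Explorer URL format; `ExplorerQuery` and `buildExplorerQuery` are hypothetical names, and only `host.name` as the "group by" field comes from the discussion (the pod and container fields are assumptions).

```typescript
type NodeType = 'host' | 'pod' | 'container';

// Only host.name is mentioned in the thread; the other two are assumed.
const GROUP_BY_FIELD: Record<NodeType, string> = {
  host: 'host.name',
  pod: 'kubernetes.pod.uid',
  container: 'container.id',
};

interface ExplorerQuery {
  timeRange: { from: string; to: string };
  metrics: string[];
  groupBy: string;
  filterQuery: string; // KQL filter on the service name
}

function buildExplorerQuery(
  nodeType: NodeType,
  serviceName: string,
  metrics: string[],
  timeRange: { from: string; to: string }
): ExplorerQuery {
  // Escape backslashes and double quotes so the name is safe inside a quoted KQL value.
  const escaped = serviceName.replace(/[\\"]/g, '\\$&');
  return {
    timeRange,
    metrics,
    groupBy: GROUP_BY_FIELD[nodeType],
    filterQuery: `service.name: "${escaped}"`,
  };
}
```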

[1] Metrics UI node detail page

[2] Metrics explorer using information we have in the planned shared component

formgeist commented 2 years ago

View all in Metrics Explorer [2]. We already have all of the following information for a given instance of this shared

@jasonrhodes Do you mean to add a general link from each table list to view all the instances in one single Metrics Explorer view?

jasonrhodes commented 2 years ago

@formgeist yeah, something like that. I'm not sure how we'd position or design that link exactly, but the Metrics Explorer would be able to show exactly what the table shows, but in graph format (basically "sparklines" but full graphs and not sparklines), so it feels like a wasted opportunity to not link to it. But I don't want that to block the rest. Thoughts?

formgeist commented 2 years ago

View all in Metrics Explorer [2]. We already have all of the following information for a given instance of this shared component

@jasonrhodes @katefarrar and I discussed this in a sync yesterday, and agreed that we could pursue an option like this for the MVP. @katefarrar will create a mock that shows where and how this works.

We had a few ideas on whether it should just simply display all the nodes, or we should limit the selection to the top 10 - but that means we'd need to differentiate between top 10 CPU or memory metrics. The metrics explorer is built to display lots of metrics charts for each node and paginate if the count explodes. I reckon for the first iteration we can show an option to display all the nodes and we can iterate on whether we want to supply more specific options for top 10 or individual nodes.

katefarrar commented 2 years ago

@jasonrhodes @katefarrar and I discussed this in a sync yesterday, and agreed that we could pursue an option like this for the MVP. @katefarrar will create a mock that shows where and how this works.

Here is an idea for how we could link to the Metrics Explorer: Kapture 2021-12-03 at 12 03 44

Prototype

@alex-fedotyev @formgeist @jasonrhodes curious to hear any feedback you have. thanks!

formgeist commented 2 years ago

@katefarrar I think the design makes sense in this way, but we also have to be mindful of not giving the user too many navigation options around the same area without having a clear direction of what the user should be doing in these views. I know that we've also discussed offering the option to filter by the node individually, so that's a 3rd option. @alex-fedotyev thoughts on adding this option to visualize all the nodes in the Metrics Explorer UI?

alex-fedotyev commented 2 years ago

@katefarrar - I missed replying on this.

There is a huge technical limitation today which won't make this integration work as a quick win (the scope is not easy to transition when moving from APM to infra due to the lack of a common entity model; it would work for some transitions and mainly suffer for containerized apps).
Since this won't be an easy quick win, should we focus on a more "north star" design for this flow?

katefarrar commented 2 years ago

@alex-fedotyev that sounds good for the MVP. Thanks!

smith commented 2 years ago

Added elastic/kibana#131308 as a follow-up item.

jasonrhodes commented 2 years ago

There are a lot of follow-up issues attached to this epic. Should we categorize them so that we have some sense of "done" for the current round of development and move the rest into the backlog for future work? Or maybe leave this as an ongoing Epic but create new ones to represent the subsets of work we want to focus on?

miltonhultgren commented 2 years ago

Here is my vote for what things we should do now and in which order:

| Issue | URL | Effort |
| --- | --- | --- |
| Calculate uptime correctly | https://github.com/elastic/kibana/issues/133119 | Small |
| Use correct field for Pod/Container CPU usage | https://github.com/elastic/kibana/issues/133122 | Small |
| Scale percentage values | https://github.com/elastic/kibana/issues/133124 | Small |
| Show percentage memory usage | https://github.com/elastic/kibana/issues/133123 | Small |
| Truncate name column | https://github.com/elastic/kibana/issues/130642 | Small |
| Add module filters | https://github.com/elastic/kibana/issues/131308 | Small/Medium |
| Support Docker only environments | https://github.com/elastic/kibana/issues/133125 | Small/Medium |
| Add empty states | https://github.com/elastic/kibana/issues/127742 | Medium |
| Verify that linked node details pages work | https://github.com/elastic/kibana/issues/128639 | Medium |

I think these things can be left for later, or need product input:

| Issue | URL | Effort |
| --- | --- | --- |
| Change filter interface | https://github.com/elastic/kibana/issues/132128 | Small |
| Data accuracy communication | https://github.com/elastic/kibana/issues/128643 | Medium |
| Use terms agg | https://github.com/elastic/kibana/issues/128645 | Large |
| Add aggregation charts | https://github.com/elastic/observability-design/issues/141 | Large |
| Adding "telemetry" | https://github.com/elastic/kibana/issues/128642 | Unknown |
| Migrate to new docs platform | https://github.com/elastic/kibana/issues/127862 | Unknown |

jasonrhodes commented 2 years ago

@miltonhultgren this looks good -- let's get those top ones pulled directly into Ready on the current cycle board if they aren't already.

I think we can leave "data accuracy communication" up to APM — they can mention it on their tab rather than it being a decision we make at the shared component level, what do you think?

Same goes for telemetry, I think. We may want to have our own top-level telemetry but for now, we can leave that up to APM if they want specific telemetry added to their own tab.

Lastly, I'd like to prioritize changing the interface but I agree it's not important enough to pull it in quite yet.

Thanks, @miltonhultgren !

miltonhultgren commented 2 years ago

@jasonrhodes About data accuracy: It's mainly because we're using the composite agg, so we don't get any full set sorting on Elasticsearch.

Even with a terms agg it would still be possibly inaccurate, but within the usual norms for Kibana (which we can easily explain as "this is how ES works").

So perhaps more important than that is swapping to a terms agg which we can do behind the scenes.
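To make the difference concrete, here is a sketch of the two Elasticsearch request bodies being compared. It is not the component's actual query: the helper names are hypothetical, and `system.cpu.total.norm.pct` is only an assumed example metric field. The composite aggregation pages through buckets in key order, so it cannot rank the full set of nodes by a metric, whereas a terms aggregation can order buckets by a sub-aggregation.

```typescript
// Assumed example metric field; the real per-node-type metrics may differ.
const CPU_FIELD = 'system.cpu.total.norm.pct';

// Composite aggregation: buckets come back page by page in key order.
// The caller passes the previous page's after_key to fetch the next page.
function compositeRequest(field: string, pageSize: number, afterKey?: Record<string, string>) {
  const composite: {
    size: number;
    sources: Array<Record<string, { terms: { field: string } }>>;
    after?: Record<string, string>;
  } = {
    size: pageSize,
    sources: [{ node: { terms: { field } } }],
  };
  if (afterKey) composite.after = afterKey;
  return {
    size: 0,
    aggs: { nodes: { composite, aggs: { cpu: { avg: { field: CPU_FIELD } } } } },
  };
}

// Terms aggregation: the top `size` buckets ordered by average CPU, descending.
// Results can still be approximate across shards, as noted above.
function termsRequest(field: string, size: number) {
  return {
    size: 0,
    aggs: {
      nodes: {
        terms: { field, size, order: { cpu: 'desc' } },
        aggs: { cpu: { avg: { field: CPU_FIELD } } },
      },
    },
  };
}
```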

I updated the Epic description and our board.

jasonrhodes commented 2 years ago

@miltonhultgren oh sorry, yes I know exactly what it refers to. I just don't think the component should make the decision for the user of the component about whether to add some disclaimer to the UI about this possible discrepancy. I think it's likely a better idea to mention the situation in the component docs and then let the users of the component make their own decision about how to message about this if they want to, using their own messaging next to the shared component.

Does that make sense?

smith commented 2 years ago

This is shipping as beta in 8.5. Closing this issue. We can prioritize any follow-up issues as needed.

elastic / kibana

[Epic] Create re-usable component for infrastructure metrics #115235
