[Fleet][Proposal] Elastic Agent Debugger in Fleet

ruflin commented 2 years ago

The current Elastic Agent logs page in Fleet depends on logs and metrics collection being enabled for the Elastic Agent. This means the page does not work when logs and metrics collection is disabled, or when logs and metrics are sent to a remote cluster. There is currently an overlap between stack monitoring and Fleet logs and metrics collection for Elastic Agent. This is a proposal on how to clearly separate the two.

What is Elastic Agent monitoring/stack monitoring?

Elastic Agent monitoring collects logs, metrics and traces from the Elastic Agent and stores them as time series. This can be enabled or disabled per policy, or it can happen through an external process if needed. The key is that this is historical data that is collected for a group of Elastic Agents.

Elastic Agent Debugger in Fleet

Fleet configures and runs Elastic Agent. The purpose of the Elastic Agent debugger in Fleet is to give real-time insights into a single Elastic Agent. Instead of having to ssh into a machine to look at logs in real time, query metrics endpoints or run a diagnostics command, a user can do all of this directly in Fleet.

As soon as a user navigates to the Logs page of an Elastic Agent, real-time logs start streaming in for the user to look at. This is independent of which output is used, and the data is not persisted for the long term. In the overview of the Elastic Agent, a user could trigger a diagnostics command for the Elastic Agent, which streams the diagnostics file to the user in Fleet to download. These are just two examples.
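To make the "streamed, output-independent, not persisted" idea concrete, here is a minimal sketch of an in-memory log fan-out. Everything in it is an assumption for illustration: `LiveLogStream`, `subscribe` and `publish` are hypothetical names, not existing Fleet or Elastic Agent APIs, and the real transport between agent and Fleet is still to be figured out.

```python
from collections import deque
from typing import Callable, Deque, List


class LiveLogStream:
    """Hypothetical in-memory fan-out of log lines.

    Lines are kept only in a bounded buffer and pushed to subscribers;
    nothing is written to disk or sent to a configured output.
    """

    def __init__(self, max_lines: int = 1000) -> None:
        # Bounded buffer: the oldest lines are dropped once max_lines is reached.
        self._recent: Deque[str] = deque(maxlen=max_lines)
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        # Replay the most recent lines so a newly opened page is not empty,
        # then register the callback for everything that follows.
        for line in self._recent:
            callback(line)
        self._subscribers.append(callback)

    def publish(self, line: str) -> None:
        # Called for each new log line; delivered to every open viewer.
        self._recent.append(line)
        for callback in self._subscribers:
            callback(line)
```

A viewer opening the Logs page would correspond to one `subscribe` call; once no viewer is attached, lines simply age out of the buffer.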

Proposal

This is just a high-level proposal, and many additional things could make it into an Elastic Agent Debugger. Most things on the UX and technical side still need to be figured out.

elasticmachine commented 2 years ago

Pinging @elastic/fleet (Feature:Fleet)

AndersonQ commented 2 years ago

Perhaps it's a bit early in the discussion, but I believe it's worth mentioning so we keep it in mind. Using elastic/elastic-agent#192 as an example, we should consider that the features which enable debugging should take priority over others: they should start first and not fail if something else fails, as they are the very features that make it possible to debug what is failing or broken.
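A rough sketch of what "start first, and don't fail if something else fails" could look like. The function name `start_all` and the feature names in the usage note are hypothetical and only illustrate the ordering and failure isolation, not how Elastic Agent actually starts its components.

```python
from typing import Callable, Dict, List, Tuple

Feature = Tuple[str, Callable[[], None]]


def start_all(debug_features: List[Feature],
              other_features: List[Feature]) -> Dict[str, str]:
    """Start debug features before everything else and isolate failures.

    Each start call is wrapped individually, so a failure in one feature
    is recorded but never prevents the remaining features from starting.
    """
    status: Dict[str, str] = {}
    # Debug features come first in the start order.
    for name, start in list(debug_features) + list(other_features):
        try:
            start()
            status[name] = "running"
        except Exception as exc:
            # Record the failure instead of propagating it.
            status[name] = f"failed: {exc}"
    return status
```

For example, calling `start_all([("live-logs", start_logs)], [("output", start_output)])` with a failing `start_output` would still report `live-logs` as running, so the broken output can be debugged.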
