elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Fleet] Agent status improvements #75236

Open mostlyjason opened 4 years ago

mostlyjason commented 4 years ago

Describe the feature:

Currently, the fleet page shows the status of agents including whether they are online, offline, or have an error. It also shows whether agents are out of date, and enrolling or unenrolling. However, there is no way to see which agents have integrations that are reporting errors or are unhealthy. Instead these agents are reported as online and green, and this may be misinterpreted as healthy. We need a better way to indicate to administrators that agents are not running as expected and require attention. Endpoint security reported this use case https://github.com/elastic/kibana/issues/74708

I'd like to propose refactoring the statuses so that the fleet page shows:

Additionally, we can use a separate flag to indicate when a manual agent binary update or an agent policy update is available.

The reason we'd want to provide a summary of statuses on the overview page is to provide a rollup so fleet administrators can determine what is in flux and what requires their attention. Administrators can also filter the list to see just the set of agents requiring their attention, and combine that filter with others to look at a particular agent configuration or integration. Optionally, there could be a way to display sub-status information like "Updating: enrolling".
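To make the rollup and the combined filters concrete, here is a minimal sketch. The `Agent` shape, its field names, and the `summarize` and `filterAgents` helpers are hypothetical and only for illustration; they are not Fleet's actual data model or API.

```typescript
// Hypothetical agent record; field names are illustrative only.
interface Agent {
  id: string;
  status: string; // e.g. "healthy", "unhealthy", "updating"
  policyId: string;
}

// Count agents per status to build the overview rollup.
function summarize(agents: Agent[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const agent of agents) {
    counts[agent.status] = (counts[agent.status] ?? 0) + 1;
  }
  return counts;
}

// Combine a status filter with other filters (here, the policy).
// A criterion left undefined does not restrict the result.
function filterAgents(
  agents: Agent[],
  criteria: { status?: string; policyId?: string }
): Agent[] {
  return agents.filter(
    (agent) =>
      (criteria.status === undefined || agent.status === criteria.status) &&
      (criteria.policyId === undefined || agent.policyId === criteria.policyId)
  );
}
```

With something like this, the overview would render the counts from `summarize`, and clicking a status would call `filterAgents` with that status plus whatever policy filter is already active.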

The agent details page will show both the overview status and the finer-grained status information to help users identify the cause of problems. It will provide a way for users to see which integrations are healthy, which are disabled due to user preference or condition, and which have errors or failed a health check along with more information on the reason why. There may be a summary of the health for each integration, and the user can see the activity log for more detail.

This also allows us to communicate the status of deployments using the same statuses, rather than having separate statuses just for deployments. https://github.com/elastic/kibana/issues/72537

Describe a specific use case for the feature:

elasticmachine commented 4 years ago

Pinging @elastic/ingest-management (Team:Ingest Management)

mostlyjason commented 4 years ago

@hbharding I'd be interested in your input on this

hbharding commented 4 years ago

Hey @mostlyjason, thanks for putting this together. I think this simplifies a lot. I especially like that we can use these same statuses to communicate deployment status.

I created a Whimsical diagram that attempts to capture everything you've described. I organized the diagram so that statuses on the left will always supersede statuses on the right if any of the conditions inside are true. For example, if a policy is "unenrolling", it cannot also be in an "unhealthy" or "healthy" state.

image

I shared this in our meeting yesterday with Endpoint, and there were questions about items inside the "unhealthy" status. "Unhealthy" makes sense when some integrations could have issues while other integrations are running fine. But what if the agent is "online" and has an error that prevents all data from being sent? Shouldn't we elevate this type of status so that it appears to be more critical? Perhaps it makes sense to introduce a red "error" status like so:

image
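The "left supersedes right" rule, including the proposed "error" status, could be sketched roughly as below. The exact status list, its ordering, and the `AgentConditions` flags are assumptions made for illustration; the diagrams above are the reference, and this is not how Kibana computes status today.

```typescript
// Hypothetical status names; the final list is still under discussion.
type AgentStatus = "unenrolling" | "updating" | "error" | "unhealthy" | "healthy";

// Assumed ordering: earlier entries supersede later ones ("left wins").
const PRECEDENCE: readonly AgentStatus[] = [
  "unenrolling",
  "updating",
  "error",
  "unhealthy",
  "healthy",
];

// Hypothetical flags describing what is currently true for one agent.
interface AgentConditions {
  unenrolling: boolean;
  updating: boolean;
  allDataBlocked: boolean; // an error that prevents all data from being sent
  someIntegrationFailing: boolean; // some integrations have issues, others run fine
}

// Return the first status in precedence order whose condition holds.
function resolveStatus(c: AgentConditions): AgentStatus {
  const active: Record<AgentStatus, boolean> = {
    unenrolling: c.unenrolling,
    updating: c.updating,
    error: c.allDataBlocked,
    unhealthy: c.someIntegrationFailing,
    healthy: true, // fallback when nothing else applies
  };
  return PRECEDENCE.find((status) => active[status]) ?? "healthy";
}
```

Keeping the order in a single array means changing precedence is a one-line edit, and an agent can never be reported in two statuses at once.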

Some questions I have are:

hbharding commented 4 years ago

Also, to recap a discussion from yesterday:

re: Integration errors, we talked about maybe adding a way to "pivot" the agent table so that it is focused on policies. If an agent is unhealthy due to an integration error (Endpoint, for example), it is likely that multiple agents will have the same issue because they use the same policy. On the Fleet page, if we report 200 agents as being "unhealthy", how can the user isolate only the agents that are unhealthy because of an Endpoint integration error?
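One way to picture the pivot is grouping unhealthy agents by policy and then by failing integration. This is only a sketch: it assumes each agent reports a list of failing integrations, which is not available today, and the `AgentHealth` shape and `pivotByPolicy` helper are hypothetical names.

```typescript
// Hypothetical per-agent health record; assumes per-integration status exists.
interface AgentHealth {
  agentId: string;
  policyId: string;
  failingIntegrations: string[]; // empty when the agent is healthy
}

// Build policyId -> integration name -> ids of agents failing on that integration.
function pivotByPolicy(agents: AgentHealth[]): Map<string, Map<string, string[]>> {
  const byPolicy = new Map<string, Map<string, string[]>>();
  for (const agent of agents) {
    for (const integration of agent.failingIntegrations) {
      const byIntegration =
        byPolicy.get(agent.policyId) ?? new Map<string, string[]>();
      const ids = byIntegration.get(integration) ?? [];
      ids.push(agent.agentId);
      byIntegration.set(integration, ids);
      byPolicy.set(agent.policyId, byIntegration);
    }
  }
  return byPolicy;
}
```

Looking up a policy and then an integration name such as "endpoint" would give exactly the subset of the 200 unhealthy agents affected by that integration.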

nchaulet commented 4 years ago

I don't think it's possible to detect if an agent is "enrolling". I think "new" agents would just appear with a status of "healthy", "unhealthy", or "error".

You are right: the enrolling status we have now is more of an enrolled status. Should we have an enrolled status for agents between enrollment and the first check-in?

ph commented 3 years ago

@michalpristas or @nchaulet I can't find the issue for the Elastic Agent related to this effort. Did you ever create one?

nchaulet commented 3 years ago

@ph there was no specific issue for that, but this was partially implemented in https://github.com/elastic/kibana/pull/84434 (adding the Healthy, Unhealthy, and Updating statuses). There is no per-integration status yet, as we postponed that. The status is also still computed by Kibana rather than reported by the agent, so we do not have the Updating Policy status.

mostlyjason commented 3 years ago

Just want to describe the goal for the next phase: to expose improved status for inputs on the Agent details page, filtered by integration. That applies to the second user story:

As a Fleet administrator, I'd like to get detailed information about why an agent is not healthy so that I can troubleshoot and fix it. I'd like to identify which specific integrations and error messages are reporting the problem.