[Fleet] Improve agent observability

mostlyjason commented 4 years ago

Summary of the problem

We'd like to improve the observability for agents so that operators have better insight into problems and enough information to troubleshoot and fix them in a timely manner. Additionally, the more insight we can share with users to fix issues on their own, the less often they will get stuck and need to file a support issue.

Potential scope, PM will need to better define it:

[ ] Extend design work on Fleet status https://github.com/elastic/kibana/issues/75236 to show detailed status information on the agent details page.
[x] Add an improved log UI to the agent details page https://github.com/elastic/kibana/issues/77189
[x] Consider making the endpoint policy response a first class citizen to offer more insight into problems
[ ] Determine how to display actions like upgrades or security response actions
[ ] Decide how to surface metrics information to help users troubleshoot performance and capacity issues, and define how these capabilities relate to stack monitoring.
[x] Links to view dashboards, logs UI, metrics UI, etc. filtered on this agent or host. #98505
[ ] Focus integrations view on operational use cases like observing health status and fixing problems https://github.com/elastic/security-team/issues/244
[ ] https://github.com/elastic/kibana/issues/121885
[x] https://github.com/elastic/kibana/issues/102954
[x] https://github.com/elastic/kibana/issues/124240

**User stories**

[ ] As a Fleet user, I'd like to have better visibility into the health status of the agent and all the integrations running on it so I can identify problems.
[ ] As a Fleet user, I'd like to have better visibility into logs from the agent to troubleshoot and fix errors and other problems in a timely manner.
[ ] As a Fleet user, I'd like to have better visibility into metrics from the agent to troubleshoot and fix performance and capacity problems in a timely manner.

List known (technical) restrictions and requirements

Other

PM Lead: @mukeshelastic
Design lead: @hbharding
Collaborators: @mostlyjason

mostlyjason commented 4 years ago

@mukeshelastic I filed this design issue for planning purposes. Please review and update as desired.

katrin-freihofner commented 4 years ago

@mostlyjason it says here "...Potential scope, PM will need to better define it..." when do you think this issue will be ready to be picked up?

mostlyjason commented 4 years ago

@mukeshelastic is the PM lead for this issue so I'll defer to him.

I believe some parts are ready such as including the logstream component on the agent details page https://github.com/elastic/kibana/issues/77189

mukeshelastic commented 4 years ago

@hbharding and I discussed the two buckets in which we will need design support:

Researching and validating problems in agent observability with a few user interviews.
Exploring and designing experiences we want to build for the MVP prioritized problems.

ravikesarwani commented 4 years ago

https://github.com/elastic/kibana/issues/81872

hbharding commented 4 years ago

Small update: per @mukeshelastic + @ravikesarwani, we want to scope the initial work for this ticket in https://github.com/elastic/kibana/issues/81872 and treat this issue more as an ongoing epic that will extend beyond 7.11.

cc @mostlyjason @ph @katrin-freihofner

elasticmachine commented 3 years ago

Pinging @elastic/fleet (Team:Fleet)

mtojek commented 2 years ago

We had an offline conversation with @joshdover around improvements.

We are seeing a noticeable number of SDH issues whose root cause, or one of the possible causes, turns out to be a proxy connectivity problem. The customer has to dive into logs to figure out whether the proxy in use is operating properly (whether connections are established, no 503s, etc.).

I believe we could be more proactive and verify the connectivity between Agent and Elasticsearch, and between Agent and Fleet Server. I was thinking about a special technical policy first to verify all connections and settings, but maybe we can start by picking up the elastic-agent install feedback.

It would definitely help with researching customer problems ("Has your proxy ever worked?" vs "Is there a proxy outage now?").
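
To make the idea of a proactive connectivity check more concrete, here is a rough sketch in Python using only the standard library. It is not how Elastic Agent does or would implement this. The names `CheckResult` and `check_endpoint` are hypothetical, and treating any 5xx status (such as a 503 returned by a proxy) as unhealthy is an assumption made for illustration.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    """Outcome of one connectivity probe (hypothetical structure)."""
    target: str
    reachable: bool          # True if any HTTP response came back
    status: Optional[int]    # HTTP status code, if a response was received
    detail: str

    @property
    def healthy(self) -> bool:
        # Assumption: a 5xx answer (e.g. a 503 from the proxy) counts as unhealthy.
        return self.reachable and self.status is not None and self.status < 500


def check_endpoint(target: str, url: str, proxy_url: Optional[str] = None,
                   timeout: float = 5.0) -> CheckResult:
    """Send one GET to `url`, through `proxy_url` if given, else directly."""
    # An empty mapping disables proxies, including ones set via environment variables.
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else {}
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return CheckResult(target, True, resp.status, f"HTTP {resp.status}")
    except urllib.error.HTTPError as err:
        # The server or proxy answered, but with an error status.
        return CheckResult(target, True, err.code, f"HTTP {err.code}")
    except (urllib.error.URLError, OSError) as err:
        # No HTTP response at all: DNS failure, refused connection, timeout, etc.
        return CheckResult(target, False, None, f"connection failed: {err}")
```

A caller could run `check_endpoint` once for the Elasticsearch URL and once for the Fleet Server URL, with the configured proxy passed as `proxy_url`, and report each `CheckResult` at install time. Comparing a direct check against a proxied one would also hint at whether the proxy itself is the failing hop.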

elastic / kibana

[Fleet] Improve agent observability #78188