[Elastic Agent] Fleet managed Elastic Agent stays healthy when it can't connect to Fleet Server

BenB196 commented 2 years ago

Version: 7.16.2
Operating System: CentOS 7 (but issue can probably happen with any Fleet managed Agent)

Steps to Reproduce:

Install a Fleet-managed Elastic Agent on a system (.tar.gz install).
Wait for the Agent to register and become healthy.
Interrupt the connection between Elastic Agent and Fleet Server.

Wait for Elastic Agent to start reporting connection error logs:

{"log.level":"error","@timestamp":"2022-02-01T15:23:52.311Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post \"https://fleet-server.example.com:8220/api/fleet/agents/<agent_id>/checkin?\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)","ecs.version":"1.6.0"}

Check the Elastic Agent status, and see that it still reports healthy:

sudo /opt/Elastic/Agent/elastic-agent status
Status: HEALTHY
Message: (no message)
Applications:
* endpoint-security      (HEALTHY)
                   Protecting with policy {2dc16c0f-ebc4-400a-bd5a-7afacc5e4370}
* filebeat               (HEALTHY)
                   Running
* metricbeat             (HEALTHY)
                   Running
* filebeat_monitoring    (HEALTHY)
                   Running
* metricbeat_monitoring  (HEALTHY)
                   Running
* osquerybeat            (HEALTHY)
                   Running

Expected:

I'd expect the top-level status to be UNHEALTHY, as the Agent can no longer talk to Fleet Server and therefore can no longer pull policy updates or do anything else that requires contact with Fleet Server.

Issue:

This problem matters because it makes it hard to detect when the Elastic Agent itself enters a state in which it can no longer function properly. I can see the agent as Offline in Kibana because it hasn't checked in recently, but that doesn't help on the host itself. If I have a tool like Puppet that periodically checks for unhealthy agents and attempts to fix them, Puppet can't detect this condition from the local status, so it can't fix it automatically and manual intervention is required.
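Until the status subcommand reflects Fleet connectivity, a local check could combine the status output with the agent's own error logs. The following is only a rough sketch of that idea, not an official workaround: the function names are hypothetical, the marker string is copied from the log line quoted above, and how the status output and recent log lines are collected (log location, time window) is left to the caller as an assumption.

```python
import json
from typing import Iterable, Optional

# Substring copied from the error log line quoted in this issue.
CHECKIN_ERROR_MARKER = "fail to checkin to fleet-server"


def top_level_status(status_output: str) -> Optional[str]:
    """Return the value of the first 'Status:' line, e.g. 'HEALTHY'."""
    for line in status_output.splitlines():
        if line.startswith("Status:"):
            return line.split(":", 1)[1].strip()
    return None


def is_checkin_failure(log_line: str) -> bool:
    """True if a JSON log line is an error about a failed Fleet check-in."""
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return False  # not a JSON log line, ignore it
    if not isinstance(entry, dict):
        return False
    message = str(entry.get("message", ""))
    return entry.get("log.level") == "error" and CHECKIN_ERROR_MARKER in message


def effective_health(status_output: str, recent_log_lines: Iterable[str]) -> str:
    """Treat the agent as UNHEALTHY if the status output says so, or if any
    of the supplied log lines reports a failed check-in."""
    if top_level_status(status_output) != "HEALTHY":
        return "UNHEALTHY"
    if any(is_checkin_failure(line) for line in recent_log_lines):
        return "UNHEALTHY"
    return "HEALTHY"
```

A Puppet fact or exec could feed this the output of `elastic-agent status` plus the last few minutes of agent log lines. Note that a stale error line would also flag the agent, so the caller needs to limit the lines to a recent window.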

elasticmachine commented 2 years ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

ph commented 2 years ago

I think we aren't correctly aggregating the internal fleet status with the overall Agent status.
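To illustrate what aggregating the two would mean, here is a minimal sketch. It is not the Agent's actual implementation; the `Health` enum and `aggregate` function are hypothetical names, and it assumes only the two states mentioned in this issue. The overall status is simply the worst of the Fleet connectivity state and the per-application states.

```python
from enum import IntEnum
from typing import Mapping


class Health(IntEnum):
    # Hypothetical ordering: a higher value means a worse state.
    HEALTHY = 0
    UNHEALTHY = 1


def aggregate(fleet_connected: bool, app_states: Mapping[str, Health]) -> Health:
    """Overall status is the worst state among Fleet connectivity and
    every application state."""
    states = list(app_states.values())
    states.append(Health.HEALTHY if fleet_connected else Health.UNHEALTHY)
    return max(states)
```

With this rule, the scenario from the report (all applications healthy, check-ins failing) would yield an overall UNHEALTHY status.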

blakerouse commented 2 years ago

Yes, the local health status is not propagated to the status subcommand. It would show unhealthy in the Fleet UI.

Something we should look at improving.

iamjosh007 commented 2 years ago

Any updates on this? It seems like the whole Fleet functionality within the Stack is buggy and needs immediate fixes on all fronts.

elastic / elastic-agent

[Elastic Agent] Fleet managed Elastic Agent stays healthy when it can't connect to Fleet Server #87
