elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.
Other
18 stars 144 forks source link

[GCP instances]: Windows Agent goes Unhealthy and Healthy inconsistently when installed with Elastic Defend. #4370

Open harshitgupta-qasource opened 8 months ago

harshitgupta-qasource commented 8 months ago

Kibana Build details:

VERSION: 8.13.0
BUILD: 71857
COMMIT: 82f46148c91eec93ac7382147936028db2eb8883

Host OS and Browser version: All, All

Preconditions:

  1. 8.13.0 BC3 Kibana Cloud environment should be available.
  2. 8.13.0 GCP windows agent should be installed with Elastic Defend integration.

Steps to reproduce:

  1. Navigate to Agents Tab
  2. Wait for sometime until agent goes Unhealthy.
  3. Navigate to Agent Details tab.
  4. Observe that windows agent goes Unhealthy and Healthy inconsistently.

Expected:

Screenshot: image image

Agents Logs: elastic-agent-diagnostics-2024-03-07T09-29-57Z-00.zip

elasticmachine commented 8 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

harshitgupta-qasource commented 8 months ago

@amolnater-qasource Kindly review

amolnater-qasource commented 8 months ago

Secondary Review for this ticket is Done.

cmacknz commented 8 months ago

It looks like we must have gotten a check-in that did not include the endpoint unit state:

https://github.com/elastic/elastic-agent/blob/f9310c7f664bb25f37d671cca4e2fcf3749732c8/pkg/component/runtime/state.go#L316-L340

cmacknz commented 8 months ago

We are missing the endpoint logs, which would be needed to debug this. That is a separate bug:

harshitgupta-qasource commented 8 months ago

Hi @cmacknz

We have fetched the endpoint logs manually from the host, Please find attached below: log.zip

Thanks

cmacknz commented 8 months ago

It looks like the endpoint logs were captured after the problem occurred, the first timestamp in those logs is after the first set of problems started in the agent logs.

If you are able to reproduce this, the bug preventing endpoint logs from being included in diagnostics has been fixed now.

harshitgupta-qasource commented 8 months ago

Hi @cmacknz

We have re-validated this issue on latest 8.13.0 BC4 Kibana Cloud build and found that the issue is still reproducible.

Please find attached the latest agent logs with endpoint below: elastic-agent-diagnostics-2024-03-14T09-13-58Z-00.zip

Build details: VERSION: 8.13.0 BC4 BUILD: 71907 COMMIT: a44ba7e18a1b025d98f9a789f8177c1f5774d04e Artifact: https://staging.elastic.co/8.13.0-6a2fcd0a/downloads/beats/elastic-agent/elastic-agent-8.13.0-windows-x86_64.zip

Please let us know if anything else is required from our end. Thanks

cmacknz commented 7 months ago

Thanks, but the endpoint logs missing was fixed in BC5, so in those diagnostics the endpoint logs are still missing.

harshitgupta-qasource commented 7 months ago

Hi @cmacknz

Please find attached latest agent logs with elastic defend integration that validated on latest 8.13.0 BC5 kibana cloud build. elastic-agent-diagnostics-2024-03-19T11-15-34Z-00.zip

Build details: VERSION: 8.13.0 BC5 BUILD: 72025 COMMIT: d04713ae6c5fd43086bc488edd406ca50d4b9709 Artifact: https://staging.elastic.co/8.13.0-f75ccb77/summary-8.13.0.html#elastic-agent-package

Thanks

cmacknz commented 7 months ago

Thanks! Those have what I need.

Here's the first occurence in the agent logs:

{"log.level":"warn","@timestamp":"2024-03-19T06:35:00.843Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":606},"message":"Component state changed endpoint-default (HEALTHY->DEGRADED): Degraded: endpoint service missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-03-19T06:35:04.140Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":624},"message":"Unit state changed endpoint-default-06788e77-6fb9-447e-9814-5d05a97d8471 (HEALTHY->FAILED): Failed: not reported in check-in","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"DEGRADED"},"unit":{"id":"endpoint-default-06788e77-6fb9-447e-9814-5d05a97d8471","type":"input","state":"FAILED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-03-19T06:35:04.140Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":624},"message":"Unit state changed endpoint-default (HEALTHY->FAILED): Failed: not reported in check-in","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"DEGRADED"},"unit":{"id":"endpoint-default","type":"output","state":"FAILED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-19T06:35:07.600Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":624},"message":"Unit state changed endpoint-default-06788e77-6fb9-447e-9814-5d05a97d8471 (FAILED->HEALTHY): Applied policy {06788e77-6fb9-447e-9814-5d05a97d8471}","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"DEGRADED"},"unit":{"id":"endpoint-default-06788e77-6fb9-447e-9814-5d05a97d8471","type":"input","state":"HEALTHY","old_state":"FAILED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-19T06:35:07.600Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":624},"message":"Unit state changed endpoint-default (FAILED->HEALTHY): Applied policy {06788e77-6fb9-447e-9814-5d05a97d8471}","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"DEGRADED"},"unit":{"id":"endpoint-default","type":"output","state":"HEALTHY","old_state":"FAILED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-19T06:35:30.844Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":606},"message":"Component state changed endpoint-default (DEGRADED->HEALTHY): Healthy: communicating with endpoint service","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"HEALTHY","old_state":"DEGRADED"},"ecs.version":"1.6.0"}

Here are the endpoint logs around this time:

{"@timestamp":"2024-03-19T06:34:59.2030218Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":11436,"thread":{"id":2108}}}
{"@timestamp":"2024-03-19T06:35:00.217592Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":11436,"thread":{"id":2108}}}
{"@timestamp":"2024-03-19T06:35:01.2233155Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":11436,"thread":{"id":2108}}}
{"@timestamp":"2024-03-19T06:35:02.2344486Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":11436,"thread":{"id":2108}}}
{"@timestamp":"2024-03-19T06:35:03.2392269Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":11436,"thread":{"id":2108}}}
{"@timestamp":"2024-03-19T06:35:04.1375723Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":1048,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:1048 Agent connection established.","process":{"pid":11436,"thread":{"id":11612}}}
{"@timestamp":"2024-03-19T06:35:04.1375723Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":56,"name":"BigChief.cpp"}}},"message":"BigChief.cpp:56 Agent process notification for protection: 5052","process":{"pid":11436,"thread":{"id":11612}}}
{"@timestamp":"2024-03-19T06:35:04.1375723Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":77,"name":"BigChief.cpp"}}},"message":"BigChief.cpp:77 Agent process [5052] protection result: Success","process":{"pid":11436,"thread":{"id":11612}}}
{"@timestamp":"2024-03-19T06:35:04.1375723Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":309,"name":"AgentContext.cpp"}}},"message":"AgentContext.cpp:309 Successfully created reader/writer for CheckinV2Api","process":{"pid":11436,"thread":{"id":10040}}}
{"@timestamp":"2024-03-19T06:35:04.1375723Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"warning","origin":{"file":{"line":478,"name":"AgentContext.cpp"}}},"message":"AgentContext.cpp:478 Endpoint is setting status to CONFIGURING, reason: Policy Application Status","process":{"pid":11436,"thread":{"id":10792}}}
{"@timestamp":"2024-03-19T06:35:04.142431Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":430,"name":"AgentV2PolicyBuilder.cpp"}}},"message":"AgentV2PolicyBuilder.cpp:430 Received new configuration from agent, building new config","process":{"pid":11436,"thread":{"id":10040}}}
cmacknz commented 7 months ago

It happens again here:

{"log.level":"warn","@timestamp":"2024-03-19T06:44:31.000Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":606},"message":"Component state changed endpoint-default (HEALTHY->DEGRADED): Degraded: endpoint service missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-03-19T06:44:40.120Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":624},"message":"Unit state changed endpoint-default-06788e77-6fb9-447e-9814-5d05a97d8471 (HEALTHY->FAILED): Failed: not reported in check-in","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"DEGRADED"},"unit":{"id":"endpoint-default-06788e77-6fb9-447e-9814-5d05a97d8471","type":"input","state":"FAILED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-03-19T06:44:40.120Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":624},"message":"Unit state changed endpoint-default (HEALTHY->FAILED): Failed: not reported in check-in","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"DEGRADED"},"unit":{"id":"endpoint-default","type":"output","state":"FAILED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-19T06:44:44.318Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":624},"message":"Unit state changed endpoint-default-06788e77-6fb9-447e-9814-5d05a97d8471 (FAILED->HEALTHY): Applied policy {06788e77-6fb9-447e-9814-5d05a97d8471}","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"DEGRADED"},"unit":{"id":"endpoint-default-06788e77-6fb9-447e-9814-5d05a97d8471","type":"input","state":"HEALTHY","old_state":"FAILED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-19T06:44:44.318Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":624},"message":"Unit state changed endpoint-default (FAILED->HEALTHY): Applied policy {06788e77-6fb9-447e-9814-5d05a97d8471}","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"DEGRADED"},"unit":{"id":"endpoint-default","type":"output","state":"HEALTHY","old_state":"FAILED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-19T06:45:01.001Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":606},"message":"Component state changed endpoint-default (DEGRADED->HEALTHY): Healthy: communicating with endpoint service","log":{"source":"elastic-agent"},"component":{"id":"endpoint-default","state":"HEALTHY","old_state":"DEGRADED"},"ecs.version":"1.6.0"}

The matching endpoint logs are:

{"@timestamp":"2024-03-19T06:44:27.7973601Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":406,"name":"Response.cpp"}}},"message":"Response.cpp:406 Policy action configure_response_actions: success - Successfully configured fleet API for response actions","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:27.7973601Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2425,"name":"Config.cpp"}}},"message":"Config.cpp:2425 Configuring metadata","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:27.7973601Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2425,"name":"Config.cpp"}}},"message":"Config.cpp:2425 Configuring cloudServices","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:27.8135229Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:28.8348852Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:29.8352501Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:30.8406001Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:31.8550782Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:32.7353612Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":572,"name":"LogStorage.cpp"}}},"message":"LogStorage.cpp:572 Unsynced document log store: C:\\Program Files\\Elastic\\Endpoint\\state\\documents\\documents-2024-03-19T063449.log","process":{"pid":10052,"thread":{"id":2248}}}
{"@timestamp":"2024-03-19T06:44:32.8710733Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:33.8807365Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:34.8879664Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:35.8966438Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:36.9029597Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:37.760251Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":572,"name":"LogStorage.cpp"}}},"message":"LogStorage.cpp:572 Unsynced document log store: C:\\Program Files\\Elastic\\Endpoint\\state\\documents\\documents-2024-03-19T063449.log","process":{"pid":10052,"thread":{"id":2248}}}
{"@timestamp":"2024-03-19T06:44:37.9042636Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:38.9114662Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:39.9250488Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":2511,"name":"Config.cpp"}}},"message":"Config.cpp:2511 Checking for agent connectivity","process":{"pid":10052,"thread":{"id":2168}}}
{"@timestamp":"2024-03-19T06:44:40.1179011Z","agent":{"id":"998d041a-b62b-4fd7-8fda-945a3d40cd52","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":1048,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:1048 Agent connection established.","process":{"pid":10052,"thread":{"id":11028}}}
cmacknz commented 7 months ago

I see endpoint reconnected to agent 26 times in the endpoint logs which span ~5 hours from 2024-03-19T06:23:00.684 to 2024-03-19T11:15:34.826. That seems suspicious.

❯ rg 'Connecting to Agent' logs/services --count
logs/services/endpoint-000000.log:26

@brian-mckinney any idea what might be going on here on the endpoint side? Diagnostics and logs are in https://github.com/elastic/elastic-agent/issues/4370#issuecomment-2006970839.