elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.
Other
19 stars 144 forks source link

Agent gets unhealthy on assigning from policy with Elastic Defend integration to without Defend integration. #3617

Closed amolnater-qasource closed 1 year ago

amolnater-qasource commented 1 year ago

Kibana Build details:

VERSION: 8.11.0 BC2
BUILD: 67841
COMMIT: 636a8339cfad92998a5a5adb7be81e3546525ebf

Host OS: All

Preconditions:

  1. 8.11.0 BC2 Kibana cloud environment should be available.
  2. Windows agent should be installed using System and Elastic Defend integration.

Steps to reproduce:

  1. Navigate to Fleet>Agents tab.
  2. Assign the agent to the policy without Elastic Defend integration.
  3. Observe agent gets unhealthy.

Screen Recording:

https://github.com/elastic/elastic-agent/assets/77374876/3442901e-4afa-4a35-8199-e6193c52912e

https://github.com/elastic/elastic-agent/assets/77374876/ca9125ed-6ec8-45cf-b4a0-2bec588a234e

Agent Logs: elastic-agent-diagnostics-2023-10-17T09-17-18Z-00.zip

Agent JSON: ec2amaz-auvpg7n-agent-details.zip

Expected Result: Agent shouldn't get unhealthy on assigning from policy with Elastic Defend integration to without Defend integration.

Impacted Testcase: https://elastic.testrail.io/index.php?/tests/view/2166951

Similar issue: https://github.com/elastic/elastic-agent/issues/3507

elasticmachine commented 1 year ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

amolnater-qasource commented 1 year ago

@manishgupta-qasource Please review.

manishgupta-qasource commented 1 year ago

Secondary review for this ticket is Done

cmacknz commented 1 year ago
"components": [
    {
      "id": "log-default",
      "type": "log",
      "status": "HEALTHY",
      "message": "Healthy: communicating with pid '128'",
      "units": [
        {
          "id": "log-default",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        },
        {
          "id": "log-default-logfile-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a",
          "type": "input",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    },
    {
      "id": "winlog-default",
      "type": "winlog",
      "status": "DEGRADED",
      "message": "Degraded: pid '4900' missed 1 check-in",
      "units": [
        {
          "id": "winlog-default-winlog-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a",
          "type": "input",
          "status": "HEALTHY",
          "message": "Healthy"
        },
        {
          "id": "winlog-default",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    },
    {
      "id": "system/metrics-default",
      "type": "system/metrics",
      "status": "HEALTHY",
      "message": "Healthy: communicating with pid '916'",
      "units": [
        {
          "id": "system/metrics-default",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        },
        {
          "id": "system/metrics-default-system/metrics-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a",
          "type": "input",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    },
    {
      "id": "filestream-monitoring",
      "type": "filestream",
      "status": "DEGRADED",
      "message": "Degraded: pid '7040' missed 1 check-in",
      "units": [
        {
          "id": "filestream-monitoring-filestream-monitoring-agent",
          "type": "input",
          "status": "HEALTHY",
          "message": "Healthy"
        },
        {
          "id": "filestream-monitoring",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    },
    {
      "id": "beat/metrics-monitoring",
      "type": "beat/metrics",
      "status": "HEALTHY",
      "message": "Healthy: communicating with pid '1552'",
      "units": [
        {
          "id": "beat/metrics-monitoring-metrics-monitoring-beats",
          "type": "input",
          "status": "CONFIGURING",
          "message": "Configuring"
        },
        {
          "id": "beat/metrics-monitoring",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    },
    {
      "id": "http/metrics-monitoring",
      "type": "http/metrics",
      "status": "DEGRADED",
      "message": "Degraded: pid '6044' missed 1 check-in",
      "units": [
        {
          "id": "http/metrics-monitoring-metrics-monitoring-agent",
          "type": "input",
          "status": "CONFIGURING",
          "message": "Configuring"
        },
        {
          "id": "http/metrics-monitoring",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    }
  ],

The diagnostics shows everything is healthy but the snapshot of the agent details while this was happening shows that multiple processes are missing check ins.

cmacknz commented 1 year ago

Usually missed check ins happen due to increased CPU usage on the machine. The policy reassignment would have caused every input in the policy to stop and restart, and the missed checkins correlate with that event in the logs:

"log.level":"debug","@timestamp":"2023-10-17T09:16:42.412Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":192},"message":"context is done. exiting.","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPED->STOPPING): Stopping","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPING","old_state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed filestream-monitoring-filestream-monitoring-agent (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"filestream-monitoring","state":"HEALTHY"},"unit":{"id":"filestream-monitoring-filestream-monitoring-agent","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPING->STOPPED): Stopped","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"HEALTHY"},"unit":{"id":"winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed winlog-default-winlog-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a (STARTING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"HEALTHY"},"unit":{"id":"winlog-default-winlog-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPING->STOPPED): Stopped","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed system/metrics-default-system/metrics-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a (STARTING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPING->STOPPED): Stopped","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a (STARTING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed filestream-monitoring (HEALTHY->DEGRADED): Degraded: pid '7040' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"filestream-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed http/metrics-monitoring (HEALTHY->DEGRADED): Degraded: pid '6044' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed winlog-default (HEALTHY->DEGRADED): Degraded: pid '4900' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"DEGRADED"},"unit":{"id":"winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"DEGRADED"},"unit":{"id":"winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
cmacknz commented 1 year ago

Does this happen every time? If it does happen every time can it be reproduced on the latest 8.10.4 release? Are you able to check or record the CPU usage of the machine running the agent when the policy reassignment is happening?

amolnater-qasource commented 1 year ago

Hi @cmacknz

Thank you for looking into this.

We have validated this issue on 8.10.4 released production environment and found it not reproducible there. We have also revalidated this issue on 8.11.0 BC3 kibana cloud environment and found it consistently reproducible there.

Further observations:

Screen Recordings: 8.10.4:

https://github.com/elastic/elastic-agent/assets/77374876/2f08aeec-4ad2-4b28-a67d-5bfdb6399111

https://github.com/elastic/elastic-agent/assets/77374876/db8356e6-e883-4c0d-90de-72fa6bd7c3a0

8.11.0:

https://github.com/elastic/elastic-agent/assets/77374876/87f500e3-9eb1-4277-8390-e736f5fb7d08

https://github.com/elastic/elastic-agent/assets/77374876/5cd0c698-b411-41ee-9e35-d8b3d3131d1f

Please let us know if anything else is required from our end.

Thanks!!

cmacknz commented 1 year ago

When the CPU reaches 100%, can you run elastic-agent diagnostics --cpu-profile and attach the results here?

amolnater-qasource commented 1 year ago

Hi @cmacknz

We have reproduced this issue and recollected the new logs when CPU reached 100% with the shared command.

Logs: elastic-agent-diagnostics-2023-10-25T11-34-54Z-00.zip

Please let us know if anything else is required from our end.

Thanks!!

cmacknz commented 1 year ago

The latest diagnostics isn't giving me any clues but likely this is the same problem as https://github.com/elastic/elastic-agent/issues/3654

cmacknz commented 1 year ago

Pulling this into the sprint and assigning to Lee since this is the same as https://github.com/elastic/elastic-agent/issues/3654 but we'll need to retest this scenario to confirm.

amolnater-qasource commented 10 months ago

Hi Team,

We have revalidated this issue on latest 8.12.0 BC5 kibana cloud environment and found it fixed now.

Observations:

Build details: VERSION: 8.12.0 BC5 BUILD: 70053 COMMIT: db9b8921b37139cbb1e11d23f6381f655edeb72b Artifact Link: https://staging.elastic.co/8.12.0-9f05a310/summary-8.12.0.html

Screenshot: image

Hence we are marking this issue as QA:Validated.

Thanks!

harshitgupta-qasource commented 9 months ago

Bug Conversion:

We have updated 01 testcase for this scenario in our fleet test suite at:

Thanks