Closed amolnater-qasource closed 1 year ago
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
@manishgupta-qasource Please review.
Secondary review for this ticket is Done
"components": [
{
"id": "log-default",
"type": "log",
"status": "HEALTHY",
"message": "Healthy: communicating with pid '128'",
"units": [
{
"id": "log-default",
"type": "output",
"status": "HEALTHY",
"message": "Healthy"
},
{
"id": "log-default-logfile-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a",
"type": "input",
"status": "HEALTHY",
"message": "Healthy"
}
]
},
{
"id": "winlog-default",
"type": "winlog",
"status": "DEGRADED",
"message": "Degraded: pid '4900' missed 1 check-in",
"units": [
{
"id": "winlog-default-winlog-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a",
"type": "input",
"status": "HEALTHY",
"message": "Healthy"
},
{
"id": "winlog-default",
"type": "output",
"status": "HEALTHY",
"message": "Healthy"
}
]
},
{
"id": "system/metrics-default",
"type": "system/metrics",
"status": "HEALTHY",
"message": "Healthy: communicating with pid '916'",
"units": [
{
"id": "system/metrics-default",
"type": "output",
"status": "HEALTHY",
"message": "Healthy"
},
{
"id": "system/metrics-default-system/metrics-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a",
"type": "input",
"status": "HEALTHY",
"message": "Healthy"
}
]
},
{
"id": "filestream-monitoring",
"type": "filestream",
"status": "DEGRADED",
"message": "Degraded: pid '7040' missed 1 check-in",
"units": [
{
"id": "filestream-monitoring-filestream-monitoring-agent",
"type": "input",
"status": "HEALTHY",
"message": "Healthy"
},
{
"id": "filestream-monitoring",
"type": "output",
"status": "HEALTHY",
"message": "Healthy"
}
]
},
{
"id": "beat/metrics-monitoring",
"type": "beat/metrics",
"status": "HEALTHY",
"message": "Healthy: communicating with pid '1552'",
"units": [
{
"id": "beat/metrics-monitoring-metrics-monitoring-beats",
"type": "input",
"status": "CONFIGURING",
"message": "Configuring"
},
{
"id": "beat/metrics-monitoring",
"type": "output",
"status": "HEALTHY",
"message": "Healthy"
}
]
},
{
"id": "http/metrics-monitoring",
"type": "http/metrics",
"status": "DEGRADED",
"message": "Degraded: pid '6044' missed 1 check-in",
"units": [
{
"id": "http/metrics-monitoring-metrics-monitoring-agent",
"type": "input",
"status": "CONFIGURING",
"message": "Configuring"
},
{
"id": "http/metrics-monitoring",
"type": "output",
"status": "HEALTHY",
"message": "Healthy"
}
]
}
],
The diagnostics shows everything is healthy but the snapshot of the agent details while this was happening shows that multiple processes are missing check ins.
Usually missed check ins happen due to increased CPU usage on the machine. The policy reassignment would have caused every input in the policy to stop and restart, and the missed checkins correlate with that event in the logs:
"log.level":"debug","@timestamp":"2023-10-17T09:16:42.412Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":192},"message":"context is done. exiting.","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPED->STOPPING): Stopping","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPING","old_state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed filestream-monitoring-filestream-monitoring-agent (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"filestream-monitoring","state":"HEALTHY"},"unit":{"id":"filestream-monitoring-filestream-monitoring-agent","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPING->STOPPED): Stopped","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"HEALTHY"},"unit":{"id":"winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed winlog-default-winlog-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a (STARTING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"HEALTHY"},"unit":{"id":"winlog-default-winlog-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPING->STOPPED): Stopped","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed system/metrics-default-system/metrics-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a (STARTING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPING->STOPPED): Stopped","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a (STARTING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed filestream-monitoring (HEALTHY->DEGRADED): Degraded: pid '7040' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"filestream-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed http/metrics-monitoring (HEALTHY->DEGRADED): Degraded: pid '6044' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed winlog-default (HEALTHY->DEGRADED): Degraded: pid '4900' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"DEGRADED"},"unit":{"id":"winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"DEGRADED"},"unit":{"id":"winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
Does this happen every time? If it does happen every time can it be reproduced on the latest 8.10.4 release? Are you able to check or record the CPU usage of the machine running the agent when the policy reassignment is happening?
Hi @cmacknz
Thank you for looking into this.
We have validated this issue on 8.10.4 released production environment and found it not reproducible there. We have also revalidated this issue on 8.11.0 BC3 kibana cloud environment and found it consistently reproducible there.
Further observations:
Screen Recordings: 8.10.4:
https://github.com/elastic/elastic-agent/assets/77374876/2f08aeec-4ad2-4b28-a67d-5bfdb6399111
https://github.com/elastic/elastic-agent/assets/77374876/db8356e6-e883-4c0d-90de-72fa6bd7c3a0
8.11.0:
https://github.com/elastic/elastic-agent/assets/77374876/87f500e3-9eb1-4277-8390-e736f5fb7d08
https://github.com/elastic/elastic-agent/assets/77374876/5cd0c698-b411-41ee-9e35-d8b3d3131d1f
Please let us know if anything else is required from our end.
Thanks!!
When the CPU reaches 100%, can you run elastic-agent diagnostics --cpu-profile
and attach the results here?
Hi @cmacknz
We have reproduced this issue and recollected the new logs when CPU reached 100% with the shared command.
Logs: elastic-agent-diagnostics-2023-10-25T11-34-54Z-00.zip
Please let us know if anything else is required from our end.
Thanks!!
The latest diagnostics isn't giving me any clues but likely this is the same problem as https://github.com/elastic/elastic-agent/issues/3654
Pulling this into the sprint and assigning to Lee since this is the same as https://github.com/elastic/elastic-agent/issues/3654 but we'll need to retest this scenario to confirm.
Hi Team,
We have revalidated this issue on latest 8.12.0 BC5 kibana cloud environment and found it fixed now.
Observations:
Build details: VERSION: 8.12.0 BC5 BUILD: 70053 COMMIT: db9b8921b37139cbb1e11d23f6381f655edeb72b Artifact Link: https://staging.elastic.co/8.12.0-9f05a310/summary-8.12.0.html
Screenshot:
Hence we are marking this issue as QA:Validated.
Thanks!
Bug Conversion:
We have updated 01 testcase for this scenario in our fleet test suite at:
Thanks
Kibana Build details:
Host OS: All
Preconditions:
Steps to reproduce:
Screen Recording:
https://github.com/elastic/elastic-agent/assets/77374876/3442901e-4afa-4a35-8199-e6193c52912e
https://github.com/elastic/elastic-agent/assets/77374876/ca9125ed-6ec8-45cf-b4a0-2bec588a234e
Agent Logs: elastic-agent-diagnostics-2023-10-17T09-17-18Z-00.zip
Agent JSON: ec2amaz-auvpg7n-agent-details.zip
Expected Result: Agent shouldn't get unhealthy on assigning from policy with Elastic Defend integration to without Defend integration.
Impacted Testcase: https://elastic.testrail.io/index.php?/tests/view/2166951
Similar issue: https://github.com/elastic/elastic-agent/issues/3507