Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
@manishgupta-qasource Please review.
Looks like an Agent bug. I can see the Agent successfully installing Endpoint, then detecting an error on the bootstrap pipe and uninstalling Endpoint. I'm not sure the bootstrap pipe "error" was the reason for the Endpoint removal; we can see "endpoint service has checked in, send stopping state to service", so the services did connect to each other over the bootstrap pipe.
{"log.level":"error","@timestamp":"2024-07-09T11:15:47.300Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.executeCommand.func2","file.name":"runtime/service_command.go","file.line":69},"message":"2024-07-09 11:15:47: info: Exec.cpp:1177 Successfully ran /bin/launchctl load -w /Library/LaunchDaemons/co.elastic.endpoint.plist","context":"command output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-07-09T11:17:43.731Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.newConnInfoServer.func1","file.name":"runtime/conn_info_server.go","file.line":56},"message":"failed accept conn info connection: accept unix /Library/Elastic/Agent/.eaci.sock: use of closed network connection","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-07-09T11:17:43.731Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.(*serviceRuntime).stop","file.name":"runtime/service.go","file.line":376},"message":"stopping endpoint service runtime","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-07-09T11:17:43.731Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.(*serviceRuntime).stop","file.name":"runtime/service.go","file.line":392},"message":"endpoint service has checked in, send stopping state to service","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-07-09T11:17:43.731Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.(*serviceRuntime).stop","file.name":"runtime/service.go","file.line":400},"message":"uninstall endpoint service","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-07-09T11:17:43.993Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.executeCommand.func2","file.name":"runtime/service_command.go","file.line":69},"message":"2024-07-09 11:17:43: info: MainPosix.cpp:262 Executing uninstall","context":"command output","ecs.version":"1.6.0"}
Secondary Review for this ticket is Done
It looks to me like we only got a policy change to uninstall endpoint, but not one to re-install it. The last change in the sequence contains "removed":["log-default","system/metrics-default","endpoint-default".
{"log.level":"info","@timestamp":"2024-07-09T11:15:39.177Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).checkAndLogUpdate","file.name":"coordinator/coordinator.go","file.line":1479},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{"updated":["beat/metrics-monitoring: [(beat/metrics-monitoring-metrics-monitoring-beats: updated)]","http/metrics-monitoring: [(http/metrics-monitoring-metrics-monitoring-agent: updated)]"],"count":4},"outputs":{}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-07-09T11:15:47.414Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).checkAndLogUpdate","file.name":"coordinator/coordinator.go","file.line":1479},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{"updated":["beat/metrics-monitoring: [(beat/metrics-monitoring-metrics-monitoring-beats: updated)]","http/metrics-monitoring: [(http/metrics-monitoring-metrics-monitoring-agent: updated)]"],"count":6},"outputs":{}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-07-09T11:16:11.718Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).checkAndLogUpdate","file.name":"coordinator/coordinator.go","file.line":1479},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{"added":["system/metrics-monitoring"],"updated":["beat/metrics-monitoring: [(beat/metrics-monitoring-metrics-monitoring-beats: updated)]","http/metrics-monitoring: [(http/metrics-monitoring-metrics-monitoring-agent: updated)]"],"count":7},"outputs":{}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-07-09T11:17:42.999Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).checkAndLogUpdate","file.name":"coordinator/coordinator.go","file.line":1479},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{"removed":["log-default","system/metrics-default","endpoint-default","system/metrics-monitoring"],"updated":["http/metrics-monitoring: [(http/metrics-monitoring-metrics-monitoring-agent: updated)]","filestream-monitoring: [(filestream-monitoring-filestream-monitoring-agent: updated)]","beat/metrics-monitoring: [(beat/metrics-monitoring-metrics-monitoring-beats: updated)]"],"count":3},"outputs":{}},"ecs.version":"1.6.0"}
I had thought this might be a problem re-binding to the unix socket, but we have logic to remove it before creating and binding to it again: https://github.com/elastic/elastic-agent/blob/d09ef623edfda136dd7392042f2020e4326a45c9/pkg/ipc/listener.go#L27-L33
Problems re-binding to the unix socket would also be reproducible on Linux if this were the root cause.
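For context, the linked listener code follows the usual remove-then-bind pattern for unix sockets. The sketch below is a simplified illustration of that pattern, not the exact elastic-agent implementation (the socket path is made up):

package main

import (
	"log"
	"net"
	"os"
)

// listenUnix removes any stale socket file left over from a previous run,
// then binds a fresh unix-domain listener. This mirrors the shape of the
// linked pkg/ipc/listener.go logic; it is a sketch, not the real code.
func listenUnix(path string) (net.Listener, error) {
	// Best-effort cleanup: ignore "file does not exist" errors.
	if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
		return nil, err
	}
	return net.Listen("unix", path)
}

func main() {
	l, err := listenUnix("/tmp/example.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer l.Close()
	log.Printf("listening on %s", l.Addr())
}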
The policy in the diagnostics is indeed missing endpoint, as suspected, so we uninstalled endpoint as instructed; at the point in time captured in the diagnostics, the agent hadn't been told to install it again.
agent:
  download:
    sourceURI: https://staging.elastic.co/8.15.0-c7717606/downloads/
  features: null
  monitoring:
    enabled: true
    logs: true
    metrics: true
    namespace: mac
    use_output: default
  protection:
    enabled: false
    signing_key: <REDACTED>
    uninstall_token_hash: <REDACTED>
fleet:
  hosts:
    - https://0ea335ba862e49e18458a3ef58111cc8.fleet.us-west2.gcp.elastic-cloud.com:443
host:
  id: 298365BA-DC99-512C-9E42-026403F89999
id: 30e3cb63-3e88-448d-ad4b-a392fe117849
outputs:
  default:
    api_key: <REDACTED>
    hosts:
      - https://aa6a148468024c99b25c706c496ece04.us-west2.gcp.elastic-cloud.com:443
    preset: balanced
    type: elasticsearch
path:
  config: /Library/Elastic/Agent
  data: /Library/Elastic/Agent/data
  home: /Library/Elastic/Agent/data/elastic-agent-8.15.0-b7f8e2
  logs: /Library/Elastic/Agent
revision: 1
runtime:
  arch: arm64
  native_arch: arm64
  os: darwin
  osinfo:
    family: darwin
    major: 14
    minor: 5
    patch: 0
    type: macos
    version: "14.5"
signed:
  data: eyJpZCI6IjMwZTNjYjYzLTNlODgtNDQ4ZC1hZDRiLWEzOTJmZTExNzg0OSIsImFnZW50Ijp7ImZlYXR1cmVzIjp7fSwicHJvdGVjdGlvbiI6eyJlbmFibGVkIjpmYWxzZSwidW5pbnN0YWxsX3Rva2VuX2hhc2giOiJINEdsamk0SGI3UlQyVVVLOC9jakxFTlFzVzFHSUx6eW1ac0o4c3NVWllBPSIsInNpZ25pbmdfa2V5IjoiTUZrd0V3WUhLb1pJemowQ0FRWUlLb1pJemowREFRY0RRZ0FFdER3MHVRWWRySTg1akdSNzBsTGhWNjdJOHcrc0VxL2RQQmNnVlozV2xpRHF6c1hobCtLVHBMNFNhTStYN3FldUNHRWZrUnk4NU9vK1pGN01Od3ArZFE9PSJ9fSwiaW5wdXRzIjpbXX0=
  signature: MEYCIQCTZ9D2JuolFBk2W5REVzC3Yo8+DEDyk4+9YCmX7dSumQIhAKaWnb8xuk9lqhydJEdg2Dtm9IqNv57W+faf3CMS/r3U
I just tried this and the policy updated on the agent almost immediately. @amolnater-qasource is this reproducible for you?
sudo elastic-agent status --output=full
┌─ fleet
│ └─ status: (HEALTHY) Connected
└─ elastic-agent
├─ status: (HEALTHY) Running
├─ info
│ ├─ id: 56f54ff3-e21e-4be8-b193-e9b669e3177a
│ ├─ version: 8.15.0
│ └─ commit: b7f8e2a061203c997df73ccc9e4447e3184907db
├─ endpoint-default
│ ├─ status: (HEALTHY) Healthy: communicating with endpoint service
│ ├─ endpoint-default
│ │ ├─ status: (HEALTHY) Applied policy {c6b1c82c-3466-4590-a533-90a7777c2cea}
│ │ └─ type: OUTPUT
│ └─ endpoint-default-c6b1c82c-3466-4590-a533-90a7777c2cea
│ ├─ status: (HEALTHY) Applied policy {c6b1c82c-3466-4590-a533-90a7777c2cea}
│ └─ type: INPUT
❯ sudo elastic-agent status --output=full
┌─ fleet
│ └─ status: (HEALTHY) Connected
└─ elastic-agent
├─ status: (HEALTHY) Running
├─ info
│ ├─ id: 56f54ff3-e21e-4be8-b193-e9b669e3177a
│ ├─ version: 8.15.0
│ └─ commit: b7f8e2a061203c997df73ccc9e4447e3184907db
├─ endpoint-default
│ ├─ status: (HEALTHY) Healthy: communicating with endpoint service
│ ├─ endpoint-default
│ │ ├─ status: (HEALTHY) Applied policy {dcbd8b3f-4081-46c4-95fa-bc48263543d3}
│ │ └─ type: OUTPUT
│ └─ endpoint-default-dcbd8b3f-4081-46c4-95fa-bc48263543d3
│ ├─ status: (HEALTHY) Applied policy {dcbd8b3f-4081-46c4-95fa-bc48263543d3}
│ └─ type: INPUT
Hi @cmacknz
Thank you for looking into this issue.
We have revalidated this issue on 8.15.0 BC1; please find further observations below:
Agent was installed with: elastic-agent.zip
Agent assigned to new policy (these policies were duplicated from the policy with which the agent was installed): elastic-agent (1).zip elastic-agent (2).zip
Agent diagnostics: elastic-agent-diagnostics-2024-07-10T05-28-18Z-00.zip
Could you please confirm if there's any conflict with the shared policies?
Please let us know if anything else is required from our end. Thanks!
I see the same as above; we only get the final policy update, which removes everything:
{"log.level":"info","@timestamp":"2024-07-10T05:27:25.719Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).checkAndLogUpdate","file.name":"coordinator/coordinator.go","file.line":1479},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{"removed":["system/metrics-monitoring","endpoint-default","log-default","system/metrics-default"],"updated":["beat/metrics-monitoring: [(beat/metrics-monitoring-metrics-monitoring-beats: updated)]","http/metrics-monitoring: [(http/metrics-monitoring-metrics-monitoring-agent: updated)]","filestream-monitoring: [(filestream-monitoring-filestream-monitoring-agent: updated)]"],"count":3},"outputs":{}},"ecs.version":"1.6.0"}
I tried to reproduce the issue but couldn't yet; the agent receives the endpoint input as expected after the policy is reassigned.
@amolnater-qasource Could we get the output of these queries from the Kibana console if the deployment is still running? Or, if you can share the deployment ID, I can look at it in admin.
GET .fleet-actions/_search?size=100
GET .fleet-policies/_search?size=100
GET kbn:/api/fleet/agents/action_status
Hi @juliaElastic, thank you for looking into this issue.
Please find below the output for the queries:
GET .fleet-actions/_search?size=100
Actions.txt
GET kbn:/api/fleet/agents/action_status
Status.txt
We tried running GET .fleet-policies/_search?size=100, but it crashed the browser every time we ran the query.
We will also share the deployment ID with you over Slack.
Thanks!
Thanks Amol.
Looking at the data, I found a potential bug. Here is the policy the agent was reassigned to: it seems that agent policy tamper protection was switched on, but the policy revision was not bumped, and .fleet-policies has two docs with policy revision_idx:1 and coordinator_idx:1, one of which doesn't contain inputs. So fleet-server randomly picks one of them, and it happens to be the one without inputs.
I'll check the code in Kibana.
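As a rough illustration of why the duplicate documents matter, here is a minimal sketch, assuming fleet-server effectively selects the newest policy document by (revision_idx, coordinator_idx). The field values come from the documents shown below; the policyDoc type and the selection code are hypothetical, not fleet-server's actual implementation. With both keys tied at 1, the winner depends on the order the documents come back in:

package main

import (
	"fmt"
	"sort"
)

// policyDoc is a hypothetical simplification of a .fleet-policies document,
// keeping only the fields relevant to the selection problem.
type policyDoc struct {
	ID             string
	RevisionIdx    int
	CoordinatorIdx int
	HasInputs      bool
}

func main() {
	// The two documents from the query below share
	// (revision_idx, coordinator_idx) = (1, 1); one carries the
	// endpoint inputs, the other does not.
	docs := []policyDoc{
		{ID: "7Pkul5ABaxNQL_pV7eMj", RevisionIdx: 1, CoordinatorIdx: 1, HasInputs: false},
		{ID: "7fkul5ABaxNQL_pV7uMs", RevisionIdx: 1, CoordinatorIdx: 1, HasInputs: true},
	}

	// Pick the document with the highest (revision_idx, coordinator_idx).
	// With an exact tie on both keys, the "winner" is whichever document
	// happens to come back first, so the agent can be handed the one
	// without inputs.
	sort.SliceStable(docs, func(i, j int) bool {
		if docs[i].RevisionIdx != docs[j].RevisionIdx {
			return docs[i].RevisionIdx > docs[j].RevisionIdx
		}
		return docs[i].CoordinatorIdx > docs[j].CoordinatorIdx
	})
	fmt.Printf("selected %s (has inputs: %v)\n", docs[0].ID, docs[0].HasInputs)
}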
I could reproduce this locally with the 8.15 fleet-server (8.16 no longer has the coordinator, so no duplicate documents are created). It seems to happen only when copying a policy with tamper protection enabled; that is how the policy ends up at revision:1.
Otherwise we cannot create an agent policy with tamper protection, since we have to add the endpoint integration first, and the revision ends up > 1.
Reduced the impact to medium, as I think the copy policy feature is not used that frequently.
GET .fleet-policies/_search?q=30e3cb63-3e88-448d-ad4b-a392fe117849

{
  "_index": ".fleet-policies-7",
  "_id": "7dd3b7e4-2ae3-516b-990e-8a645bb6c434",
  "_score": 4.7706842,
  "_source": {
    "@timestamp": "2024-07-09T11:09:13.259Z",
    "revision_idx": 1,
    "coordinator_idx": 0,
    "data": {
      "id": "30e3cb63-3e88-448d-ad4b-a392fe117849",
      "inputs": [
        {
          "id": "8a9c6a8f-42b4-468a-94a4-2dd3378fcdea",
          "revision": 1,
          "name": "ED mac (copy)",
          "type": "endpoint",
          "data_stream": {
            "namespace": "mac"
          },
          "use_output": "default",
          "package_policy_id": "8a9c6a8f-42b4-468a-94a4-2dd3378fcdea",
          ...
},
{
  "_index": ".fleet-policies-7",
  "_id": "7Pkul5ABaxNQL_pV7eMj",
  "_score": 4.7706842,
  "_source": {
    "coordinator_idx": 1,
    "data": {
      "fleet": {
        "hosts": [
          "https://0ea335ba862e49e18458a3ef58111cc8.fleet.us-west2.gcp.elastic-cloud.com:443"
        ]
      },
      "id": "30e3cb63-3e88-448d-ad4b-a392fe117849"
    },
    "default_fleet_server": false,
    "policy_id": "30e3cb63-3e88-448d-ad4b-a392fe117849",
    "revision_idx": 1,
    "@timestamp": "2024-07-09T11:09:10.709Z"
  }
},
{
  "_index": ".fleet-policies-7",
  "_id": "7fkul5ABaxNQL_pV7uMs",
  "_score": 4.7706842,
  "_source": {
    "coordinator_idx": 1,
    "data": {
      "id": "30e3cb63-3e88-448d-ad4b-a392fe117849",
      "inputs": [
        {
          "id": "8a9c6a8f-42b4-468a-94a4-2dd3378fcdea",
          "integration_config": {
            "endpointConfig": {
              "preset": "EDRComplete"
            },
            "type": "endpoint"
          },
          "meta": {
            "package": {
              "name": "endpoint",
              "version": "8.15.0"
            }
          },
          "name": "ED mac (copy)",
          "package_policy_id": "8a9c6a8f-42b4-468a-94a4-2dd3378fcdea",
          "revision": 1,
          "type": "endpoint",
          "use_output": "default"
        }
      ]
    },
    "default_fleet_server": false,
    "policy_id": "30e3cb63-3e88-448d-ad4b-a392fe117849",
    "revision_idx": 1,
    "@timestamp": "2024-07-09T11:09:13.259Z"
  }
}
Pinging @elastic/fleet (Team:Fleet)
Hi Team,
We have revalidated this issue on the latest 8.15.0 BC6 Kibana cloud environment and found it fixed.
Observations:
Logs: elastic-agent-diagnostics-2024-08-07T10-37-57Z-00.zip
Build details: VERSION: 8.15.0 BC6 BUILD: 76360 COMMIT: 8aa0b59da12c996e3048d887546667ee6e15c7f
Hence, we are marking this issue as QA:Validated.
Thanks!
Kibana Build details:
Artifact Link: https://staging.elastic.co/8.15.0-c7717606/downloads/beats/elastic-agent/elastic-agent-8.15.0-darwin-aarch64.tar.gz
Expected Result: Endpoint should get updated when assigning the agent to a new policy with the Defend integration.
Agent Logs: elastic-agent-diagnostics-2024-07-09T11-26-41Z-00.zip
New policy: elastic-agent (1).zip