elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[MAC aarch64]: Endpoint doesn't get updated on Assigning agent to new policy with Defend integration. #188929

Closed amolnater-qasource closed 1 month ago

amolnater-qasource commented 2 months ago

Kibana Build details:

VERSION: 8.15.0 BC1
BUILD: 76008
COMMIT: c616ed3da09e04c766be0d791373dc78c1231e12

Artifact Link: https://staging.elastic.co/8.15.0-c7717606/downloads/beats/elastic-agent/elastic-agent-8.15.0-darwin-aarch64.tar.gz

Preconditions:

  1. An 8.15.0 BC1 Kibana cloud environment should be available.
  2. A MAC aarch64 agent should be installed with the Elastic Defend integration.

Steps to reproduce:

  1. Navigate to the Agents tab.
  2. Assign the agent to a different policy that has the Elastic Defend integration.
  3. Observe that the MAC endpoint doesn't get updated and remains Out of date for over 10 minutes under the Endpoints tab.
  4. Further observe that the Endpoint folder gets removed from the installation directory.

Expected Result: Endpoint should get updated on assigning the agent to a new policy with the Defend integration.

NOTE:

Screenshot: 1 2

Agent Logs: elastic-agent-diagnostics-2024-07-09T11-26-41Z-00.zip

New policy: elastic-agent (1).zip

elasticmachine commented 2 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

amolnater-qasource commented 2 months ago

@manishgupta-qasource Please review.

intxgo commented 2 months ago

Looks like an Agent bug. I can see the Agent successfully installing Endpoint, then detecting an error on the bootstrap pipe and uninstalling Endpoint. I'm not sure the bootstrap pipe "error" was the reason for Endpoint's removal; we can see "endpoint service has checked in, send stopping state to service", so the services did connect to each other over the bootstrap pipe.

{"log.level":"error","@timestamp":"2024-07-09T11:15:47.300Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.executeCommand.func2","file.name":"runtime/service_command.go","file.line":69},"message":"2024-07-09 11:15:47: info: Exec.cpp:1177 Successfully ran /bin/launchctl load -w /Library/LaunchDaemons/co.elastic.endpoint.plist","context":"command output","ecs.version":"1.6.0"}

{"log.level":"error","@timestamp":"2024-07-09T11:17:43.731Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.newConnInfoServer.func1","file.name":"runtime/conn_info_server.go","file.line":56},"message":"failed accept conn info connection: accept unix /Library/Elastic/Agent/.eaci.sock: use of closed network connection","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-07-09T11:17:43.731Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.(*serviceRuntime).stop","file.name":"runtime/service.go","file.line":376},"message":"stopping endpoint service runtime","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-07-09T11:17:43.731Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.(*serviceRuntime).stop","file.name":"runtime/service.go","file.line":392},"message":"endpoint service has checked in, send stopping state to service","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-07-09T11:17:43.731Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.(*serviceRuntime).stop","file.name":"runtime/service.go","file.line":400},"message":"uninstall endpoint service","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-07-09T11:17:43.993Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"function":"github.com/elastic/elastic-agent/pkg/component/runtime.executeCommand.func2","file.name":"runtime/service_command.go","file.line":69},"message":"2024-07-09 11:17:43: info: MainPosix.cpp:262 Executing uninstall","context":"command output","ecs.version":"1.6.0"}
manishgupta-qasource commented 2 months ago

Secondary Review for this ticket is Done

cmacknz commented 2 months ago

It looks to me like we only got a policy change to uninstall endpoint, but not one for re-installing it. The last change in the sequence contains "removed":["log-default","system/metrics-default","endpoint-default",...].

{"log.level":"info","@timestamp":"2024-07-09T11:15:39.177Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).checkAndLogUpdate","file.name":"coordinator/coordinator.go","file.line":1479},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{"updated":["beat/metrics-monitoring: [(beat/metrics-monitoring-metrics-monitoring-beats: updated)]","http/metrics-monitoring: [(http/metrics-monitoring-metrics-monitoring-agent: updated)]"],"count":4},"outputs":{}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-07-09T11:15:47.414Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).checkAndLogUpdate","file.name":"coordinator/coordinator.go","file.line":1479},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{"updated":["beat/metrics-monitoring: [(beat/metrics-monitoring-metrics-monitoring-beats: updated)]","http/metrics-monitoring: [(http/metrics-monitoring-metrics-monitoring-agent: updated)]"],"count":6},"outputs":{}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-07-09T11:16:11.718Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).checkAndLogUpdate","file.name":"coordinator/coordinator.go","file.line":1479},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{"added":["system/metrics-monitoring"],"updated":["beat/metrics-monitoring: [(beat/metrics-monitoring-metrics-monitoring-beats: updated)]","http/metrics-monitoring: [(http/metrics-monitoring-metrics-monitoring-agent: updated)]"],"count":7},"outputs":{}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-07-09T11:17:42.999Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).checkAndLogUpdate","file.name":"coordinator/coordinator.go","file.line":1479},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{"removed":["log-default","system/metrics-default","endpoint-default","system/metrics-monitoring"],"updated":["http/metrics-monitoring: [(http/metrics-monitoring-metrics-monitoring-agent: updated)]","filestream-monitoring: [(filestream-monitoring-filestream-monitoring-agent: updated)]","beat/metrics-monitoring: [(beat/metrics-monitoring-metrics-monitoring-beats: updated)]"],"count":3},"outputs":{}},"ecs.version":"1.6.0"}

I had thought this might be a problem re-binding to the unix socket, but we have logic to remove it before creating and binding to it again: https://github.com/elastic/elastic-agent/blob/d09ef623edfda136dd7392042f2020e4326a45c9/pkg/ipc/listener.go#L27-L33

Problems re-binding to the unix socket would also be reproducible on Linux if this were the root cause.

cmacknz commented 2 months ago

The policy in the diagnostics is actually missing endpoint as suspected, so we uninstalled endpoint as instructed, and at the point in time captured in the diagnostics, the agent hadn't been told to install it again.

agent:
    download:
        sourceURI: https://staging.elastic.co/8.15.0-c7717606/downloads/
    features: null
    monitoring:
        enabled: true
        logs: true
        metrics: true
        namespace: mac
        use_output: default
    protection:
        enabled: false
        signing_key: <REDACTED>
        uninstall_token_hash: <REDACTED>
fleet:
    hosts:
        - https://0ea335ba862e49e18458a3ef58111cc8.fleet.us-west2.gcp.elastic-cloud.com:443
host:
    id: 298365BA-DC99-512C-9E42-026403F89999
id: 30e3cb63-3e88-448d-ad4b-a392fe117849
outputs:
    default:
        api_key: <REDACTED>
        hosts:
            - https://aa6a148468024c99b25c706c496ece04.us-west2.gcp.elastic-cloud.com:443
        preset: balanced
        type: elasticsearch
path:
    config: /Library/Elastic/Agent
    data: /Library/Elastic/Agent/data
    home: /Library/Elastic/Agent/data/elastic-agent-8.15.0-b7f8e2
    logs: /Library/Elastic/Agent
revision: 1
runtime:
    arch: arm64
    native_arch: arm64
    os: darwin
    osinfo:
        family: darwin
        major: 14
        minor: 5
        patch: 0
        type: macos
        version: "14.5"
signed:
    data: eyJpZCI6IjMwZTNjYjYzLTNlODgtNDQ4ZC1hZDRiLWEzOTJmZTExNzg0OSIsImFnZW50Ijp7ImZlYXR1cmVzIjp7fSwicHJvdGVjdGlvbiI6eyJlbmFibGVkIjpmYWxzZSwidW5pbnN0YWxsX3Rva2VuX2hhc2giOiJINEdsamk0SGI3UlQyVVVLOC9jakxFTlFzVzFHSUx6eW1ac0o4c3NVWllBPSIsInNpZ25pbmdfa2V5IjoiTUZrd0V3WUhLb1pJemowQ0FRWUlLb1pJemowREFRY0RRZ0FFdER3MHVRWWRySTg1akdSNzBsTGhWNjdJOHcrc0VxL2RQQmNnVlozV2xpRHF6c1hobCtLVHBMNFNhTStYN3FldUNHRWZrUnk4NU9vK1pGN01Od3ArZFE9PSJ9fSwiaW5wdXRzIjpbXX0=
    signature: MEYCIQCTZ9D2JuolFBk2W5REVzC3Yo8+DEDyk4+9YCmX7dSumQIhAKaWnb8xuk9lqhydJEdg2Dtm9IqNv57W+faf3CMS/r3U
cmacknz commented 2 months ago

I just tried this and the policy updated on the agent almost immediately. @amolnater-qasource is this reproducible for you?

sudo elastic-agent status --output=full
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: 56f54ff3-e21e-4be8-b193-e9b669e3177a
   │  ├─ version: 8.15.0
   │  └─ commit: b7f8e2a061203c997df73ccc9e4447e3184907db
   ├─ endpoint-default
   │  ├─ status: (HEALTHY) Healthy: communicating with endpoint service
   │  ├─ endpoint-default
   │  │  ├─ status: (HEALTHY) Applied policy {c6b1c82c-3466-4590-a533-90a7777c2cea}
   │  │  └─ type: OUTPUT
   │  └─ endpoint-default-c6b1c82c-3466-4590-a533-90a7777c2cea
   │     ├─ status: (HEALTHY) Applied policy {c6b1c82c-3466-4590-a533-90a7777c2cea}
   │     └─ type: INPUT

❯ sudo elastic-agent status --output=full
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: 56f54ff3-e21e-4be8-b193-e9b669e3177a
   │  ├─ version: 8.15.0
   │  └─ commit: b7f8e2a061203c997df73ccc9e4447e3184907db
   ├─ endpoint-default
   │  ├─ status: (HEALTHY) Healthy: communicating with endpoint service
   │  ├─ endpoint-default
   │  │  ├─ status: (HEALTHY) Applied policy {dcbd8b3f-4081-46c4-95fa-bc48263543d3}
   │  │  └─ type: OUTPUT
   │  └─ endpoint-default-dcbd8b3f-4081-46c4-95fa-bc48263543d3
   │     ├─ status: (HEALTHY) Applied policy {dcbd8b3f-4081-46c4-95fa-bc48263543d3
amolnater-qasource commented 2 months ago

Hi @cmacknz

Thank you for looking into this issue.

We have revalidated this issue on 8.15.0 BC1; please find further observations below:

Agent was installed with: elastic-agent.zip

Agent assigned to new policy (these policies were duplicated from the policy with which the agent was installed): elastic-agent (1).zip elastic-agent (2).zip

Agent diagnostics: elastic-agent-diagnostics-2024-07-10T05-28-18Z-00.zip

Could you please confirm if there's any conflict with the shared policies?

Please let us know if anything else is required from our end. Thanks!

cmacknz commented 1 month ago

I see the same as above: we only get the last policy update, which removes everything:

{"log.level":"info","@timestamp":"2024-07-10T05:27:25.719Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).checkAndLogUpdate","file.name":"coordinator/coordinator.go","file.line":1479},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{"removed":["system/metrics-monitoring","endpoint-default","log-default","system/metrics-default"],"updated":["beat/metrics-monitoring: [(beat/metrics-monitoring-metrics-monitoring-beats: updated)]","http/metrics-monitoring: [(http/metrics-monitoring-metrics-monitoring-agent: updated)]","filestream-monitoring: [(filestream-monitoring-filestream-monitoring-agent: updated)]"],"count":3},"outputs":{}},"ecs.version":"1.6.0"}
juliaElastic commented 1 month ago

I tried to reproduce the issue but couldn't yet; the agent receives the endpoint input as expected after the policy reassignment.

@amolnater-qasource Could we get the output of these queries from Kibana console if the deployment is still running? Or if you can share the deployment ID, I can look at it in admin.

GET .fleet-actions/_search?size=100

GET .fleet-policies/_search?size=100

GET kbn:/api/fleet/agents/action_status
amolnater-qasource commented 1 month ago

Hi @juliaElastic Thank you for looking into this issue.

Please find below the output for the queries:

GET .fleet-actions/_search?size=100: Actions.txt

GET kbn:/api/fleet/agents/action_status: Status.txt

We tried running GET .fleet-policies/_search?size=100, but it crashed the browser every time we ran the query.

Further, we will share the deployment ID with you over Slack.

Thanks!

juliaElastic commented 1 month ago

Thanks Amol.

Looking at the data, I found a potential bug. Here is the policy the agent is reassigned to: it seems that agent policy tamper protection was switched on but the policy revision was not bumped, so .fleet-policies has two docs with revision_idx:1, coordinator_idx:1, one of which doesn't contain inputs. Fleet-server then effectively picks one of them at random, and it happens to be the one without inputs. I'll check the code in Kibana.

I could reproduce this locally with the 8.15 fleet-server (8.16 no longer has the coordinator, so no duplicate documents are created). It seems to happen only when copying a policy with tamper protection enabled, which is when the policy ends up at revision:1. Otherwise we cannot create an agent policy with tamper protection, since we have to add the endpoint integration first, and the revision ends up > 1.

Reduced the impact to medium, as I think the copy policy feature is not used that frequently.

 GET .fleet-policies/_search?q=30e3cb63-3e88-448d-ad4b-a392fe117849

    {
        "_index": ".fleet-policies-7",
        "_id": "7dd3b7e4-2ae3-516b-990e-8a645bb6c434",
        "_score": 4.7706842,
        "_source": {
          "@timestamp": "2024-07-09T11:09:13.259Z",
          "revision_idx": 1,
          "coordinator_idx": 0,
          "data": {
            "id": "30e3cb63-3e88-448d-ad4b-a392fe117849",
            "inputs": [
              {
                "id": "8a9c6a8f-42b4-468a-94a4-2dd3378fcdea",
                "revision": 1,
                "name": "ED mac (copy)",
                "type": "endpoint",
                "data_stream": {
                  "namespace": "mac"
                },
                "use_output": "default",
                "package_policy_id": "8a9c6a8f-42b4-468a-94a4-2dd3378fcdea",
                ...
      },
      {
        "_index": ".fleet-policies-7",
        "_id": "7Pkul5ABaxNQL_pV7eMj",
        "_score": 4.7706842,
        "_source": {
          "coordinator_idx": 1,
          "data": {
            "fleet": {
              "hosts": [
                "https://0ea335ba862e49e18458a3ef58111cc8.fleet.us-west2.gcp.elastic-cloud.com:443"
              ]
            },
            "id": "30e3cb63-3e88-448d-ad4b-a392fe117849",
          },
          "default_fleet_server": false,
          "policy_id": "30e3cb63-3e88-448d-ad4b-a392fe117849",
          "revision_idx": 1,
          "@timestamp": "2024-07-09T11:09:10.709Z"
        }
      },
      {
        "_index": ".fleet-policies-7",
        "_id": "7fkul5ABaxNQL_pV7uMs",
        "_score": 4.7706842,
        "_source": {
          "coordinator_idx": 1,
          "data": {
            "id": "30e3cb63-3e88-448d-ad4b-a392fe117849",
            "inputs": [
              {
                "id": "8a9c6a8f-42b4-468a-94a4-2dd3378fcdea",
                "integration_config": {
                  "endpointConfig": {
                    "preset": "EDRComplete"
                  },
                  "type": "endpoint"
                },
                "meta": {
                  "package": {
                    "name": "endpoint",
                    "version": "8.15.0"
                  }
                },
                "name": "ED mac (copy)",
                "package_policy_id": "8a9c6a8f-42b4-468a-94a4-2dd3378fcdea",
                "revision": 1,
                "type": "endpoint",
                "use_output": "default"
              },
          },
          "default_fleet_server": false,
          "policy_id": "30e3cb63-3e88-448d-ad4b-a392fe117849",
          "revision_idx": 1,
          "@timestamp": "2024-07-09T11:09:13.259Z"
        }
      }
elasticmachine commented 1 month ago

Pinging @elastic/fleet (Team:Fleet)

amolnater-qasource commented 1 month ago

Hi Team,

We have revalidated this issue on the latest 8.15.0 BC6 Kibana cloud environment and found it fixed.

Observations:

Logs: elastic-agent-diagnostics-2024-08-07T10-37-57Z-00.zip

Build details:

VERSION: 8.15.0 BC6
BUILD: 76360
COMMIT: 8aa0b59da12c996e3048d887546667ee6e15c7f

Hence, we are marking this issue as QA:Validated.

Thanks!