elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Unable to disable Kubernetes Watchers and Leader Election for Fleet Managed Agents #5558

Open btrieger opened 1 month ago

btrieger commented 1 month ago


I am attempting to deploy Elastic Agent on Kubernetes to run the Threat Intel integration and other API-based integrations, and I am receiving errors about missing permissions in Kubernetes. Since I am not running the Kubernetes integration or monitoring Kubernetes itself, I shouldn't need access to the Kubernetes API server to watch nodes, namespaces, and pods, and I shouldn't need to create a lease without the Kubernetes integration.

I am attempting to run Fleet-managed agents on Kubernetes. I have deployed the following manifests:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elastic-agent-k8s-test
  namespace: elastic
  labels:
    app: elastic-agent-k8s-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: elastic-agent-k8s-test
  template:
    metadata:
      labels:
        app: elastic-agent-k8s-test
    spec:
      serviceAccountName: elastic-agent
      hostNetwork: false
      dnsPolicy: ClusterFirst
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
        - name: elastic-agent
          image: docker.elastic.co/beats/elastic-agent:8.15.1
          args: ["-c", "/etc/elastic-agent/agent.yml"]
          env:
            # Set to 1 for enrollment into Fleet server. If not set, Elastic Agent is run in standalone mode
            - name: FLEET_ENROLL
              value: "1"
            # Set to true to communicate with Fleet with either insecure HTTP or unverified HTTPS
            - name: FLEET_INSECURE
              value: "false"
            # Fleet Server URL to enroll the Elastic Agent into
            # FLEET_URL can be found in Kibana, go to Management > Fleet > Settings
            - name: FLEET_URL
              value: "https://076433a0edff4346b220c692d2e9c56a.fleet.us-central1.gcp.cloud.es.io:443"
            # Elasticsearch API key used to enroll Elastic Agents in Fleet (https://www.elastic.co/guide/en/fleet/current/fleet-enrollment-tokens.html#fleet-enrollment-tokens)
            # If FLEET_ENROLLMENT_TOKEN is empty then KIBANA_HOST, KIBANA_FLEET_USERNAME, KIBANA_FLEET_PASSWORD are needed
            - name: FLEET_ENROLLMENT_TOKEN
              value: "CHANGEME"
            - name: FLEET_SERVER_POLICY_ID
              value: "8c813cf6-a816-4722-be51-7341a192ba2e"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: STATE_PATH
              value: "/usr/share/elastic-agent/state"
            # The following ELASTIC_NETINFO:false variable will disable the netinfo.enabled option of add-host-metadata processor. This will remove fields host.ip and host.mac.
            # For more info: https://www.elastic.co/guide/en/beats/metricbeat/current/add-host-metadata.html
            - name: ELASTIC_NETINFO
              value: "false"
          securityContext:
            runAsUser: 1000
            runAsGroup: 1000
          resources:
            limits:
              memory: 700Mi
            requests:
              cpu: 100m
              memory: 400Mi
          volumeMounts:
            - name: agent-data
              mountPath: /usr/share/elastic-agent/state
            - name: datastreams
              mountPath: /etc/elastic-agent/agent.yml
              subPath: agent.yml
      volumes:
      - name: datastreams
        configMap:
          defaultMode: 0640
          name: agent-node-datastreams
  volumeClaimTemplates:
  - metadata:
      name: agent-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-agent
  namespace: elastic
  labels:
    k8s-app: elastic-agent
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: elastic
  labels:
    k8s-app: elastic-agent-k8s-test
data:
  agent.yml: |-
    providers.kubernetes_leaderelection.enabled: false
    providers.kubernetes.resources.node.enabled: false
    providers.kubernetes.resources.pod.enabled: false
    fleet.enabled: true
---

I have also tried with the configmap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: elastic
  labels:
    k8s-app: elastic-agent-k8s-test
data:
  agent.yml: |-
    providers.kubernetes_leaderelection.enabled: false
    providers.kubernetes.enabled: false
    fleet.enabled: true

and:

apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: elastic
  labels:
    k8s-app: elastic-agent-k8s-test
data:
  agent.yml: |-
    providers.kubernetes_leaderelection.enabled: false
    providers.kubernetes:
      add_resource_metadata:
        node.enabled: false
        namespace.enabled: false
    fleet.enabled: true

The first time the pod starts, it fails with the following error:

Policy selected for enrollment:  8c813cf6-a816-4722-be51-7341a192ba2e
{"log.level":"info","@timestamp":"2024-09-18T15:56:40.779Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.(*enrollCmd).enrollWithBackoff","file.name":"cmd/enroll_cmd.go","file.line":518},"message":"Starting enrollment to URL: https://076433a0edff4346b220c692d2e9c56a.fleet.us-central1.gcp.cloud.es.io:443/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-09-18T15:56:41.857Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.(*enrollCmd).enrollWithBackoff","file.name":"cmd/enroll_cmd.go","file.line":524},"message":"1st enrollment attempt failed, retrying enrolling to URL: https://076433a0edff4346b220c692d2e9c56a.fleet.us-central1.gcp.cloud.es.io:443/ with exponential backoff (init 1s, max 10s)","ecs.version":"1.6.0"}
Error: fail to enroll: failed to store agent config: could not save enrollment information: could not backup /etc/elastic-agent/agent.yml: rename /etc/elastic-agent/agent.yml /etc/elastic-agent/agent.yml.2024-09-18T15-56-41.857.bak: permission denied

After it restarts, the pod runs without leader election but repeatedly throws the errors below:

{"log.level":"error","@timestamp":"2024-09-18T15:58:13.293Z","message":"W0918 15:58:13.293281      85 reflector.go:539] k8s.io/client-go@v0.29.5/tools/cache/reflector.go:229: failed to list *v1.Node: nodes \"gk3-brieger-autopilot-nap-1s7i913f-4ca63d27-pphd\" is forbidden: User \"system:serviceaccount:elastic:elastic-agent\" cannot list resource \"nodes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log":{"source":"http/metrics-monitoring"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-18T15:58:13.293Z","message":"E0918 15:58:13.293334      85 reflector.go:147] k8s.io/client-go@v0.29.5/tools/cache/reflector.go:229: Failed to watch *v1.Node: failed to list *v1.Node: nodes \"gk3-brieger-autopilot-nap-1s7i913f-4ca63d27-pphd\" is forbidden: User \"system:serviceaccount:elastic:elastic-agent\" cannot list resource \"nodes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log":{"source":"http/metrics-monitoring"},"ecs.version":"1.6.0"}

I would expect to be able to disable leader election without the pod having to throw an error and restart, and to be able to disable these watchers.

pkoutsovasilis commented 1 month ago

So I was able to reproduce the ConfigMap issue on my end as well.

  1. In your case @btrieger, the first issue is that you run elastic-agent as 1000:1000, but the ConfigMap is mounted as 0:0 with mode 0640, so other groups (e.g. the elastic-agent one) have no access to it at all (this is why you see the permission denied error).
  2. Even if the mode were 0644, during Fleet enrollment we first try to rotate the config file to a .bak by renaming it. However, renaming a mount point is not possible, and we get a device or resource busy error (see the sketch after this list).
  3. Even if that weren't the case, the replaceWith invoked by ReplaceOnSuccessStore here has a value of application.DefaultAgentFleetConfig, so I don't see how a custom-supplied config merges with the Fleet one, but maybe I am missing something?!
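
For point 2, here is a minimal sketch of what the backup-by-rename step runs into (illustrative Go only, not the agent's actual enrollment code): a ConfigMap file mounted via subPath, like /etc/elastic-agent/agent.yml above, is itself a mount point, so the rename fails with device or resource busy.

// Illustrative sketch only (not the agent's enrollment code): rotate a config
// file to a timestamped .bak by renaming it. If the file itself is a mount
// point (e.g. a ConfigMap mounted via subPath), os.Rename fails with EBUSY.
package main

import (
    "errors"
    "fmt"
    "os"
    "syscall"
    "time"
)

func backupByRename(path string) error {
    bak := fmt.Sprintf("%s.%s.bak", path, time.Now().UTC().Format("2006-01-02T15-04-05"))
    if err := os.Rename(path, bak); err != nil {
        if errors.Is(err, syscall.EBUSY) {
            return fmt.Errorf("cannot rotate %s because it is a mount point: %w", path, err)
        }
        return err
    }
    return nil
}

func main() {
    if err := backupByRename("/etc/elastic-agent/agent.yml"); err != nil {
        fmt.Println(err)
    }
}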

Thus to support this feature we should:

  1. cp any config that is external to the agent into the state folder, since we have permission to write there, and work our way from there (see the sketch below this list)
  2. If it doesn't exist already, fabricate a merge policy that combines the statically defined config with the one that comes from Fleet?!
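
For point 1, roughly something like this hypothetical helper (the paths are only illustrative):

// Hypothetical sketch for point 1: copy the externally mounted config into the
// writable state directory, so that later rotation happens on a regular file
// instead of a mount point.
package main

import (
    "fmt"
    "io"
    "os"
    "path/filepath"
)

func copyConfigToState(src, stateDir string) (string, error) {
    dst := filepath.Join(stateDir, filepath.Base(src))

    in, err := os.Open(src)
    if err != nil {
        return "", err
    }
    defer in.Close()

    out, err := os.OpenFile(dst, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0o600)
    if err != nil {
        return "", err
    }
    defer out.Close()

    if _, err := io.Copy(out, in); err != nil {
        return "", err
    }
    return dst, nil
}

func main() {
    dst, err := copyConfigToState("/etc/elastic-agent/agent.yml", "/usr/share/elastic-agent/state")
    if err != nil {
        fmt.Println("copy failed:", err)
        return
    }
    fmt.Println("working copy at", dst)
}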

I am also going to follow up about the list Nodes permissions 🙂

btrieger commented 1 month ago

Ah, my apologies on the 0640. Doesn't fsGroup mount it as 0:1000 instead of 0:0, so that I would have access to it? I appear to be able to read it, since it disables leader election after the first restart.

In any case, I figure the solution is to find a way to add configs that can be passed down to an agent from Fleet, or to merge the ConfigMap with what Fleet provides.

cmacknz commented 1 month ago

Even if that weren't the case, the replaceWith invoked by ReplaceOnSuccessStore here has a value of application.DefaultAgentFleetConfig, so I don't see how a custom-supplied config merges with the Fleet one, but maybe I am missing something?!

In https://github.com/elastic/elastic-agent/pull/4166 a change was made so that, if the new config contains the content of the default Fleet config, we don't do the replacement by rotation. This is not obvious at all unless you know this PR exists. The default Fleet config only contains fleet.enabled: true, so having that should have been enough to get past it. Quoting the PR:

Skipping replacing the current agent configuration with default fleet configuration upon enrollment, in case the current configuration already contains the configuration from the default fleet configuration.
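
In other words, the check amounts to something like the following sketch (an illustration of the concept only, not the code from the PR):

// Conceptual sketch of the PR's behaviour: skip the rotation when the
// user-supplied config already contains everything in the default Fleet
// config (which is just fleet.enabled: true).
package main

import (
    "fmt"
    "reflect"

    "gopkg.in/yaml.v3"
)

// contains reports whether every key in want appears in got with the same
// value, descending into nested maps.
func contains(got, want map[string]interface{}) bool {
    for k, wv := range want {
        gv, ok := got[k]
        if !ok {
            return false
        }
        if wm, isMap := wv.(map[string]interface{}); isMap {
            gm, gotMap := gv.(map[string]interface{})
            if !gotMap || !contains(gm, wm) {
                return false
            }
            continue
        }
        if !reflect.DeepEqual(gv, wv) {
            return false
        }
    }
    return true
}

func main() {
    var userCfg, defaultFleet map[string]interface{}
    _ = yaml.Unmarshal([]byte("fleet:\n  enabled: true\nproviders:\n  kubernetes:\n    enabled: false"), &userCfg)
    _ = yaml.Unmarshal([]byte("fleet:\n  enabled: true"), &defaultFleet)
    fmt.Println(contains(userCfg, defaultFleet)) // true, so the rotation would be skipped
}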

btrieger commented 1 month ago

I added a volume mount to set the /etc/elastic-agent folder to be owned by 0:1000 so that I would be able to write to it, and I can confirm the device or resource busy error is the result. I also updated the mount for the ConfigMap to be 0660 and confirmed elastic-agent is the group, so it has read and write access.

pkoutsovasilis commented 1 month ago

Oh I see @cmacknz, it is the other way around from what I understood the diff to be! I validated that by adding fleet.enabled: true to my static config I get no rotation and thus no error. However, all my testing is with elastic-agent:8.16.0-SNAPSHOT, which does "extra things" in the agent state path. @btrieger I can see that you already have fleet.enabled: true in your config, so could you send me the exact error you are seeing, and maybe in parallel give 8.16.0-SNAPSHOT a go? 🙂

btrieger commented 1 month ago

Yeah, I can share the error. I read through the code and then updated my config to be:

fleet:
  enabled: true

instead of

fleet.enabled: true

and that made it skip the replace. Both are valid YAML, but the two forms cause the diff to not match.

btrieger commented 1 month ago

Here is the error on 8.15.0:

Policy selected for enrollment:  8c813cf6-a816-4722-be51-7341a192ba2e
{"log.level":"info","@timestamp":"2024-09-19T17:20:28.581Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.(*enrollCmd).enrollWithBackoff","file.name":"cmd/enroll_cmd.go","file.line":518},"message":"Starting enrollment to URL: https://076433a0edff4346b220c692d2e9c56a.fleet.us-central1.gcp.cloud.es.io:443/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-09-19T17:20:29.591Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.(*enrollCmd).enrollWithBackoff","file.name":"cmd/enroll_cmd.go","file.line":524},"message":"1st enrollment attempt failed, retrying enrolling to URL: https://076433a0edff4346b220c692d2e9c56a.fleet.us-central1.gcp.cloud.es.io:443/ with exponential backoff (init 1s, max 10s)","ecs.version":"1.6.0"}
Error: fail to enroll: failed to store agent config: could not save enrollment information: could not backup /etc/elastic-agent/agent.yml: rename /etc/elastic-agent/agent.yml /etc/elastic-agent/agent.yml.2024-09-19T17-20-29.5911.bak: device or resource busy

and here is the configmap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: elastic
  labels:
    k8s-app: elastic-agent-k8s-test
data:
  agent.yml: |-
    fleet.enabled: true
    providers.kubernetes_leaderelection.enabled: false
    providers.kubernetes.resources.node.enabled: false
    providers.kubernetes.resources.pod.enabled: false
---

When I did:

apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: elastic
  labels:
    k8s-app: elastic-agent-k8s-test
data:
  agent.yml: |-
    fleet:
      enabled: true
    providers.kubernetes_leaderelection.enabled: false
    providers.kubernetes.resources.node.enabled: false
    providers.kubernetes.resources.pod.enabled: false
---

It did not throw the error.

btrieger commented 1 month ago

I can't currently test 8.16.0, as my Elastic Cloud cluster is on 8.15.1, which is the latest available.

pkoutsovasilis commented 1 month ago

Hmmm, yep, I think that by default gopkg.in/yaml.v3 (the package used to calculate the diff) does not split keys by dots when unmarshalling YAML into a map[string]interface{}, so that makes sense.
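
For illustration, assuming the diff works on maps produced by gopkg.in/yaml.v3: the dotted and nested forms are both valid YAML, but they unmarshal into differently shaped maps, so they never compare as equal.

// Quick illustration: gopkg.in/yaml.v3 keeps "fleet.enabled" as one literal
// key instead of splitting it into a nested map.
package main

import (
    "fmt"

    "gopkg.in/yaml.v3"
)

func main() {
    var dotted, nested map[string]interface{}
    _ = yaml.Unmarshal([]byte("fleet.enabled: true"), &dotted)
    _ = yaml.Unmarshal([]byte("fleet:\n  enabled: true"), &nested)

    fmt.Printf("%#v\n", dotted) // a single flat key: "fleet.enabled" -> true
    fmt.Printf("%#v\n", nested) // a nested map: "fleet" -> {"enabled": true}
}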

I can't currently test 8.16.0, as my Elastic Cloud cluster is on 8.15.1, which is the latest available.

No need to, since the solution to this issue is the one you already figured out 🙂

btrieger commented 1 month ago

Unfortunately, I still haven't been able to get the node watcher turned off either. I've tried a bunch of different settings to disable it.

pkoutsovasilis commented 1 month ago

@btrieger after looking at the code I found a really puzzling (to me) piece of validation code here which seems to always create watchers for pods and nodes by default. Now, I understand that this might indeed be the desired default behaviour, but there should be a top-level enabled key in the kubernetes provider so that, when the user sets it to false, nothing gets enabled. @cmacknz thoughts?

Now @btrieger, since you mentioned only node permissions and not pod ones 😄 the following config will make the nodes list error go away, but we can't get rid of both that and the pods watcher at the same time:

      providers:
        kubernetes_leaderelection:
          enabled: false
        kubernetes:
          resources:
            pod:
              enabled: true
            node:
              enabled: false
btrieger commented 1 month ago

Ah, so there is a bug where I can't disable both, only one? It appeared it also needed get, watch, and list access for namespaces.

pkoutsovasilis commented 1 month ago

Ah, so there is a bug where I can't disable both, only one? It appeared it also needed get, watch, and list access for namespaces.

Then let's try to be even more aggressive with disabling 🙂

      providers:
        kubernetes_leaderelection:
          enabled: false
        kubernetes:
          add_resource_metadata:
            node:
              enabled: false
            namespace:
              enabled: false
            deployment: false
            cronjob: false
          resources:
            pod:
              enabled: true
            node:
              enabled: false
btrieger commented 1 month ago

I will try it again when I am back at my computer. But what if I want to disable pods, namespaces, and nodes? Is that doable, or is there a bug? Essentially I just want to disable the kubernetes provider.

pkoutsovasilis commented 1 month ago

I will try it again when I am back at my computer. But what if I want to disable pods, namespaces, and nodes? Is that doable, or is there a bug? Essentially I just want to disable the kubernetes provider.

From what I am seeing, currently no, it's not doable. But hey, I missed the fleet.enabled: true above, so 🤞 I missed something here as well!?!

However, I am thinking that even if we manage to disable them at the agent level, the add_kubernetes_metadata processor is enabled by default in filebeat and metricbeat when invoked by elastic-agent, and I am afraid that the same permission errors will surface from there. But I am getting ahead of myself 🙂 There is definitely something to discuss with the team here; thank you for your understanding and patience.

btrieger commented 1 month ago

I got a chance to test it. It looks like I am still getting the errors, but it could be the add_kubernetes_metadata processor, I'm not sure:

{"log.level":"error","@timestamp":"2024-09-20T13:12:37.123Z","message":"Error fetching data for metricset beat.stats: error making http request: Get \"http://unix/stats\": dial unix /usr/share/elastic-agent/state/data/tmp/xTEtpJ7117ppc6OYvJCaYHbDW8mLjXGe.sock: connect: connection refused","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"log.origin":{"file.line":256,"file.name":"module/wrapper.go","function":"github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).fetch"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-20T13:12:37.123Z","message":"Error fetching data for metricset beat.stats: error making http request: Get \"http://unix/stats\": dial unix /usr/share/elastic-agent/state/data/tmp/akSPbdqgaHaTY0_J01-dsfYK6JpMz2zn.sock: connect: connection refused","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"log.origin":{"file.line":256,"file.name":"module/wrapper.go","function":"github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).fetch"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-20T13:12:38.014Z","message":"W0920 13:12:38.011880      53 reflector.go:539] k8s.io/client-go@v0.29.5/tools/cache/reflector.go:229: failed to list *v1.Node: nodes \"gk3-brieger-autopilot-nap-1s7i913f-4ca63d27-pphd\" is forbidden: User \"system:serviceaccount:elastic:elastic-agent\" cannot list resource \"nodes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-20T13:12:38.014Z","message":"E0920 13:12:38.011927      53 reflector.go:147] k8s.io/client-go@v0.29.5/tools/cache/reflector.go:229: Failed to watch *v1.Node: failed to list *v1.Node: nodes \"gk3-brieger-autopilot-nap-1s7i913f-4ca63d27-pphd\" is forbidden: User \"system:serviceaccount:elastic:elastic-agent\" cannot list resource \"nodes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-20T13:12:38.021Z","message":"W0920 13:12:38.021120      28 reflector.go:539] k8s.io/client-go@v0.29.5/tools/cache/reflector.go:229: failed to list *v1.Node: nodes \"gk3-brieger-autopilot-nap-1s7i913f-4ca63d27-pphd\" is forbidden: User \"system:serviceaccount:elastic:elastic-agent\" cannot list resource \"nodes\" in API group \"\" at the cluster scope","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"httpjson-default","type":"httpjson"},"log":{"source":"httpjson-default"},"ecs.version":"1.6.0"}
pkoutsovasilis commented 1 month ago

Yep, I think this is coming from metricbeat now: ..."component":{"binary":"metricbeat"...

cmacknz commented 1 month ago

Yes, that is the add_kubernetes_metadata processor, which currently can't be turned off. This requires https://github.com/elastic/elastic-agent/issues/4670.

elasticmachine commented 1 month ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

cmacknz commented 1 month ago

A faster way around https://github.com/elastic/elastic-agent/issues/4670, since it is complex, would be to expose the parts we need to configure via an env var. Something similar was done for the add_cloud_metadata processor.

strawgate commented 1 month ago

We have proposed that the customer set automountServiceAccountToken: false in the Kubernetes manifest. This appears to prevent the K8s metadata and providers from starting and is ideal when the customer does not want to monitor K8s with their agent pods (for example when they are running an S3/SQS workload).

See the last line in this partial snippet for the location of the addition:

---
# For more information https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-managed-by-fleet.html
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-agent
  namespace: kube-system
  labels:
    app: elastic-agent
spec:
  selector:
    matchLabels:
      app: elastic-agent
  template:
    metadata:
      labels:
        app: elastic-agent
    spec:
      # Tolerations are needed to run Elastic Agent on Kubernetes control-plane nodes.
      # Agents running on control-plane nodes collect metrics from the control plane components (scheduler, controller manager) of Kubernetes
      automountServiceAccountToken: false
strawgate commented 1 month ago

The above proposed solution solved the customer's problem, as removing the service account token prevents the provider from starting in the first place.