elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Elastic Agent - Agent ID is not retained in kubernetes elastic agent daemonset #5185

Open jade-lucas opened 1 year ago

jade-lucas commented 1 year ago

Bug Report

What did you do?

Installed Elastic Agents on a Kubernetes cluster via DaemonSet as directed by the Fleet UI. Every time an elastic-agent pod is restarted, the agent ID changes, which I think results in many offline agents showing in the Fleet UI over time.

What did you expect to see? The agent to retain its ID, so that only one agent appears in the Fleet UI.

What did you see instead? Under which circumstances? After every pod restart, a new agent appears even though it is on the same host.

pebrc commented 1 year ago

Probably a question for the @elastic/fleet team. I am assuming the agent identity is tied to the Pod name rather than the k8s host.

johanlundberg92 commented 6 months ago

Anyone found a way to solve this? Found out about this issue today when evaluating ECK.

kpollich commented 6 months ago

cc @elastic/elastic-agent as well - I think we've seen similar issues around IDs in containerized environments before, but I wasn't able to find a GitHub issue that summarized where we stand. Does anyone on the agent team know of anything that might be of interest here?

cmacknz commented 6 months ago

Fleet managed agents are stateful, and need to persist their Fleet API key and agent ID outside of the pod file system to prevent this.

I thought ECK did this by default by putting the agent state in a host path volume mounted from the node file system: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-elastic-agent-configuration-examples.html#k8s_storing_local_state_in_host_path_volume

This is the same thing our reference YAML does for deploying a managed Elastic agent manually: https://github.com/elastic/elastic-agent/pull/2550/files

Using a host path volume for this works consistently when the agent is a DaemonSet, because there is always one agent pod per node and the state lives in the file system of each node.

If you are using another deployment type, there is no natural affinity between the host path volume and the running agent (if it were a Deployment, the agent could end up on a different node every time).
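For reference, the host path state pattern described above looks roughly like this in a DaemonSet pod spec (a minimal sketch; the mount path matches what ECK uses by default, while the node-local path and volume name are illustrative):

```yaml
# Sketch: persist agent state on the node so the agent ID and
# Fleet API key survive pod restarts (DaemonSet: one agent pod per node).
containers:
- name: agent
  volumeMounts:
  - name: agent-data
    mountPath: /usr/share/elastic-agent/state
volumes:
- name: agent-data
  hostPath:
    path: /var/lib/elastic-agent/state  # node-local directory; illustrative
    type: DirectoryOrCreate
```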

lpeter91 commented 2 months ago

I can confirm this issue is still present.

It can be easily reproduced by applying the manifest below, then deleting the agent pod. The recreated pod will cause a new agent to appear in the Kibana Fleet UI. This manifest is exactly the same as the simplest quickstart example from the documentation, only with the emptyDir volume removed. This way the agent state ends up on a host path volume, yet the issue persists.

Manifest:

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server-quickstart
  namespace: default
spec:
  version: 8.14.3
  kibanaRef:
    name: kibana-quickstart
  elasticsearchRefs:
  - name: elasticsearch-quickstart
  mode: fleet
  fleetServerEnabled: true
  policyID: eck-fleet-server
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0 
---
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: elastic-agent-quickstart
  namespace: default
spec:
  version: 8.14.3
  kibanaRef:
    name: kibana-quickstart
  fleetServerRef:
    name: fleet-server-quickstart
  mode: fleet
  policyID: eck-agent
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0 
---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana-quickstart
  namespace: default
spec:
  version: 8.14.3
  count: 1
  elasticsearchRef:
    name: elasticsearch-quickstart
  config:
    xpack.fleet.agents.elasticsearch.hosts: ["https://elasticsearch-quickstart-es-http.default.svc:9200"]
    xpack.fleet.agents.fleet_server.hosts: ["https://fleet-server-quickstart-agent-http.default.svc:8220"]
    xpack.fleet.packages:
      - name: system
        version: latest
      - name: elastic_agent
        version: latest
      - name: fleet_server
        version: latest
    xpack.fleet.agentPolicies:
      - name: Fleet Server on ECK policy
        id: eck-fleet-server
        namespace: default
        monitoring_enabled:
          - logs
          - metrics
        unenroll_timeout: 900
        package_policies:
        - name: fleet_server-1
          id: fleet_server-1
          package:
            name: fleet_server
      - name: Elastic Agent on ECK policy
        id: eck-agent
        namespace: default
        monitoring_enabled:
          - logs
          - metrics
        unenroll_timeout: 900
        package_policies:
          - name: system-1
            id: system-1
            package:
              name: system
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-quickstart
  namespace: default
spec:
  version: 8.14.3
  nodeSets:
  - name: default
    count: 3
    config:
      node.store.allow_mmap: false
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: elastic-agent
rules:
- apiGroups: [""] # "" indicates the core API group
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
  - watch
  - list
- apiGroups: ["coordination.k8s.io"]
  resources:
  - leases
  verbs:
  - get
  - create
  - update
- apiGroups: ["apps"]
  resources:
  - replicasets
  verbs:
  - list
  - watch
- apiGroups: ["batch"]
  resources:
  - jobs
  verbs:
  - list
  - watch
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-agent
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: elastic-agent
subjects:
- kind: ServiceAccount
  name: elastic-agent
  namespace: default
roleRef:
  kind: ClusterRole
  name: elastic-agent
  apiGroup: rbac.authorization.k8s.io

Examining the pod (kubectl get pod elastic-agent-quickstart-agent-2x6rv -o yaml) shows the host volume mount:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    agent.k8s.elastic.co/config-hash: "777581219"
  creationTimestamp: "2024-07-19T18:40:48Z"
  generateName: elastic-agent-quickstart-agent-
  labels:
    agent.k8s.elastic.co/name: elastic-agent-quickstart
    agent.k8s.elastic.co/version: 8.14.3
    common.k8s.elastic.co/type: agent
    controller-revision-hash: 6f567dd69d
    pod-template-generation: "1"
  name: elastic-agent-quickstart-agent-2x6rv
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: elastic-agent-quickstart-agent
    uid: b53e17d1-a4d3-4163-b66a-6250524a387a
  resourceVersion: "1952"
  uid: 1b2cfba5-dd27-4085-8eab-3c5205b9824b
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - minikube
  automountServiceAccountToken: true
  containers:
  - command:
    - /usr/bin/env
    - bash
    - -c
    - |
      #!/usr/bin/env bash
      set -e
      if [[ -f /mnt/elastic-internal/elasticsearch-association/default/elasticsearch-quickstart/certs/ca.crt ]]; then
        if [[ -f /usr/bin/update-ca-trust ]]; then
          cp /mnt/elastic-internal/elasticsearch-association/default/elasticsearch-quickstart/certs/ca.crt /etc/pki/ca-trust/source/anchors/
          /usr/bin/update-ca-trust
        elif [[ -f /usr/sbin/update-ca-certificates ]]; then
          cp /mnt/elastic-internal/elasticsearch-association/default/elasticsearch-quickstart/certs/ca.crt /usr/local/share/ca-certificates/
          /usr/sbin/update-ca-certificates
        fi
      fi
      /usr/bin/tini -- /usr/local/bin/docker-entrypoint -e
    env:
    - name: FLEET_CA
      value: /mnt/elastic-internal/fleetserver-association/default/fleet-server-quickstart/certs/ca.crt
    - name: FLEET_ENROLL
      value: "true"
    - name: FLEET_ENROLLMENT_TOKEN
      valueFrom:
        secretKeyRef:
          key: FLEET_ENROLLMENT_TOKEN
          name: elastic-agent-quickstart-agent-envvars
          optional: false
    - name: FLEET_URL
      value: https://fleet-server-quickstart-agent-http.default.svc:8220
    - name: CONFIG_PATH
      value: /usr/share/elastic-agent
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    image: docker.elastic.co/beats/elastic-agent:8.14.3
    imagePullPolicy: IfNotPresent
    name: agent
    resources:
      limits:
        cpu: 200m
        memory: 1Gi
      requests:
        cpu: 200m
        memory: 1Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /usr/share/elastic-agent/state
      name: agent-data
    - mountPath: /etc/agent.yml
      name: config
      readOnly: true
      subPath: agent.yml
    - mountPath: /mnt/elastic-internal/elasticsearch-association/default/elasticsearch-quickstart/certs
      name: elasticsearch-certs
      readOnly: true
    - mountPath: /mnt/elastic-internal/fleetserver-association/default/fleet-server-quickstart/certs
      name: fleetserver-certs-1
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-l6br6
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: minikube
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    runAsUser: 0
  serviceAccount: elastic-agent
  serviceAccountName: elastic-agent
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - hostPath:
      path: /var/lib/elastic-agent/default/elastic-agent-quickstart/state
      type: DirectoryOrCreate
    name: agent-data
  - name: config
    secret:
      defaultMode: 288
      optional: false
      secretName: elastic-agent-quickstart-agent-config
  - name: elasticsearch-certs
    secret:
      defaultMode: 420
      optional: false
      secretName: fleet-server-quickstart-agent-es-default-elasticsearch-quickstart-ca
  - name: fleetserver-certs-1
    secret:
      defaultMode: 420
      optional: false
      secretName: elastic-agent-quickstart-agent-fleetserver-ca
  - name: kube-api-access-l6br6
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-07-19T18:40:50Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-07-19T18:40:48Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-07-19T18:40:50Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-07-19T18:40:50Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-07-19T18:40:48Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://00de922573eb6e4c1950efd4045865f58ed2f5df62d1accaf31b093bf9fe58b3
    image: docker.elastic.co/beats/elastic-agent:8.14.3
    imageID: docker-pullable://docker.elastic.co/beats/elastic-agent@sha256:78d39a9b321ff8cfd48bbe01d7439a38517958579f172f5bd19b4caa98ca074a
    lastState: {}
    name: agent
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-07-19T18:40:49Z"
  hostIP: 192.168.49.2
  hostIPs:
  - ip: 192.168.49.2
  phase: Running
  podIP: 10.244.0.10
  podIPs:
  - ip: 10.244.0.10
  qosClass: Guaranteed
  startTime: "2024-07-19T18:40:48Z"
lpeter91 commented 2 months ago

I think I found the issue and also a workaround/fix: I believe the Fleet API key and agent ID are stored in the fleet.enc file (difficult to verify since it is encrypted), which by default lives in /usr/share/elastic-agent and not in the persistent state subdirectory.

Setting the CONFIG_PATH environment variable to a path on a persistent volume seems to fix the problem. This changes the location of the fleet.enc and elastic-agent.yml files and of the vault directory. I think the latter two also need to be persistent, so this seems perfect. (elastic-agent.yml is modified to enable Fleet, and I'd guess the vault is used for decrypting the encrypted files.)

@cmacknz The reference YAML in the agent repo doesn't set this env either, so I would assume it has the same issue as well.
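The workaround can be sketched as an addition to the Agent resource from the reproduction manifest above (a sketch, not an official fix; the value reuses the /usr/share/elastic-agent/state mount point that ECK backs with a host path volume):

```yaml
# Workaround sketch: point CONFIG_PATH at the persistent state mount so
# fleet.enc, elastic-agent.yml, and the vault directory survive pod restarts.
daemonSet:
  podTemplate:
    spec:
      containers:
      - name: agent
        env:
        - name: CONFIG_PATH
          value: /usr/share/elastic-agent/state
```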

cmacknz commented 2 months ago

Thanks, the fleet.enc file should probably just be moved into the state path of the container to reduce the amount of configuration that needs to happen.

elasticmachine commented 2 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)