elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

[Docs] Provide Example of Persistent Storage Configuration for Elastic Agent/Fleet/Beats #5833

Open BenB196 opened 2 years ago

BenB196 commented 2 years ago

Proposal

Provide an example of setting up persistent storage for Elastic Agent/Fleet/Beats in the docs.

Use case. Why is this important?

Elastic Agents/Beats have a concept of a registry, which keeps track of state within the different modules (for example, the last event fetched from a source). If you don't set these up with persistence, then every time you recreate them, that state is lost and collection starts over. Providing an example of persistence would not only show how to do it, but also call out the fact that this can be an "issue".

thbkrkr commented 2 years ago

If you don't set these up with persistence, then every time you recreate them, the states are lost and start over.

ECK uses a hostPath volume to create a location for the runtime state of Agent to be persistent across container restarts. Are you saying it doesn't work?

By providing an example of persistence, it will not only show how to do it, but also call out the fact that this can be an "issue".

What kind of example of persistence are you thinking of? Could you elaborate more on how the current configuration may be a "problem"?
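For context, the default volume ECK injects looks roughly like the sketch below. This is a minimal excerpt, not copied from the operator: the volume name and mount path are taken from the manifests later in this thread, while the exact hostPath path and type are assumptions about ECK's defaults.

    # Sketch of the default agent-data volume (the hostPath path is an assumption).
    # State survives container restarts, but only while the Pod stays on the
    # same node, because hostPath lives on that node's local disk.
    spec:
      containers:
      - name: agent
        volumeMounts:
        - name: agent-data
          mountPath: /usr/share/elastic-agent/state
      volumes:
      - name: agent-data
        hostPath:
          path: /var/lib/elastic-agent/elastic-prod/test-persistence/state  # node-local
          type: DirectoryOrCreate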

BenB196 commented 2 years ago

Hi @thbkrkr, sure. The general use case is that I have an Elastic Agent which stores the state of something, but which has the possibility of moving to a different host.

I think today the hostPath approach only really considers a DaemonSet deployment, not a Deployment.

Example:

If I have a single Elastic Agent Pod as part of a Deployment with a policy which includes an integration like Okta, it will store the last query against Okta in the Agent's state. If I then drain/shut down the underlying host (either for an update or removal from the cluster), the Agent will move to a new node in the Kubernetes cluster, at which point it will start from scratch because the previous state file is lost.

This can be solved by adding a persistent volume that supports changing hosts (AWS EBS/EFS, NFS, etc...), rather than using hostPath.
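For illustration, a dynamically provisioned claim along those lines could be as small as the sketch below. The storageClassName is an assumption; substitute whatever network-backed class your cluster provides (EBS, EFS, NFS provisioners, etc.).

    # Hypothetical PVC on network-backed storage; the volume can follow the
    # Agent Pod to whichever node it is rescheduled on.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: agent-state
      namespace: elastic-prod
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: ebs-sc  # assumption: replace with your cluster's class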

This request is more to have this use case documented (and pointed out), and to provide an example of creating a PV/PVC from within the Elastic Agent spec.

thbkrkr commented 2 years ago

Indeed, hostPath does not work when an Agent Pod is re-scheduled on another k8s node. Makes sense, thanks for the explanation.

BenB196 commented 2 years ago

Updated 2022-08-08 to provide a working example.

After messing around with this, I think this example would work with some cleanup/generalization (see the distilled sketch after the full Deployment output below).

What I did:

  1. Create PV/PVC

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      annotations:
        pv.kubernetes.io/bind-completed: "yes"
        pv.kubernetes.io/bound-by-controller: "yes"
        volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
      finalizers:
      - kubernetes.io/pvc-protection
      name: test-persistence
      namespace: elastic-prod
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: longhorn
      volumeMode: Filesystem
      volumeName: pvc-1d0b8c21-e279-411f-ba73-41cfa203e87c
  2. Create Elastic Agent Deployment:

    apiVersion: agent.k8s.elastic.co/v1alpha1
    kind: Agent
    metadata:
      name: test-persistence
      namespace: elastic-prod
    spec:
      deployment:
        podTemplate:
          metadata:
            creationTimestamp: null
          spec:
            automountServiceAccountToken: true
            containers:
            - env:
              - name: FLEET_ENROLLMENT_TOKEN
                value: <snipped>
              - name: NODE_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.hostIP
              - name: POD_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
              name: agent
              resources:
                limits:
                  cpu: 400m
                  memory: 2Gi
                requests:
                  cpu: 400m
                  memory: 2Gi
              volumeMounts:
              - mountPath: /usr/share/elastic-agent/state
                name: agent-data
            securityContext:
              runAsUser: 0
            serviceAccountName: elastic-agent
            volumes:
            - name: agent-data
              persistentVolumeClaim:
                claimName: test-persistence
        replicas: 1
        strategy:
          type: Recreate
      fleetServerRef:
        name: fleet-server
      http:
        service:
          metadata: {}
          spec: {}
        tls:
          certificate: {}
      kibanaRef:
        name: kibana-prod
        namespace: kibana-prod
      mode: fleet
      version: 8.3.2
  3. Look at the generated Deployment and confirm that both the volume and volumeMount are there:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-persistence-agent
  namespace: elastic-prod
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      agent.k8s.elastic.co/name: test-persistence
      common.k8s.elastic.co/type: agent
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        agent.k8s.elastic.co/config-hash: "2850047686"
      creationTimestamp: null
      labels:
        agent.k8s.elastic.co/name: test-persistence
        agent.k8s.elastic.co/version: 8.3.2
        common.k8s.elastic.co/type: agent
    spec:
      automountServiceAccountToken: true
      containers:
      - command:
        - /usr/bin/env
        - bash
        - -c
        - |
          #!/usr/bin/env bash
          set -e
          if [[ -f /mnt/elastic-internal/elasticsearch-association/elastic-prod/es-prod/certs/ca.crt ]]; then
            if [[ -f /usr/bin/update-ca-trust ]]; then
              cp /mnt/elastic-internal/elasticsearch-association/elastic-prod/es-prod/certs/ca.crt /etc/pki/ca-trust/source/anchors/
              /usr/bin/update-ca-trust
            elif [[ -f /usr/sbin/update-ca-certificates ]]; then
              cp /mnt/elastic-internal/elasticsearch-association/elastic-prod/es-prod/certs/ca.crt /usr/local/share/ca-certificates/
              /usr/sbin/update-ca-certificates
            fi
          fi
          /usr/bin/tini -- /usr/local/bin/docker-entrypoint -e
        env:
        - name: FLEET_ENROLLMENT_TOKEN
          value: <snipped>
        - name: NODE_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: FLEET_CA
          value: /mnt/elastic-internal/fleetserver-association/elastic-prod/fleet-server/certs/ca.crt
        - name: FLEET_ENROLL
          value: "true"
        - name: FLEET_URL
          value: https://fleet-server-agent-http.elastic-prod.svc:8220
        - name: KIBANA_FLEET_CA
          value: /mnt/elastic-internal/kibana-association/kibana-prod/kibana-prod/certs/ca.crt
        - name: KIBANA_FLEET_HOST
          value: https://kibana-prod-kb-http.kibana-prod.svc:5601
        - name: KIBANA_FLEET_PASSWORD
          valueFrom:
            secretKeyRef:
              key: KIBANA_FLEET_PASSWORD
              name: test-persistence-agent-envvars
              optional: false
        - name: KIBANA_FLEET_SETUP
          value: "true"
        - name: KIBANA_FLEET_USERNAME
          valueFrom:
            secretKeyRef:
              key: KIBANA_FLEET_USERNAME
              name: test-persistence-agent-envvars
              optional: false
        - name: CONFIG_PATH
          value: /usr/share/elastic-agent
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: docker.elastic.co/beats/elastic-agent:8.3.2
        imagePullPolicy: IfNotPresent
        name: agent
        resources:
          limits:
            cpu: 400m
            memory: 2Gi
          requests:
            cpu: 400m
            memory: 2Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/elastic-agent/state
          name: agent-data
        - mountPath: /etc/agent.yml
          name: config
          readOnly: true
          subPath: agent.yml
        - mountPath: /mnt/elastic-internal/elasticsearch-association/elastic-prod/es-prod/certs
          name: elasticsearch-certs
          readOnly: true
        - mountPath: /mnt/elastic-internal/fleetserver-association/elastic-prod/fleet-server/certs
          name: fleetserver-certs-1
          readOnly: true
        - mountPath: /mnt/elastic-internal/kibana-association/kibana-prod/kibana-prod/certs
          name: kibana-certs-0
          readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsUser: 0
      serviceAccount: elastic-agent
      serviceAccountName: elastic-agent
      terminationGracePeriodSeconds: 30
      volumes:
      - name: agent-data
        persistentVolumeClaim:
          claimName: test-persistence
      - name: config
        secret:
          defaultMode: 288
          optional: false
          secretName: test-persistence-agent-config
      - name: elasticsearch-certs
        secret:
          defaultMode: 420
          optional: false
          secretName: fleet-server-agent-es-elastic-prod-es-prod-ca
      - name: fleetserver-certs-1
        secret:
          defaultMode: 420
          optional: false
          secretName: test-persistence-agent-fleetserver-ca
      - name: kibana-certs-0
        secret:
          defaultMode: 420
          optional: false
          secretName: test-persistence-agent-kibana-ca
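
Distilled from the full example above: the only essential change is supplying a volume named agent-data backed by the PVC. Because the Agent container mounts the volume with that name at /usr/share/elastic-agent/state, the PVC takes the place of the operator's default hostPath volume. A minimal sketch, keeping the claim name from the example above and eliding the rest of the Agent spec:

    apiVersion: agent.k8s.elastic.co/v1alpha1
    kind: Agent
    metadata:
      name: test-persistence
      namespace: elastic-prod
    spec:
      # ...version, mode, fleetServerRef, kibanaRef as above...
      deployment:
        replicas: 1
        strategy:
          type: Recreate  # avoids two Pods contending for the RWO claim during updates
        podTemplate:
          spec:
            containers:
            - name: agent
              volumeMounts:
              - name: agent-data  # must match the default volume name
                mountPath: /usr/share/elastic-agent/state
            volumes:
            - name: agent-data
              persistentVolumeClaim:
                claimName: test-persistence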
ghost commented 1 year ago

(adding to use-case)

I'm deploying an Agent in fleet server mode. Only want one instance of it, so a Deployment. I want to avoid losing state when it gets rescheduled on another node. Additionally, K8s nodes run with relatively small disks, and I'd prefer to avoid anything (with unknown sizing) using the node's own disk, to avoid unexpected disk-full states.

SpencerLN commented 1 year ago

It would be great if ECK natively supported a persistent storage deployment mode for Elastic Agent when you are pulling from external data sources (SaaS provider APIs, etc.). The above workaround of creating your own PVC/PV works, but it is a bit annoying to have to manage it differently from all of the other resources in ECK.