elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Elastic Agent doesn't update the enrollment token in Kubernetes Deployment statefile #3586

Open ch9hn opened 11 months ago

ch9hn commented 11 months ago

When a new enrollment token is supplied as env or envFrom in the Kubernetes manifest, the new token is not picked up by Elastic Agent. The likely reason is that Elastic Agent saves its state locally on every Kubernetes node and does not update the token there. This leads to Unauthorized errors on the agent, and a redeploy with a new token is no longer possible.

Steps to reproduce:

  1. Install Elastic Agent on a Kubernetes cluster as described in the docs with enrollment token "ABC"
  2. Expire "ABC" and add a new token "DEF"
  3. Restart the Elastic Agent DaemonSet
  4. Result: the old token "ABC" is persisted and used for communication with the Elastic Fleet Server
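For context, the reference manifest injects the token as a plain environment variable while persisting the agent state on the node through a hostPath volume. Below is a trimmed sketch of the relevant DaemonSet parts (values are placeholders; the paths match the standard kube-system deployment quoted later in this thread):

      containers:
        - name: elastic-agent
          env:
            - name: FLEET_URL
              value: "https://fleet.example.com:443"   # placeholder
            - name: FLEET_ENROLLMENT_TOKEN
              value: "ABC"                             # updating only this value is not picked up
          volumeMounts:
            - name: elastic-agent-state
              mountPath: /usr/share/elastic-agent/state
      volumes:
        - name: elastic-agent-state
          hostPath:
            path: /var/lib/elastic-agent-managed/kube-system/state   # enrollment info persists here across pod restarts
            type: DirectoryOrCreate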

Error logs:

"Failed to connect to backoff(elasticsearch(https://xxxx.xxxx.cloud.es.io:443)): 401 Unauthorized: {\"error\":{\"root_cause\":[{\"type\":\"security_exception\",\"reason\":\"unable to authenticate with provided credentials and anonymous access is not allowed for this request\",\"additional_unsuccessful_credentials\":\"API key: api key [xxxxxxx] has been invalidated\",\"header\":{\"WWW-Authenticate\":[\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\",\"Bearer

Temporary fix: when using a Kustomize deployment, the hostPath can be overridden quite easily with the following DaemonSet patch:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: elastic-agent
  template:
    metadata:
      labels:
        app: elastic-agent
    spec:
      containers:
        - name: elastic-agent
          env:
            - name: FLEET_URL
              $patch: delete
            - name: FLEET_ENROLLMENT_TOKEN
              $patch: delete
            - name: FLEET_INSECURE
              value: "false"
            - name: KIBANA_HOST
              $patch: delete
            - name: KIBANA_FLEET_USERNAME
              $patch: delete
            - name: KIBANA_FLEET_PASSWORD
              $patch: delete
          envFrom:
            - secretRef:
                name: elastic-agent-token
          volumeMounts:
            - name: elastic-agent-state
              mountPath: /usr/share/elastic-agent/state
      volumes:
        - name: elastic-agent-state
          hostPath:
            # Change the path here to match your deployment namespace, or use another name
            path: /var/lib/elastic-agent-managed/monitoring/state
            type: DirectoryOrCreate
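If you go this route, the patch above can be applied with a kustomization roughly like the following (a sketch; the file names are illustrative, and the elastic-agent-token Secret is assumed to carry the FLEET_URL and FLEET_ENROLLMENT_TOKEN keys consumed via envFrom):

# kustomization.yaml (illustrative file names)
resources:
  - elastic-agent-managed-kubernetes.yaml      # upstream manifest from the docs
patches:
  - path: elastic-agent-daemonset-patch.yaml   # the DaemonSet overwrite shown above
    target:
      kind: DaemonSet
      name: elastic-agent
      namespace: kube-system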
elasticmachine commented 11 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

shubhu934 commented 10 months ago

Hello @elastic/elastic-agent, is there any timeline for a fix for this issue?

cmacknz commented 10 months ago

This isn't prioritized yet, but it is definitely annoying and has wasted some time even for people inside Elastic. CC @pierrehilbert

HGS9761 commented 10 months ago

Hello, I am affected by this bug and it is not a minor issue, so it should have high priority.

cgi-gerlando-caldara commented 7 months ago

Hello everyone, we are hitting the same issue on production nodes, and it is blocking our Elastic Agent tests. The workaround of cleaning up the state folder works, but it still feels very unpolished.

rafaelbattesti commented 7 months ago

Hello folks. I have come across this issue and reproduced it with support. This has consumed a few days of digging into the root cause, and I believe we should give it some attention.

cgi-gerlando-caldara commented 7 months ago

My idea: for Fleet-managed Elastic Agents it would make sense to have an "Elastic State Cleanup" button, so this can be handled easily from the Kibana UI. Manually resetting the state on more than 60 deployed Elastic Agents drives the admins mad.

neu7ron2 commented 6 months ago

Thanks @ch9hn, I just spent a day trying to figure out why the Kubernetes Elastic Agents were working yesterday and not today after I updated the cluster. I uninstalled the agents and installed the previous version with no luck. I finally saw there were some files on the machine in /var/lib/elastic-agent-managed/kube-system/state and realised the install manifest files had mounted and saved connection details.

Solution: I deleted the whole elastic-agent-managed folder on each machine, reapplied the Kubernetes agent manifest file, and it worked.

@elasticmachine Elastic team please fix this basic bug!! hours wasted!

badele commented 5 months ago

I spent hours investigating and reinstalling elastic-agent on Kubernetes, trying to understand why I was getting the message "Failed to connect to backoff(elasticsearch(https://244b20202ef45ddb481e55df6b19f4.eu-west-3.aws.elastic-cloud.com:443"

I searched for this error message on the 'elastic-cloud' site but found no documentation ...

On searching, I realized that something was stored persistently, so my suspicions fell on the /usr/share/elastic-agent/state directory.

Indeed, in the manifest generated by elastic-cloud, we can see that it mounts the /var/lib/elastic-agent-managed/kube-system/state directory from the Kubernetes node:

        - name: elastic-agent-state
          hostPath:
            path: /var/lib/elastic-agent-managed/kube-system/state
            type: DirectoryOrCreate

Personally, I don't like the fact that they use the node's disk space to store persistence data.

But here's my solution to correct the problem

# Wipe /usr/share/elastic-agent (including the host-mounted state) in each elastic-agent pod
for pod in $(kubectl -n kube-system get pod --no-headers -o custom-columns=":metadata.name" -l app=elastic-agent); do kubectl -n kube-system exec $pod -- rm -rf /usr/share/elastic-agent ; done
# Delete the pods so the DaemonSet recreates them and they re-enroll with the current token
for pod in $(kubectl -n kube-system get pod --no-headers -o custom-columns=":metadata.name" -l app=elastic-agent); do kubectl -n kube-system delete pods $pod ; done
elasticmachine commented 4 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

michel-laterman commented 4 months ago

Quick recap of the discussion surrounding this item from our weekly meeting:

chrko commented 2 months ago

Any updates on this, especially regarding priority? We are currently in discussion with our managed k8s cluster operator team about having them implement a small operator (taking the Fleet URL and the enrollment token), since the actual cluster users are not allowed to run any kind of privileged pods on their own. Before figuring out workarounds that have only minimal impact, e.g. on shipping logs / log duplication, I would like to clarify if and how we can prioritize this upstream, or perhaps also contribute.

mag-mkorn commented 2 months ago

Another workaround that persists only the registry (the state/data directory) to prevent log duplication, while leaving the rest of the state, including the enrollment info, ephemeral so a new token is picked up on restart:

<...>
          volumeMounts:
            - name: elastic-agent-state
              mountPath: /usr/share/elastic-agent/state/data
<...>
      volumes:
        - name: elastic-agent-state
          hostPath:
            path: /var/lib/elastic-agent-managed/monitoring/state/data
            type: DirectoryOrCreate
<...>
chrko commented 2 months ago

Based on what @mag-mkorn commented, we'll test and use the following as an init container script (a base alpine image is sufficient):

#!/usr/bin/env sh

set -eu

# Set pipefail if it works in a subshell, disregard if unsupported
# shellcheck disable=SC3040
(set -o pipefail 2>/dev/null) && set -o pipefail

STATE_DIRECTORY=/usr/share/elastic-agent/state
DATA_DIRECTORY=${STATE_DIRECTORY}/data

HASH_FILE=${STATE_DIRECTORY}/.env-hash
HASH_TARGET="$(printf "%s\0%s" "${FLEET_URL?}" "${FLEET_ENROLLMENT_TOKEN?}" | sha256sum -)"

prune_state() {
  find "${STATE_DIRECTORY}" -path "${DATA_DIRECTORY}" -prune -o -type f -print0 |
    xargs -0 --no-run-if-empty rm -v
}

save_hash() {
  echo "Save target hash into $HASH_FILE."
  printf "%s" "$HASH_TARGET" >"$HASH_FILE"
}

if [ -f "$HASH_FILE" ]; then
  echo "Existing hash found, comparing..."
  # cmp saved hash to target value
  HASH_CURRENT="$(cat "$HASH_FILE")"
  if [ "$HASH_TARGET" = "$HASH_CURRENT" ]; then
    echo "Not change detected, no cleanup required."
  else
    echo "Existing hash do not match target hash. Pruning files without data dir..."
    prune_state
    save_hash
  fi
else
  save_hash
fi
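One way to wire this in (a sketch, not from this thread: the ConfigMap name, script path, and image tag are assumptions) is to ship the script in a ConfigMap and run it as an init container that mounts the same state hostPath and receives the same Fleet variables as the agent container:

      initContainers:
        - name: prune-stale-state
          image: alpine:3.19                       # any base alpine image works
          command: ["sh", "/scripts/prune-state.sh"]
          envFrom:
            - secretRef:
                name: elastic-agent-token          # must provide FLEET_URL and FLEET_ENROLLMENT_TOKEN
          volumeMounts:
            - name: elastic-agent-state
              mountPath: /usr/share/elastic-agent/state
            - name: prune-state-script
              mountPath: /scripts
      volumes:
        - name: prune-state-script
          configMap:
            name: elastic-agent-prune-state        # ConfigMap holding prune-state.sh

If the Fleet settings are supplied as plain env vars rather than a Secret, pass FLEET_URL and FLEET_ENROLLMENT_TOKEN to the init container the same way they are passed to the agent container.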
mgaruccio commented 1 month ago

Just a note here that the issue also impacts attempts to change the target Elastic cluster, not just token changes, with the added wrinkle that the agent takes ownership of the state directory and disallows deleting it, so you're stuck unless you create an init container to clear the state.