Open ch9hn opened 11 months ago
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Hello @elastic/elastic-agent, is there any deadline for fixing this issue?
This isn't prioritized yet, but it is definitely annoying and has wasted some time even for people inside Elastic. CC @pierrehilbert
Hello, I am affected by this bug and it is not a minor issue, so it should have high priority.
Hello everyone, we have the same issue on production nodes, and it blocks our Elastic Agent tests. The workaround of cleaning up the state folder works, but it still feels very unprofessional.
Hello folks. I have come across this issue and reproduced it with support. It has consumed a few days of digging into the root cause, and I believe we should give it some attention.
My idea: for Fleet-managed Elastic Agents it would make sense to have an "Elastic State Cleanup" button, so this is easy to handle from the Kibana UI. Manually resetting the state on more than 60 deployed Elastic Agents drives the admins mad.
Thanks @ch9hn, I just spent a day trying to figure out why the Kubernetes elastic agents were working yesterday and not today after I updated the cluster. I uninstalled the agents and installed the previous version with no luck. I finally saw there were some files on the machine in /var/lib/elastic-agent-managed/kube-system/state and realised the install manifest had mounted that path and saved connection details there.
Solution: I deleted the whole elastic-agent-managed folder on each machine and reinstalled the Kubernetes agent manifest file and it worked.
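If SSH access to every node is not an option, the same per-node deletion can be rolled out as a one-shot privileged DaemonSet. This is only a sketch; the resource names and label are assumptions, not from the official manifest, and the pods must be deleted once every node has been visited:

```yaml
# Hypothetical one-shot cleanup DaemonSet: each pod wipes the persisted
# agent state on its node, then idles so the DaemonSet can be removed
# after all nodes have been cleaned.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-agent-state-cleanup   # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: elastic-agent-state-cleanup
  template:
    metadata:
      labels:
        app: elastic-agent-state-cleanup
    spec:
      containers:
        - name: cleanup
          image: busybox
          command: ["sh", "-c", "rm -rf /host-state/* && while true; do sleep 3600; done"]
          volumeMounts:
            - name: agent-state
              mountPath: /host-state
      volumes:
        - name: agent-state
          hostPath:
            path: /var/lib/elastic-agent-managed/kube-system/state
            type: DirectoryOrCreate
```

Scale the agent DaemonSet down first (or delete its pods afterwards) so the agent does not re-create state while the cleanup runs.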
@elasticmachine Elastic team please fix this basic bug!! hours wasted!
After hours of investigating and reinstalling elastic-agent on kube, trying to understand why I had the message "Failed to connect to backoff(elasticsearch(https://244b20202ef45ddb481e55df6b19f4.eu-west-3.aws.elastic-cloud.com:443", I searched for this error message on the 'elastic-cloud' site but found no document about it.
While searching, I realized that something was being stored persistently, so my suspicions fell on the /usr/share/elastic-agent/state directory.
Indeed, in the manifest generated by elastic-cloud, we can see that it mounts the directory /var/lib/elastic-agent-managed/kube-system/state from the kube node:
```yaml
- name: elastic-agent-state
  hostPath:
    path: /var/lib/elastic-agent-managed/kube-system/state
    type: DirectoryOrCreate
```
Personally, I don't like the fact that they use the node's disk space to store persistent data.
But here is my solution to correct the problem:
```shell
# Remove the persisted agent directory inside each agent pod...
for pod in $(kubectl -n kube-system get pod --no-headers -o custom-columns=":metadata.name" -l app=elastic-agent); do
  kubectl -n kube-system exec "$pod" -- rm -rf /usr/share/elastic-agent
done
# ...then delete the pods so the DaemonSet recreates them with a clean state.
# (The original second loop had a stray ';' after 'do', which is a syntax error.)
for pod in $(kubectl -n kube-system get pod --no-headers -o custom-columns=":metadata.name" -l app=elastic-agent); do
  kubectl -n kube-system delete pod "$pod"
done
```
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Quick recap of the discussion surrounding this item from our weekly meeting:
ELASTIC_AGENT_AUTO_REENROLL
Any updates on this, especially regarding priority? We are currently in discussion with our managed k8s cluster operator team so that they implement a small operator (taking the Fleet URL and the enrollment token), as the actual cluster users are not allowed to run any kind of privileged pods on their own. Before figuring out workarounds that have only minimal impact, e.g. on shipping logs / log duplication, I would like to clarify if and how we can prioritize this upstream, or whether we can contribute.
Another workaround that persists the registry to prevent log duplication:
```yaml
<...>
        volumeMounts:
          - name: elastic-agent-state
            mountPath: /usr/share/elastic-agent/state/data
<...>
      volumes:
        - name: elastic-agent-state
          hostPath:
            path: /var/lib/elastic-agent-managed/monitoring/state/data
            type: DirectoryOrCreate
<...>
```
Based on what @mag-mkorn commented, we'll test and use the following as an init container script (a base alpine image is sufficient):

```sh
#!/usr/bin/env sh
set -eu
# Enable pipefail if the shell supports it; disregard if unsupported.
# (A bare `(set -o pipefail) && set -o pipefail` would abort under `set -e`
# on shells without pipefail, so guard it with `if`.)
# shellcheck disable=SC3040
if (set -o pipefail 2>/dev/null); then set -o pipefail; fi

STATE_DIRECTORY=/usr/share/elastic-agent/state
DATA_DIRECTORY=${STATE_DIRECTORY}/data
HASH_FILE=${STATE_DIRECTORY}/.env-hash
# Fingerprint of the current enrollment settings; the ${VAR?} expansions
# fail fast if either variable is unset.
HASH_TARGET="$(printf "%s\0%s" "${FLEET_URL?}" "${FLEET_ENROLLMENT_TOKEN?}" | sha256sum -)"

# Delete all state files except those under the data directory.
prune_state() {
  find "${STATE_DIRECTORY}" -path "${DATA_DIRECTORY}" -prune -o -type f -print0 |
    xargs -0 -r rm -v # -r (skip rm on empty input) also works with busybox xargs
}

save_hash() {
  echo "Saving target hash into $HASH_FILE."
  printf "%s" "$HASH_TARGET" >"$HASH_FILE"
}

if [ -f "$HASH_FILE" ]; then
  echo "Existing hash found, comparing..."
  # Compare the saved hash to the target value
  HASH_CURRENT="$(cat "$HASH_FILE")"
  if [ "$HASH_TARGET" = "$HASH_CURRENT" ]; then
    echo "No change detected, no cleanup required."
  else
    echo "Existing hash does not match target hash. Pruning state files, excluding the data dir..."
    prune_state
    save_hash
  fi
else
  save_hash
fi
```
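To sanity-check the prune behaviour outside a cluster, here is a small self-contained demo of the same find/xargs pattern, run against a throwaway directory instead of /usr/share/elastic-agent/state (the file names are made up):

```shell
#!/usr/bin/env sh
# Demo: prune everything under the state dir EXCEPT the data dir.
set -eu
STATE_DIRECTORY="$(mktemp -d)"
DATA_DIRECTORY="${STATE_DIRECTORY}/data"
mkdir -p "${DATA_DIRECTORY}"
touch "${STATE_DIRECTORY}/fleet.enc" "${DATA_DIRECTORY}/registry.json"
# Same pattern as the init script: -prune skips the data dir subtree.
find "${STATE_DIRECTORY}" -path "${DATA_DIRECTORY}" -prune -o -type f -print0 |
  xargs -0 -r rm
# fleet.enc is gone; the registry under data/ survives.
test ! -e "${STATE_DIRECTORY}/fleet.enc"
test -e "${DATA_DIRECTORY}/registry.json"
echo "prune demo OK"
```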
Just a note here that the issue also affects attempts to change the target Elastic cluster, not just token changes. With the added wrinkle that the agent takes control of the state directory and disallows deleting it, so you're stuck unless you create an init container to clear the state.
When a new enrollment token is set via env or envFrom in the Kubernetes manifest, the new token is not picked up by Elastic Agent. The reason is probably that Elastic Agent saves its state locally on every Kubernetes node and does not update the token there. This leads to Unauthorized errors on the agent; a redeploy with a new token is no longer possible.
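This root cause (persisted state versus a rotated token) is why the hash-comparison approach from the init container works. A minimal, self-contained sketch of the detection idea, with made-up URL and token values:

```shell
#!/usr/bin/env sh
# Sketch: fingerprint the enrollment settings so a rotated token can be
# detected across restarts. All values below are hypothetical.
set -eu
FLEET_URL="https://fleet.example.com:443"
OLD_HASH="$(printf "%s\0%s" "$FLEET_URL" "token-A" | sha256sum -)"
NEW_HASH="$(printf "%s\0%s" "$FLEET_URL" "token-B" | sha256sum -)"
if [ "$OLD_HASH" != "$NEW_HASH" ]; then
  echo "enrollment changed: prune the state before re-enrolling"
fi
```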
For confirmed bugs, please report:
Version: 8.10
Operating System: Ubuntu Linux / Kubernetes 1.27
Discuss Forum URL:
Steps to Reproduce:
Error logs:
"Failed to connect to backoff(elasticsearch(https://xxxx.xxxx.cloud.es.io:443)): 401 Unauthorized: {\"error\":{\"root_cause\":[{\"type\":\"security_exception\",\"reason\":\"unable to authenticate with provided credentials and anonymous access is not allowed for this request\",\"additional_unsuccessful_credentials\":\"API key: api key [xxxxxxx] has been invalidated\",\"header\":{\"WWW-Authenticate\":[\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\",\"Bearer
How to fix it temporarily: when using a Kustomize deployment, the hostPath can be overridden quite easily with the following DaemonSet patch:
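As a sketch of such a patch (the resource name and the new path are assumptions): pointing the state volume at a fresh hostPath means a re-enrollment never reuses the stale state from the old path.

```yaml
# Hypothetical Kustomize patch for the agent DaemonSet.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-agent
  namespace: kube-system
spec:
  template:
    spec:
      volumes:
        - name: elastic-agent-state
          hostPath:
            # Bump this path (e.g. per enrollment) so stale state is not reused.
            path: /var/lib/elastic-agent-managed/kube-system/state-v2
            type: DirectoryOrCreate
```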