jade-lucas opened 1 year ago
Probably a question for the @elastic/fleet team. I am assuming the agent identity is tied to the Pod name rather than the k8s host.
Anyone found a way to solve this? Found out about this issue today when evaluating ECK.
cc @elastic/elastic-agent as well - I think we've seen similar issues around IDs in containerized environments before, but I wasn't able to find a GitHub issue that summarized where we stand. Does anyone on the agent team know of anything that might be of interest here?
Fleet managed agents are stateful, and need to persist their Fleet API key and agent ID outside of the pod file system to prevent this.
I thought ECK did this by default by putting the agent state in a host path volume mounted from the node file system: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-elastic-agent-configuration-examples.html#k8s_storing_local_state_in_host_path_volume
This is the same thing our reference YAML does for deploying a managed Elastic agent manually: https://github.com/elastic/elastic-agent/pull/2550/files
Using a host path volume for this works consistently when the agent is a DaemonSet, because there is always exactly one agent pod per node and the state lives in each node's file system.
If you use another deployment type there is no natural affinity between the host path volume and the running agent (with a Deployment, the agent could end up on a different node every time).
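For reference, the host path state volume from the linked ECK configuration example looks roughly like this when spelled out in a DaemonSet agent spec (a sketch based on that example; the volume name and host path are illustrative):

```yaml
daemonSet:
  podTemplate:
    spec:
      containers:
      - name: agent
        volumeMounts:
        - name: agent-data
          # persists enrollment state across pod restarts
          mountPath: /usr/share/elastic-agent/state
      volumes:
      - name: agent-data
        hostPath:
          # one agent pod per node, so per-node state stays stable
          path: /var/lib/elastic-agent/default/elastic-agent/state
          type: DirectoryOrCreate
```

Because a DaemonSet pins one pod to each node, the recreated pod always lands on the same node and finds the previous state in this directory.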
I can confirm this issue is still present.
It can be easily reproduced by applying the manifest below and then deleting the agent pod. The recreated pod shows up as a new agent in the Kibana Fleet UI. The manifest is identical to the simplest quickstart example from the documentation, except that the emptyDir volume override is removed. The agent state therefore ends up on a host path volume, yet the issue still occurs.
Manifest:
```yaml
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server-quickstart
  namespace: default
spec:
  version: 8.14.3
  kibanaRef:
    name: kibana-quickstart
  elasticsearchRefs:
  - name: elasticsearch-quickstart
  mode: fleet
  fleetServerEnabled: true
  policyID: eck-fleet-server
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
---
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: elastic-agent-quickstart
  namespace: default
spec:
  version: 8.14.3
  kibanaRef:
    name: kibana-quickstart
  fleetServerRef:
    name: fleet-server-quickstart
  mode: fleet
  policyID: eck-agent
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana-quickstart
  namespace: default
spec:
  version: 8.14.3
  count: 1
  elasticsearchRef:
    name: elasticsearch-quickstart
  config:
    xpack.fleet.agents.elasticsearch.hosts: ["https://elasticsearch-quickstart-es-http.default.svc:9200"]
    xpack.fleet.agents.fleet_server.hosts: ["https://fleet-server-quickstart-agent-http.default.svc:8220"]
    xpack.fleet.packages:
    - name: system
      version: latest
    - name: elastic_agent
      version: latest
    - name: fleet_server
      version: latest
    xpack.fleet.agentPolicies:
    - name: Fleet Server on ECK policy
      id: eck-fleet-server
      namespace: default
      monitoring_enabled:
      - logs
      - metrics
      unenroll_timeout: 900
      package_policies:
      - name: fleet_server-1
        id: fleet_server-1
        package:
          name: fleet_server
    - name: Elastic Agent on ECK policy
      id: eck-agent
      namespace: default
      monitoring_enabled:
      - logs
      - metrics
      unenroll_timeout: 900
      package_policies:
      - name: system-1
        id: system-1
        package:
          name: system
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-quickstart
  namespace: default
spec:
  version: 8.14.3
  nodeSets:
  - name: default
    count: 3
    config:
      node.store.allow_mmap: false
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: elastic-agent
rules:
- apiGroups: [""] # "" indicates the core API group
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
  - watch
  - list
- apiGroups: ["coordination.k8s.io"]
  resources:
  - leases
  verbs:
  - get
  - create
  - update
- apiGroups: ["apps"]
  resources:
  - replicasets
  verbs:
  - list
  - watch
- apiGroups: ["batch"]
  resources:
  - jobs
  verbs:
  - list
  - watch
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-agent
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: elastic-agent
subjects:
- kind: ServiceAccount
  name: elastic-agent
  namespace: default
roleRef:
  kind: ClusterRole
  name: elastic-agent
  apiGroup: rbac.authorization.k8s.io
```
Examining the pod (`kubectl get pod elastic-agent-quickstart-agent-2x6rv -o yaml`) shows the host path volume mount:
```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    agent.k8s.elastic.co/config-hash: "777581219"
  creationTimestamp: "2024-07-19T18:40:48Z"
  generateName: elastic-agent-quickstart-agent-
  labels:
    agent.k8s.elastic.co/name: elastic-agent-quickstart
    agent.k8s.elastic.co/version: 8.14.3
    common.k8s.elastic.co/type: agent
    controller-revision-hash: 6f567dd69d
    pod-template-generation: "1"
  name: elastic-agent-quickstart-agent-2x6rv
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: elastic-agent-quickstart-agent
    uid: b53e17d1-a4d3-4163-b66a-6250524a387a
  resourceVersion: "1952"
  uid: 1b2cfba5-dd27-4085-8eab-3c5205b9824b
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - minikube
  automountServiceAccountToken: true
  containers:
  - command:
    - /usr/bin/env
    - bash
    - -c
    - |
      #!/usr/bin/env bash
      set -e
      if [[ -f /mnt/elastic-internal/elasticsearch-association/default/elasticsearch-quickstart/certs/ca.crt ]]; then
        if [[ -f /usr/bin/update-ca-trust ]]; then
          cp /mnt/elastic-internal/elasticsearch-association/default/elasticsearch-quickstart/certs/ca.crt /etc/pki/ca-trust/source/anchors/
          /usr/bin/update-ca-trust
        elif [[ -f /usr/sbin/update-ca-certificates ]]; then
          cp /mnt/elastic-internal/elasticsearch-association/default/elasticsearch-quickstart/certs/ca.crt /usr/local/share/ca-certificates/
          /usr/sbin/update-ca-certificates
        fi
      fi
      /usr/bin/tini -- /usr/local/bin/docker-entrypoint -e
    env:
    - name: FLEET_CA
      value: /mnt/elastic-internal/fleetserver-association/default/fleet-server-quickstart/certs/ca.crt
    - name: FLEET_ENROLL
      value: "true"
    - name: FLEET_ENROLLMENT_TOKEN
      valueFrom:
        secretKeyRef:
          key: FLEET_ENROLLMENT_TOKEN
          name: elastic-agent-quickstart-agent-envvars
          optional: false
    - name: FLEET_URL
      value: https://fleet-server-quickstart-agent-http.default.svc:8220
    - name: CONFIG_PATH
      value: /usr/share/elastic-agent
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    image: docker.elastic.co/beats/elastic-agent:8.14.3
    imagePullPolicy: IfNotPresent
    name: agent
    resources:
      limits:
        cpu: 200m
        memory: 1Gi
      requests:
        cpu: 200m
        memory: 1Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /usr/share/elastic-agent/state
      name: agent-data
    - mountPath: /etc/agent.yml
      name: config
      readOnly: true
      subPath: agent.yml
    - mountPath: /mnt/elastic-internal/elasticsearch-association/default/elasticsearch-quickstart/certs
      name: elasticsearch-certs
      readOnly: true
    - mountPath: /mnt/elastic-internal/fleetserver-association/default/fleet-server-quickstart/certs
      name: fleetserver-certs-1
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-l6br6
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: minikube
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    runAsUser: 0
  serviceAccount: elastic-agent
  serviceAccountName: elastic-agent
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - hostPath:
      path: /var/lib/elastic-agent/default/elastic-agent-quickstart/state
      type: DirectoryOrCreate
    name: agent-data
  - name: config
    secret:
      defaultMode: 288
      optional: false
      secretName: elastic-agent-quickstart-agent-config
  - name: elasticsearch-certs
    secret:
      defaultMode: 420
      optional: false
      secretName: fleet-server-quickstart-agent-es-default-elasticsearch-quickstart-ca
  - name: fleetserver-certs-1
    secret:
      defaultMode: 420
      optional: false
      secretName: elastic-agent-quickstart-agent-fleetserver-ca
  - name: kube-api-access-l6br6
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-07-19T18:40:50Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-07-19T18:40:48Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-07-19T18:40:50Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-07-19T18:40:50Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-07-19T18:40:48Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://00de922573eb6e4c1950efd4045865f58ed2f5df62d1accaf31b093bf9fe58b3
    image: docker.elastic.co/beats/elastic-agent:8.14.3
    imageID: docker-pullable://docker.elastic.co/beats/elastic-agent@sha256:78d39a9b321ff8cfd48bbe01d7439a38517958579f172f5bd19b4caa98ca074a
    lastState: {}
    name: agent
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-07-19T18:40:49Z"
  hostIP: 192.168.49.2
  hostIPs:
  - ip: 192.168.49.2
  phase: Running
  podIP: 10.244.0.10
  podIPs:
  - ip: 10.244.0.10
  qosClass: Guaranteed
  startTime: "2024-07-19T18:40:48Z"
```
I think I found the issue and also a workaround/fix: I believe the Fleet API key and agent ID are stored in the `fleet.enc` file (difficult to verify because it is encrypted), which lives by default in `/usr/share/elastic-agent` and not in the persistent `state` subdirectory.
Setting the `CONFIG_PATH` env var to a persistent volume seems to fix the problem. This changes the paths of the `fleet.enc` and `elastic-agent.yml` files and of the `vault` directory. I think the latter two also need to be persistent, so this seems perfect. (`elastic-agent.yml` is modified to enable Fleet, and I'd guess the vault is used to decrypt the encrypted files.)
@cmacknz The reference YAML in the agent repo doesn't set this env var either, so I would assume it has the same issue as well.
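Spelled out, the workaround amounts to pointing `CONFIG_PATH` at the mounted state volume in the Agent podTemplate, roughly like this (a sketch of my setup, not an officially documented configuration; the container name matches the one ECK generates):

```yaml
daemonSet:
  podTemplate:
    spec:
      containers:
      - name: agent
        env:
        # Relocate fleet.enc, elastic-agent.yml and the vault directory
        # onto the persistent host path volume mounted at this path,
        # so the agent ID survives pod recreation.
        - name: CONFIG_PATH
          value: /usr/share/elastic-agent/state
```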
Thanks, the `fleet.enc` file should probably just be moved into the state path of the container to simplify the amount of configuration that needs to happen.
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Bug Report
What did you do?
Installed elastic agents on a Kubernetes cluster via DaemonSet as directed by the Fleet UI. Every time an elastic-agent pod is restarted, the agent ID changes, which I think results in many offline agents showing in the Fleet UI over time.
What did you expect to see? The agent to retain its ID, so that only one agent appears in the Fleet UI.
What did you see instead? Under which circumstances? After every pod restart, a new agent appears, even though it is on the same host.
Kubernetes information: