elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Error: preparing STATE_PATH(/usr/share/elastic-agent/state) failed: mkdir /usr/share/elastic-agent/state/data: permission denied #6193

Closed: mmentges closed this issue 1 year ago

mmentges commented 1 year ago

Bug Report

First of all, I'm not sure if this is a bug or if I broke something completely on my own. Please let me know if I have simply misunderstood the documentation. I think I ran into trouble with the hostPath volume, which was somehow working until I removed everything manually.

What did you do? Added an ECK Fleet Server Agent and an Elastic Agent to my config.yaml, and added a Kibana config with xpack.fleet.agents.elasticsearch.hosts, xpack.fleet.agents.fleet_server.hosts, xpack.fleet.packages, and xpack.fleet.agentPolicies.

What did you expect to see? A running ECK Fleet Server Agent Pod and an ECK Agent Pod on each worker node.

What did you see instead? Under which circumstances?

On the first try everything seemed to start fine, but then I wanted to modify the config a bit more, so I removed the agent instances via the OpenShift UI API explorer, removed all policies via the Kibana UI, and uninstalled the integrations.

After that, I tried to redeploy and had several issues. The agent policies were not created automatically, so I created them manually in the Kibana UI. After that, the agent pods were created and started again, but the Fleet Server Pod is failing on startup. I added the log below.

I tried several SCC settings for the service account. Right now it is set to hostmount-anyuid.

So there are actually two issues. The first one (Kibana not recreating the agent policies after a redeploy) I fixed manually; the second one is the startup failure of the fleet-server pod.
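
For reference, granting that SCC to a service account in OpenShift looks roughly like this (a sketch only; the elastic namespace and the fleet-server service account name are taken from the manifests below):

$ oc adm policy add-scc-to-user hostmount-anyuid -z fleet-server -n elastic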

Environment

$ kubectl version

Client Version: v1.25.0
Kustomize Version: v4.5.7
Server Version: v1.22.8+9e95cb9

apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
spec:
  version: 8.5.1
  count: 1
  elasticsearchRef:
    name: elasticsearch
  config:
    xpack.fleet.agents.elasticsearch.hosts: ["https://elasticsearch-es-http.elastic.svc:9200"]
    xpack.fleet.agents.fleet_server.hosts: ["https://fleet-server-agent-http.elastic.svc:8220"]
    xpack.fleet.packages:
    - name: system
      version: latest
    - name: elastic_agent
      version: latest
    - name: fleet_server
      version: latest
    - name: kubernetes
      version: latest
    xpack.fleet.agentPolicies:
    - name: Fleet Server on ECK policy
      id: eck-fleet-server
      namespace: default
      monitoring_enabled:
      - logs
      - metrics
      unenroll_timeout: 900
      is_default_fleet_server: true
      package_policies:
      - name: fleet_server-1
        id: fleet_server-1
        package:
          name: fleet_server
    - name: Elastic Agent on ECK policy
      id: eck-agent
      namespace: default
      monitoring_enabled:
      - logs
      - metrics
      unenroll_timeout: 900
      is_default: true
      package_policies:
      - package:
          name: system
        name: system-1
      - package:
          name: kubernetes
        name: kubernetes-1
---
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server
spec:
  version: 8.5.1
  kibanaRef:
    name: kibana
  elasticsearchRefs:
  - name: elasticsearch
  mode: fleet
  policyID: 8b9d06b0-6b6b-11ed-be9d-f7fb2b8e527b
  fleetServerEnabled: true
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: fleet-server
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
---
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata: 
  name: elastic-agent
spec:
  version: 8.5.1
  kibanaRef:
    name: kibana
  fleetServerRef: 
    name: fleet-server
  mode: fleet
  policyID: 14c54450-6bd2-11ed-99c4-8d1ea8b60d25
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        hostNetwork: true
        dnsPolicy: ClusterFirstWithHostNet
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fleet-server
rules:
- apiGroups: [""]
  resources:
  - pods
  - namespaces
  - nodes
  verbs:
  - get
  - watch
  - list
- apiGroups: ["coordination.k8s.io"]
  resources:
  - leases
  verbs:
  - get
  - create
  - update
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fleet-server
  namespace: elastic
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fleet-server
subjects:
- kind: ServiceAccount
  name: fleet-server
  namespace: elastic
roleRef:
  kind: ClusterRole
  name: fleet-server
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: elastic-agent
rules:
- apiGroups: [""]
  resources:
  - pods
  - nodes
  - namespaces
  - events
  - services
  - configmaps
  verbs:
  - get
  - watch
  - list
- apiGroups: ["coordination.k8s.io"]
  resources:
  - leases
  verbs:
  - get
  - create
  - update
- nonResourceURLs:
  - "/metrics"
  verbs:
  - get
- apiGroups: ["extensions"]
  resources:
    - replicasets
  verbs: 
  - "get"
  - "list"
  - "watch"
- apiGroups:
  - "apps"
  resources:
  - statefulsets
  - deployments
  - replicasets
  verbs:
  - "get"
  - "list"
  - "watch"
- apiGroups:
  - ""
  resources:
  - nodes/stats
  verbs:
  - get
- apiGroups:
  - "batch"
  resources:
  - jobs
  verbs:
  - "get"
  - "list"
  - "watch"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-agent
  namespace: elastic
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: elastic-agent
subjects:
- kind: ServiceAccount
  name: elastic-agent
  namespace: elastic
roleRef:
  kind: ClusterRole
  name: elastic-agent
  apiGroup: rbac.authorization.k8s.io
rigl1 commented 1 year ago

Hi, I have the same issue! k8s deployment on-premises. Error: preparing STATE_PATH(/usr/share/elastic-agent/state) failed: mkdir /usr/share/elastic-agent/state/data: permission denied

thbkrkr commented 1 year ago

Hello,

Yes, it is very likely related to the breaking change made in 2.4.

As BenB196 pointed out in #5993, we forgot to document the need to run the container as root when using the hostPath. An alternative to not running as root is to use an emptyDir volume.
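
For anyone landing on this issue later, here is a minimal sketch of that emptyDir alternative, based on the fleet-server manifest above (the agent-data volume name is the one ECK uses for the Agent state; treat the exact spec as an assumption and check the current documentation):

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server
spec:
  version: 8.5.1
  kibanaRef:
    name: kibana
  elasticsearchRefs:
  - name: elasticsearch
  mode: fleet
  fleetServerEnabled: true
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: fleet-server
        automountServiceAccountToken: true
        # no runAsUser: 0 needed when the state path is backed by an emptyDir
        volumes:
        - name: agent-data
          emptyDir: {}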

The documentation has now been updated to reflect this.

Apologies for the inconvenience.

Feel free to reopen if there is anything else.

barkbay commented 1 year ago

As BenB196 pointed out in https://github.com/elastic/cloud-on-k8s/issues/5993, we forgot to document the need to run the container as root when using the hostpath.

What I don't understand is that the user here was running the Pod as root:

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server
spec:
  version: 8.5.1
  kibanaRef:
    name: kibana
  elasticsearchRefs:
  - name: elasticsearch
  mode: fleet
  policyID: 8b9d06b0-6b6b-11ed-be9d-f7fb2b8e527b
  fleetServerEnabled: true
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: fleet-server
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
---
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata: 
  name: elastic-agent
spec:
  version: 8.5.1
  kibanaRef:
    name: kibana
  fleetServerRef: 
    name: fleet-server
  mode: fleet
  policyID: 14c54450-6bd2-11ed-99c4-8d1ea8b60d25
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        hostNetwork: true
        dnsPolicy: ClusterFirstWithHostNet
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
QuinnBast commented 2 months ago

Agreed. The user's manifest has runAsUser in it, so this isn't really solved?

I am also running my containers as root using the security context described above, but I'm still getting this issue on 8.15.0...

It seems the fix is that you also need the emptyDir volume applied to both the Fleet Agents AND the Fleet Server (while the recommended manifests on the Elastic website only show the emptyDir being present on the Fleet Agent):

        volumes:
          - name: agent-data
            emptyDir: { }
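
In other words (a sketch assuming the manifests from this issue), the same volume block has to appear under both Agent resources, not only under the DaemonSet one:

# fleet-server Agent (Deployment)
spec:
  deployment:
    podTemplate:
      spec:
        volumes:
        - name: agent-data
          emptyDir: {}

# elastic-agent Agent (DaemonSet)
spec:
  daemonSet:
    podTemplate:
      spec:
        volumes:
        - name: agent-data
          emptyDir: {}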

Though, I will say, I'm not sure about the consequences of this. emptyDir volumes are not persistent, so if one of my containers fails, is that going to be a problem?