elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Potentially chown Elastic Agent hostpath data directory #6239

Open · naemono opened 1 year ago

naemono commented 1 year ago

A number of issues/PRs concern this problem: #5993, #6147, #6205, #6193.

The following is required when running Elastic Agent with a hostPath:

    podTemplate:
      spec:
        containers:
          - name: agent
            securityContext:
              runAsUser: 0

If not, you get this error:

Error: preparing STATE_PATH(/usr/share/elastic-agent/state) failed: mkdir /usr/share/elastic-agent/state/data: permission denied
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.5/fleet-troubleshooting.html

An initContainer that does the following allows Elastic Agent to work properly without the agent container itself running as runAsUser: 0:

      initContainers:
      - command:
        - sh
        - -c
        - chown 1000:1000 /usr/share/elastic-agent/state
        image: docker.elastic.co/beats/elastic-agent:8.5.0
        imagePullPolicy: IfNotPresent
        name: permissions
        securityContext:
          runAsUser: 0

This is more complicated in a situation such as OpenShift, where UIDs are randomized, but it is likely doable.

So the question is: do we pursue this path to make the UX for Elastic Agent more consistent between emptyDir and hostPath?
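For completeness, the hostPath default can also be sidestepped by overriding the data volume with an emptyDir. A minimal sketch, assuming the ECK-managed volume is named agent-data (verify against the generated DaemonSet):

      daemonSet:
        podTemplate:
          spec:
            volumes:
            - name: agent-data # assumed ECK volume name; verify in the generated DaemonSet
              emptyDir: {}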


naemono commented 1 year ago

After discussion, we've decided to take the approach of using an init container to make this user experience better. Since the GID in OpenShift is known, we'll take this approach:

      initContainers:
      - command:
        - sh
        - -c
        - chmod g+w /usr/share/elastic-agent/state && chgrp 1000 /usr/share/elastic-agent/state
        image: docker.elastic.co/beats/elastic-agent:8.5.0
        imagePullPolicy: IfNotPresent
        name: permissions
        securityContext:
          runAsUser: 0
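
Wired into a full Agent resource, the workaround would look roughly like this (a sketch; the Agent name and elasticsearchRefs are placeholders, and ECK is assumed to mount the state volume into user-declared init containers as it does for the agent container):

    apiVersion: agent.k8s.elastic.co/v1alpha1
    kind: Agent
    metadata:
      name: my-agent # placeholder
    spec:
      version: 8.5.0
      elasticsearchRefs:
      - name: my-elasticsearch # placeholder
      daemonSet:
        podTemplate:
          spec:
            initContainers:
            - name: permissions
              image: docker.elastic.co/beats/elastic-agent:8.5.0
              command:
              - sh
              - -c
              - chmod g+w /usr/share/elastic-agent/state && chgrp 1000 /usr/share/elastic-agent/state
              securityContext:
                runAsUser: 0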
brsolomon-deloitte commented 1 year ago

Also related: https://github.com/elastic/cloud-on-k8s/issues/6280

gittihub123 commented 1 year ago

Hi @naemono, I have been stuck on this issue for a couple of days and can't get it working. We are using OpenShift 4.12 and Argo CD with the Elastic operator on OpenShift.

I followed the official ECK 2.6 documentation and created the required resources.

Worth mentioning is that we have implemented the Compliance Operator and used the CIS profile to harden the platform.

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server-dev
  namespace: elastic-dev
spec:
  version: 8.6.1
  kibanaRef:
    name: kibanadev
  elasticsearchRefs:
  - name: esdev01
  mode: fleet
  fleetServerEnabled: true
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
---
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: elastic-agent-dev
  namespace: elastic-dev
spec:
  version: 8.6.1
  kibanaRef:
    name: kibanadev
  fleetServerRef:
    name: fleet-server-dev
  mode: fleet
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: elastic-agent
rules:
- apiGroups: [""] # "" indicates the core API group
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
  - watch
  - list
- apiGroups: ["coordination.k8s.io"]
  resources:
  - leases
  verbs:
  - get
  - create
  - update
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-agent
  namespace: elastic-dev
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: elastic-agent
subjects:
- kind: ServiceAccount
  name: elastic-agent
  namespace: elastic-dev
roleRef:
  kind: Role
  name: elastic-agent
  apiGroup: rbac.authorization.k8s.io

The RoleBinding granting the privileged SCC:

Name:         elastic-agent-rb
Labels:       <none>
Annotations:  <none>
Role:
  Kind:  ClusterRole
  Name:  system:openshift:scc:privileged
Subjects:
  Kind            Name           Namespace
  ----            ----           ---------
  ServiceAccount  elastic-agent  elastic-dev

The hostPath directory is created on the physical machine, but we are still getting permission denied:

Error: preparing STATE_PATH(/usr/share/elastic-agent/state) failed: mkdir /usr/share/elastic-agent/state/data: permission denied
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.6/fleet-troubleshooting.html
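
A quick way to see what the agent is actually hitting is to inspect ownership and the SELinux context of the hostPath directory on the node. A sketch, assuming a /var/lib/elastic-agent location (verify the hostPath in the generated DaemonSet):

oc debug node/<node-name> -- chroot /host ls -ldZ /var/lib/elastic-agent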
naemono commented 1 year ago

@gittihub123 I'll investigate this and get back to you.

naemono commented 1 year ago

@gittihub123 The below appears to be required in the case of OpenShift:

  deployment:
    replicas: 1
    podTemplate:
      spec:
        containers:
        - name: agent
          securityContext:
            privileged: true # <== this is the piece that's required on OpenShift
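
A privileged container also requires the pod's service account to be allowed the privileged SCC. Based on the manifests above, that would be something like:

oc adm policy add-scc-to-user privileged -z elastic-agent -n elastic-dev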
gittihub123 commented 1 year ago

Hi @naemono, this does not work on our OpenShift cluster because SELinux blocks it from creating files on the host filesystem.

The same applies when I try to create a standalone Filebeat instance with this configuration.

# CRD to create beats with ECK (Pod(s))
apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: panos-filebeat
  namespace: elastic-dev
spec:
  type: filebeat
  version: 8.6.1
  elasticsearchRef:
    name: esdev
  kibanaRef:
    name: kibanadev
  config:
    filebeat.modules:
    - module: panw
      panos:
        enabled: true
        var.syslog_host: 0.0.0.0
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: filebeat
        automountServiceAccountToken: true
        securityContext:
          privileged: true

Error message:

one or more objects failed to apply, reason: admission webhook "elastic-beat-validation-v1beta1.k8s.elastic.co" denied the request: Beat.beat.k8s.elastic.co "panos-filebeat" is invalid: privileged: Invalid value: "privileged": privileged field found in the kubectl.kubernetes.io/last-applied-configuration annotation is unknown. This is often due to incorrect indentation in the manifest.
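
The webhook rejects this because privileged is only valid in a container-level securityContext, not in the pod-level one. Moving it under the container, as naemono's next comment shows, should pass validation:

  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: filebeat
        automountServiceAccountToken: true
        containers:
        - name: filebeat
          securityContext:
            runAsUser: 0
            privileged: true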
naemono commented 1 year ago

@gittihub123 Running Agent and/or Beat in an OpenShift environment has many more complexities than running in a standard Kubernetes environment. We document these issues here. We also have some Beats recipes, exercised regularly by our e2e tests, here. I just successfully deployed this Beat recipe on an OpenShift 4.9 cluster, following our documentation noted above, specifically:

oc adm policy add-scc-to-user privileged -z filebeat -n elastic

Then I applied this manifest, which worked after a bit of time (Beat pods crash once or twice while users/API keys are being propagated throughout the Elastic Stack):

apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: filebeat
spec:
  type: filebeat
  version: 8.6.0
  elasticsearchRef:
    name: testing
  kibanaRef:
    name: kibana
  config:
    filebeat.autodiscover.providers:
    - node: ${NODE_NAME}
      type: kubernetes
      hints.default_config.enabled: "false"
      templates:
      - condition.equals.kubernetes.namespace: log-namespace
        config:
        - paths: ["/var/log/containers/*${data.kubernetes.container.id}.log"]
          type: container
      - condition.equals.kubernetes.labels.log-label: "true"
        config:
        - paths: ["/var/log/containers/*${data.kubernetes.container.id}.log"]
          type: container
    processors:
    - add_cloud_metadata: {}
    - add_host_metadata: {}
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: filebeat
        automountServiceAccountToken: true
        terminationGracePeriodSeconds: 30
        # dnsPolicy: ClusterFirstWithHostNet
        # hostNetwork: true # Allows to provide richer host metadata
        containers:
        - name: filebeat
          securityContext:
            runAsUser: 0
            # If using Red Hat OpenShift uncomment this:
            privileged: true
          volumeMounts:
          - name: varlogcontainers
            mountPath: /var/log/containers
          - name: varlogpods
            mountPath: /var/log/pods
          - name: varlibdockercontainers
            mountPath: /var/lib/docker/containers
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
        volumes:
        - name: varlogcontainers
          hostPath:
            path: /var/log/containers
        - name: varlogpods
          hostPath:
            path: /var/log/pods
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: filebeat
rules:
- apiGroups: [""] # "" indicates the core API group
  resources:
  - namespaces
  - pods
  - nodes
  verbs:
  - get
  - watch
  - list
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: filebeat
  namespace: elastic
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: filebeat
subjects:
- kind: ServiceAccount
  name: filebeat
  namespace: elastic
roleRef:
  kind: ClusterRole
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
# ---
# My Elasticsearch cluster already existed....
# apiVersion: elasticsearch.k8s.elastic.co/v1
# kind: Elasticsearch
# metadata:
#   name: elasticsearch
# spec:
#   version: 8.6.1
#   nodeSets:
#   - name: default
#     count: 3
#     config:
#       node.store.allow_mmap: false
---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
spec:
  version: 8.6.0
  count: 1
  elasticsearchRef:
    name: testing
# ...

Note the difference in daemonSet.podTemplate.spec and where the securityContext is applied:

  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: filebeat
        automountServiceAccountToken: true
        terminationGracePeriodSeconds: 30
        # dnsPolicy: ClusterFirstWithHostNet
        # hostNetwork: true # Allows to provide richer host metadata
        containers:
        - name: filebeat
          securityContext:
            runAsUser: 0
            # If using Red Hat OpenShift uncomment this:
            privileged: true
gittihub123 commented 1 year ago

Hi @naemono, thank you for the explanation. Filebeat works now, but our goal is to implement Elastic Agent and activate different types of modules to collect syslog from outside the cluster: Palo Alto, Cisco FTD, Cisco ASA, etc.

So far, the Elastic Agent is running and is managed by Fleet, but it is only collecting logs/metrics from OpenShift. The Elastic Stack is running in the same namespace, and I have connectivity between all pods (Elasticsearch, Kibana, Fleet Server & Elastic Agent).

This is my configuration:

apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: {{ .Values.kibana.name }}
  namespace: {{ .Values.namespace }}
spec:
  http:
    tls:
      certificate:
        secretName: {{ .Values.tls.certificate }}
  config:
    server.publicBaseUrl: "https://XXX.YYY.ZZZ/"
    xpack.fleet.agents.elasticsearch.hosts: ["https://esdev-es-http.elastic-dev.svc:9200"]
    xpack.fleet.agents.fleet_server.hosts: ["https://fleet-server-dev-agent-http.elastic-dev.svc:8220"]
    xpack.fleet.packages:
      - name: system
        version: latest
      - name: elastic_agent
        version: latest
      - name: fleet_server
        version: latest
    xpack.fleet.agentPolicies:
      - name: Fleet Server test
        id: eck-fleet-server
        is_default_fleet_server: true
        namespace: agent
        monitoring_enabled:
          - logs
          - metrics
        package_policies:
        - name: fleet_server-1
          id: fleet_server-1
          package:
            name: fleet_server
      - name: Elastic Agent on ECK policy
        id: eck-agent
        namespace: agent
        monitoring_enabled:
          - logs
          - metrics
        unenroll_timeout: 900
        is_default: true
        package_policies:
          - name: system-1
            id: system-1
            package:
              name: system
          - name: CiscoFTD
            id: CiscoFTD
            package:
              name: Cisco FTD
          - name: palo-alto
            id: palo-alto
            package:
              name: panos
  version: {{ .Values.version }}
  count: {{ .Values.kibana.nodes }}
  elasticsearchRef:
    name: {{ .Values.name }}
  podTemplate:
    spec:
      containers:
      - name: kibana
        resources:
          limits:
            memory: {{ .Values.kibana.resources.limits.memory }}
            cpu: {{ .Values.kibana.resources.limits.cpu }}

I believe the network flow would be something like this, right?

Syslog source (Cisco FTD, Palo Alto, etc.) -> OpenShift route (for example ciscoftd.dev.test.com) -> Elastic Agent SVC (created by me to expose the Elastic Agents) -> Elastic Agent pods.

Should this be possible, or should we try to do it another way?

Thanks.

naemono commented 1 year ago

Syslog source (Cisco FTD, Palo Alto, etc.) -> OpenShift route (for example ciscoftd.dev.test.com) -> Elastic Agent SVC (created by me to expose the Elastic Agents) -> Elastic Agent pods.

This solution makes sense to me, using a custom TCP agent integration...
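
A sketch of the Service piece of that flow (the Service name and port are hypothetical, and the selector should be verified against the labels ECK sets on the agent pods; the port must match the listen port configured in the custom TCP integration):

apiVersion: v1
kind: Service
metadata:
  name: elastic-agent-syslog # hypothetical
  namespace: elastic-dev
spec:
  selector:
    agent.k8s.elastic.co/name: elastic-agent-dev # verify this label on your agent pods
  ports:
  - name: syslog
    protocol: TCP
    port: 9001 # hypothetical; must match the integration's listen port
    targetPort: 9001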

ebuildy commented 1 year ago

The solution will not work if you use a keystore, because the operator inserts its keystore initContainer before the permissions container, so the keystore init runs first and hits the permission error...
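
If that ordering is the problem, one possible workaround is to override the operator-generated keystore init container by name so that it, too, runs as root. A sketch, assuming the generated container is named elastic-internal-init-keystore (verify with kubectl get pod -o yaml) and that ECK merges user-declared init containers with generated ones by name:

      initContainers:
      # assumed name of the ECK-generated keystore init container; verify on a running pod
      - name: elastic-internal-init-keystore
        securityContext:
          runAsUser: 0
      - name: permissions
        image: docker.elastic.co/beats/elastic-agent:8.5.0
        command:
        - sh
        - -c
        - chmod g+w /usr/share/elastic-agent/state && chgrp 1000 /usr/share/elastic-agent/state
        securityContext:
          runAsUser: 0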