bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

fsGroupChangePolicy appears to be ignored #3151

Open stk0vrfl0w opened 1 year ago

stk0vrfl0w commented 1 year ago

Image I'm using: ami-0f99e88195df133a4 -- EKS 1.23

What I expected to happen: When setting the following securityContext for a pod, I expected that a recursive chown operation on the attached EBS volume would not happen. Both AL2 & Ubuntu EKS-tuned AMIs appear to properly respect the setting.

securityContext:
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000
  fsGroupChangePolicy: "OnRootMismatch"

What actually happened: All of a pod's containers remain stuck in the initializing state until a recursive chown of the attached persistent volume completes, at which point the containers begin initializing. For volumes with a large number of files, this can take a significant amount of time.

As an example, we run Jenkins on several clusters, and it stores its state as lots of small files on disk. Our average deployment has anywhere between 10 and 20 million files on disk, and the chown takes 45 minutes or more to complete when using the Bottlerocket AMIs chosen by Karpenter. As stated before, switching the AMI to Ubuntu or AL2 properly respects the setting and the volume is available for use after ~10 seconds.

How to reproduce the problem: Create a PVC attached to a Deployment or StatefulSet and generate a million small files on the volume. Forcing a restart of the Deployment or StatefulSet should then result in a delay of several minutes while the chown completes; a sketch of such a manifest follows.
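
For reference, a minimal manifest along these lines should reproduce it. This is a sketch rather than a tested repro: the PVC size, image, and file counts are illustrative.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: many-files-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: many-files
spec:
  replicas: 1
  selector:
    matchLabels:
      app: many-files
  template:
    metadata:
      labels:
        app: many-files
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        fsGroupChangePolicy: "OnRootMismatch"
      containers:
        - name: writer
          image: busybox
          # Populate the volume with ~1M small files (slow but simple),
          # then idle. A rollout restart after this finishes should show
          # the long chown delay on affected nodes.
          command: ["sh", "-c", "for i in $(seq 1 1000); do mkdir -p /data/d$i; for j in $(seq 1 1000); do touch /data/d$i/f$j; done; done; while true; do sleep 3600; done"]
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: many-files-pvc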

rpkelly commented 1 year ago

Hi stk0vrfl0w, thanks for the report!

This seems very similar to #2656, which was fixed earlier this year as part of the 1.12.0 release. Could you confirm whether the fix that @bcressey describes here works for you?

stk0vrfl0w commented 1 year ago

Thanks for the pointer, @rpkelly! If I understand the solution correctly, we'd need to migrate the AWS EBS CSI driver from an EKS managed add-on to either (a) deploying it via a Kustomize-patched Helm chart or (b) patching it after upgrades, because the chart does not (yet) support adding custom mount points.
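
Something like this strategic-merge patch is what I have in mind for option (b) (or as the Kustomize patch in option (a)). The DaemonSet, namespace, and container names below are the upstream defaults as far as I can tell, so verify them against your install; the mounts mirror the fix described in #2656.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebs-csi-node
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: ebs-plugin
          volumeMounts:
            # SELinux-specific mounts so the driver can tell that the
            # host is running SELinux; see #2656 for the background.
            - name: selinuxfs
              mountPath: /sys/fs/selinux
            - name: selinux-config
              mountPath: /etc/selinux/config
      volumes:
        - name: selinuxfs
          hostPath:
            path: /sys/fs/selinux
            type: Directory
        - name: selinux-config
          hostPath:
            path: /etc/selinux/config
            type: FileOrCreate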

yeazelm commented 1 year ago

Hey @stk0vrfl0w! I think you are correct: patching the AWS EBS CSI driver is the most accessible option at the moment. I cut https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1544, which will hopefully add proper support to the driver so you don't need to patch it. It would be good to confirm whether this SELinux-specific fix solves your problem. It might not be the complete fix, but I think this patch would reduce your attach time by a significant amount, so I wanted to have you try it first. There might also be a recursive chown at play, but the SELinux relabel is known to cause the symptoms you are experiencing when volumes have lots of files.

stk0vrfl0w commented 1 year ago

I'll try out the suggestions as soon as I can. Unfortunately, all the environments where this affects us are production, so I can't perform any testing there. I'm working on getting approval to set up a testing environment.

stk0vrfl0w commented 1 year ago

I was finally able to set up a test environment, but the SELinux-specific fix doesn't appear to have addressed the issue.

As an added wrinkle: even if the SELinux fix did work, our environments use a small EKS managed node group whose AMIs aren't running SELinux in enforcing mode. On those hosts, the fix prevents the CSI driver from running properly because the expected directories/files are missing.

etungsten commented 1 year ago

Hi @stk0vrfl0w,

Can you please describe in more detail what steps you took to apply the SELinux skip relabeling fix? The description in https://github.com/bottlerocket-os/bottlerocket/issues/2656#issuecomment-1408912457 might not be super clear.

Without the fix, I was able to observe the long pod start-up times with an EBS volume holding a few million files. After applying the SELinux mount fix, the pod start-up time went from 10 minutes for 2 million files down to 3 seconds.

Two million files on the attached volume:

bash-4.4$ df -hi ./
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/nvme3n1     2.1M  2.1M     0  100% /data

Before SELinux mounts in the EBS node agent pod:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-07-12T23:42:47Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-07-12T23:51:50Z"
    status: "True"
    type: Ready

After SELinux mounts in the EBS node agent pod:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-07-13T19:13:47Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-07-13T19:13:50Z"
    status: "True"
    type: Ready

Here are the steps I took to get the EBS node agent to skip SELinux relabeling. The latest EBS CSI driver release lets you customize the volume mounts in the CSI node pods via the Helm chart.
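
For example, values along these lines should add the mounts. I'm assuming the node.volumes and node.volumeMounts keys here, so double-check them against the chart version you're running.

# values.yaml for the aws-ebs-csi-driver Helm chart
node:
  volumes:
    - name: selinuxfs
      hostPath:
        path: /sys/fs/selinux
        type: Directory
    - name: selinux-config
      hostPath:
        path: /etc/selinux/config
        type: FileOrCreate
  volumeMounts:
    # Lets the node agent see that the host runs SELinux, so the
    # volume is mounted with a context option instead of relabeled.
    - name: selinuxfs
      mountPath: /sys/fs/selinux
    - name: selinux-config
      mountPath: /etc/selinux/config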

Please try that out and let us know if it works for you.

Regarding your concern here:

As an added wrinkle: even if the SELinux fix did work, our environments use a small EKS managed node group whose AMIs aren't running SELinux in enforcing mode. On those hosts, the fix prevents the CSI driver from running properly because the expected directories/files are missing.

A possible solution here would be to apply taints to your nodes so that you can run a Bottlerocket-specific EBS CSI node DaemonSet alongside a non-Bottlerocket one; a sketch follows.
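
Roughly, the Bottlerocket-specific DaemonSet's pod spec would carry something like the excerpt below. The node-role/bottlerocket label and taint are made-up names for illustration; you'd apply them to the Bottlerocket nodes yourself (e.g. via the node group or Karpenter provisioner).

# Excerpt from the pod template spec of the Bottlerocket-specific
# EBS CSI node DaemonSet. Assumes nodes were prepared with:
#   kubectl label nodes <node> node-role/bottlerocket=true
#   kubectl taint nodes <node> node-role/bottlerocket=true:NoSchedule
nodeSelector:
  node-role/bottlerocket: "true"
tolerations:
  - key: node-role/bottlerocket
    operator: Equal
    value: "true"
    effect: NoSchedule
# The non-Bottlerocket DaemonSet would instead exclude these nodes,
# e.g. with a nodeAffinity DoesNotExist match on the same label.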

yeazelm commented 8 months ago

I just wanted to follow up on this issue, since I heard offline that folks might be using this approach to solve slow attachment of volumes with lots of tiny files. @etungsten has an excellent write-up on how to get this to work, but I've noticed that in some situations the mountOptions might need to be specified on the PersistentVolume rather than (or in addition to) the StorageClass. If the StorageClass alone doesn't work for you, adding the option to the PersistentVolume appears to solve the problem; an example StorageClass follows. @stk0vrfl0w, do you know if these steps solved your original concern?
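
For reference, the option in question is the SELinux context mount option. On the StorageClass it looks roughly like this; the context value shown is an assumption on my part (the one typically suggested for Bottlerocket hosts), so verify it against your setup before relying on it.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
mountOptions:
  # Mount with a fixed SELinux context so nothing has to walk and
  # relabel every file on attach.
  - context="system_u:object_r:local_t:s0"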

Cytrian commented 8 months ago

It looks like the mountOptions are copied from the StorageClass to the PV upon creation, so if the PV already exists you need to update the PV manually. New PVs will already have the option.
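
That is, for a pre-existing volume you edit the PV spec directly, roughly like this (the PV name is a placeholder):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-0123abcd   # placeholder: the name of your existing PV
spec:
  mountOptions:
    - context="system_u:object_r:local_t:s0"
  # ...leave the rest of the existing spec unchanged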

janre commented 8 months ago

@etungsten Thank you! The same applies to the vsphere-csi-driver in combination with EKS Anywhere.

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: vsphere-csi-node
  namespace: vmware-system-csi
spec:
  selector:
    matchLabels:
      app: vsphere-csi-node
  updateStrategy:
    type: "RollingUpdate"
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: vsphere-csi-node
        role: vsphere-csi
    spec:
      priorityClassName: system-node-critical
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: vsphere-csi-node
      hostNetwork: true
      dnsPolicy: "ClusterFirstWithHostNet"
      containers:
        - name: node-driver-registrar
          image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.8.0
          args:
            - "--v=5"
            - "--csi-address=$(ADDRESS)"
            - "--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)"
          env:
            - name: ADDRESS
              value: /csi/csi.sock
            - name: DRIVER_REG_SOCK_PATH
              value: /var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
            - name: registration-dir
              mountPath: /registration
            # SELinux specific volume mounts
            - mountPath: /sys/fs/selinux
              name: selinuxfs
            - mountPath: /etc/selinux/config
              name: selinux-config
          livenessProbe:
            exec:
              command:
              - /csi-node-driver-registrar
              - --kubelet-registration-path=/var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock
              - --mode=kubelet-registration-probe
            initialDelaySeconds: 3
        - name: vsphere-csi-node
          image: gcr.io/cloud-provider-vsphere/csi/release/driver:v3.1.2
          args:
            - "--fss-name=internal-feature-states.csi.vsphere.vmware.com"
            - "--fss-namespace=$(CSI_NAMESPACE)"
          imagePullPolicy: "Always"
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: CSI_ENDPOINT
              value: unix:///csi/csi.sock
            - name: MAX_VOLUMES_PER_NODE
              value: "59" # Maximum number of volumes that controller can publish to the node. If value is not set or zero Kubernetes decide how many volumes can be published by the controller to the node.
            - name: X_CSI_MODE
              value: "node"
            - name: X_CSI_SPEC_REQ_VALIDATION
              value: "false"
            - name: X_CSI_SPEC_DISABLE_LEN_CHECK
              value: "true"
            - name: LOGGER_LEVEL
              value: "PRODUCTION" # Options: DEVELOPMENT, PRODUCTION
            - name: GODEBUG
              value: x509sha1=1
            - name: CSI_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: NODEGETINFO_WATCH_TIMEOUT_MINUTES
              value: "1"
          securityContext:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN"]
            allowPrivilegeEscalation: true
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
            - name: pods-mount-dir
              mountPath: /var/lib/kubelet
              # needed so that any mounts setup inside this container are
              # propagated back to the host machine.
              mountPropagation: "Bidirectional"
            - name: device-dir
              mountPath: /dev
            - name: blocks-dir
              mountPath: /sys/block
            - name: sys-devices-dir
              mountPath: /sys/devices
          ports:
            - name: healthz
              containerPort: 9808
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /healthz
              port: healthz
            initialDelaySeconds: 10
            timeoutSeconds: 5
            periodSeconds: 5
            failureThreshold: 3
        - name: liveness-probe
          image: registry.k8s.io/sig-storage/livenessprobe:v2.10.0
          args:
            - "--v=4"
            - "--csi-address=/csi/csi.sock"
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
      volumes:
        - name: registration-dir
          hostPath:
            path: /var/lib/kubelet/plugins_registry
            type: Directory
        - name: plugin-dir
          hostPath:
            path: /var/lib/kubelet/plugins/csi.vsphere.vmware.com
            type: DirectoryOrCreate
        - name: pods-mount-dir
          hostPath:
            path: /var/lib/kubelet
            type: Directory
        - name: device-dir
          hostPath:
            path: /dev
        - name: blocks-dir
          hostPath:
            path: /sys/block
            type: Directory
        - name: sys-devices-dir
          hostPath:
            path: /sys/devices
            type: Directory
        # SELinux specific volumes
        - name: selinuxfs
          hostPath:
            path: /sys/fs/selinux
            type: Directory
        - name: selinux-config
          hostPath:
            path: /etc/selinux/config
            type: FileOrCreate
      tolerations:
        - effect: NoExecute
          operator: Exists
        - effect: NoSchedule
          operator: Exists