fsGroupChangePolicy appears to be ignored

stk0vrfl0w commented 1 year ago

Image I'm using: ami-0f99e88195df133a4 -- EKS 1.23

What I expected to happen: When setting the following securityContext for a pod, I expected that a recursive chown operation on the attached EBS volume would not happen. Both AL2 & Ubuntu EKS-tuned AMIs appear to properly respect the setting.

https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods

securityContext:
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000
  fsGroupChangePolicy: "OnRootMismatch"

What actually happened: All of a pods containers remain stuck in the initializing state until a recursive chown on the attached persistent volume is completed -- at which point the containers will then begin their Initialization. For volumes with a large amount of files, this can take a significant amount of time.

As an example, we run Jenkins on several clusters -- which stores state as lots of small files on disk. Our average deployment has anywhere between 10 - 20 million files on disk which takes 45 minutes or more to complete the chown process when using the BottleRocket AMIs chosen by Karpenter. As stated before, switching the AMI to be Ubuntu or AL2 properly respects the setting and the volume is available for use after ~ 10 seconds.

How to reproduce the problem: Create a PVC that's attached to a Deployment or StatefulSet and generate a million small files on the volume. Forcing a restart of the deployment or statefulset should result in a few minute delay while waiting for the chown to complete.

rpkelly commented 1 year ago

Hi stk0vrfl0w, thanks for the report!

This seems very similar to #2656 which was fixed earlier this year as part of the 1.12.0 release. If you could, would you confirm if the fix that @bcressey describes here works for you?

stk0vrfl0w commented 1 year ago

Thanks for the pointer, @rpkelly! If I understand the solution correctly, it seems that we'd need to migrate the AWS EBS CSI driver from being an EKS managed add-on to either (a) deploying it via a kustomized helm chart or (b) patching after upgrades. This is because the chart does not (yet) support adding custom mount points.

https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/templates/node.yaml

yeazelm commented 1 year ago

Hey @stk0vrfl0w! I think you are correct, the current situation is that patching the AWS EBS CSI driver is the most accessible option atm, I cut https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1544 which will hopefully add proper support to the driver so you don't need to patch it. It would be good to confirm this SELinux-specific fix solves your problem. It might not be the complete fix but I think this patch would reduce your attach time by a significant amount so I wanted to have you try that first. There might also be a recursive chown, but the SELinux relabel is known to cause the symptoms you are experiencing when volumes have lots of files.

stk0vrfl0w commented 1 year ago

I'll try out the suggestions as soon as I can. Unfortunately, since all the environments where it affects us are production, I can't perform any testing there. So, I'm working on getting approvals for setting up a testing environment.

stk0vrfl0w commented 1 year ago

I was finally able to set up a test environment, but the SELinux specific fix doesn't appear to have addressed the issue.

As an added wrinkle -- even if the SELinux fix did work, our environments use a small EKS managed node group whose AMIs aren't running SELinux in enforcement. On those hosts, the fix prevents the CSI driver from properly running because the expected directories/files are missing.

etungsten commented 1 year ago

Hi @stk0vrfl0w,

Can you please describe in more detail what steps you took to apply the SELinux skip relabeling fix? The description in https://github.com/bottlerocket-os/bottlerocket/issues/2656#issuecomment-1408912457 might not be super clear.

Without the fix, I was able to observe the long pod start-up times with an EBS volume with a few million files. After applying the SELinux mount fix, the pod start up time went from 10 mins for 2 million files down to 3 seconds.

Two million files on the attached volume

bash-4.4$ df -hi ./
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/nvme3n1     2.1M  2.1M     0  100% /data

Before SELinux mounts in the EBS node agent pod:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-07-12T23:42:47Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-07-12T23:51:50Z"
    status: "True"
    type: Ready

After SELinux mounts in the EBS node agent pod:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-07-13T19:13:47Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-07-13T19:13:50Z"
    status: "True"
    type: Ready

Here are the steps I did to get the EBS node agent to skip SELinux relabeling: The latest EBS CSI driver release lets you customize the volume mounts in the CSI node pods via the helm chart.

Follow steps in https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/install.md#helm to add the helm chart.

Create a values.yaml file based on https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/values.yaml and edit volumes and volumeMounts following under the node section:

# Add additional volume mounts on the node pods with node.volumes and node.volumeMounts
volumes:
- hostPath:
    path: /sys/fs/selinux
    type: Directory
  name: selinuxfs
- hostPath:
    path: /etc/selinux/config
    type: FileOrCreate
  name: selinux-config

# Add additional volumes to be mounted onto the node pods:
# - name: custom-dir
#   hostPath:
#     path: /path/to/dir
#     type: Directory
volumeMounts:
# SELinux specific volume mounts
- mountPath: /sys/fs/selinux
 name: selinuxfs
- mountPath: /etc/selinux/config
 name: selinux-config

Install the aws-ebs-csi-driver helm chart:

helm upgrade --install aws-ebs-csi-driver \
--namespace kube-system \
aws-ebs-csi-driver/aws-ebs-csi-driver -f values.yaml

Check the ebs-csi-node daemonset and ensure the mounts are there:

$ kubectl describe ds/ebs-csi-node -n kube-system | grep selinux -A 5
  /etc/selinux/config from selinux-config (rw)
  /sys/fs/selinux from selinuxfs (rw)
  /var/lib/kubelet from kubelet-dir (rw)
node-driver-registrar:
Image:      public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar:v2.8.0-eks-1-27-7
Port:       <none>
Host Port:  <none>
--
selinuxfs:
Type:          HostPath (bare host directory volume)
Path:          /sys/fs/selinux
HostPathType:  Directory
selinux-config:
Type:               HostPath (bare host directory volume)
Path:               /etc/selinux/config
HostPathType:       FileOrCreate
Priority Class Name:  system-node-critical
Events:
Type    Reason            Age   From                  Message
----    ------            ----  ----                  -------

Specify mount options in the storage class for the CSI driver:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
fsType: ext4
type: gp3
mountOptions:
- context="system_u:object_r:local_t:s0"

Please try that out and let us know if it works for you.

Regarding your concern here:

As an added wrinkle -- even if the SELinux fix did work, our environments use a small EKS managed node group whose AMIs aren't running SELinux in enforcement. On those hosts, the fix prevents the CSI driver from properly running because the expected directories/files are missing.

A possible solution here would be to apply taints to your nodes so that you can apply a Bottlerocket-specific EBS CSI node daemonset and a non-Bottlerocket specific EBS CSI node daemonset.

yeazelm commented 8 months ago

I just wanted to follow up on this issue since heard offline that folks might be using this for solving slow attachment issues with lots of tiny files. @etungsten has an excellent write up on how to get this to work but I've noticed in some situations the mountOptions might need to be specified at the persistent volume rather (or in additional) to the storage class. If the storage class alone doesn't work for you, adding it to the persistent volume looks to solve the problem. @stk0vrfl0w do you know if these steps solved your original concern?

Cytrian commented 8 months ago

It looks like the mountOptions are copied from the StorageClass to the PV upon creation. So if the PV already exists, you need to manually update the PV. New PVs will already have the option.

janre commented 8 months ago

@etungsten Thank you! Same applies to vsphere-csi-driver in combination with eks anywhere.

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: vsphere-csi-node
  namespace: vmware-system-csi
spec:
  selector:
    matchLabels:
      app: vsphere-csi-node
  updateStrategy:
    type: "RollingUpdate"
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: vsphere-csi-node
        role: vsphere-csi
    spec:
      priorityClassName: system-node-critical
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: vsphere-csi-node
      hostNetwork: true
      dnsPolicy: "ClusterFirstWithHostNet"
      containers:
        - name: node-driver-registrar
          image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.8.0
          args:
            - "--v=5"
            - "--csi-address=$(ADDRESS)"
            - "--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)"
          env:
            - name: ADDRESS
              value: /csi/csi.sock
            - name: DRIVER_REG_SOCK_PATH
              value: /var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
            - name: registration-dir
              mountPath: /registration
            # SELinux specific volume mounts
            - mountPath: /sys/fs/selinux
              name: selinuxfs
            - mountPath: /etc/selinux/config
              name: selinux-config
          livenessProbe:
            exec:
              command:
              - /csi-node-driver-registrar
              - --kubelet-registration-path=/var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock
              - --mode=kubelet-registration-probe
            initialDelaySeconds: 3
        - name: vsphere-csi-node
          image: gcr.io/cloud-provider-vsphere/csi/release/driver:v3.1.2
          args:
            - "--fss-name=internal-feature-states.csi.vsphere.vmware.com"
            - "--fss-namespace=$(CSI_NAMESPACE)"
          imagePullPolicy: "Always"
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: CSI_ENDPOINT
              value: unix:///csi/csi.sock
            - name: MAX_VOLUMES_PER_NODE
              value: "59" # Maximum number of volumes that controller can publish to the node. If value is not set or zero Kubernetes decide how many volumes can be published by the controller to the node.
            - name: X_CSI_MODE
              value: "node"
            - name: X_CSI_SPEC_REQ_VALIDATION
              value: "false"
            - name: X_CSI_SPEC_DISABLE_LEN_CHECK
              value: "true"
            - name: LOGGER_LEVEL
              value: "PRODUCTION" # Options: DEVELOPMENT, PRODUCTION
            - name: GODEBUG
              value: x509sha1=1
            - name: CSI_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: NODEGETINFO_WATCH_TIMEOUT_MINUTES
              value: "1"
          securityContext:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN"]
            allowPrivilegeEscalation: true
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
            - name: pods-mount-dir
              mountPath: /var/lib/kubelet
              # needed so that any mounts setup inside this container are
              # propagated back to the host machine.
              mountPropagation: "Bidirectional"
            - name: device-dir
              mountPath: /dev
            - name: blocks-dir
              mountPath: /sys/block
            - name: sys-devices-dir
              mountPath: /sys/devices
          ports:
            - name: healthz
              containerPort: 9808
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /healthz
              port: healthz
            initialDelaySeconds: 10
            timeoutSeconds: 5
            periodSeconds: 5
            failureThreshold: 3
        - name: liveness-probe
          image: registry.k8s.io/sig-storage/livenessprobe:v2.10.0
          args:
            - "--v=4"
            - "--csi-address=/csi/csi.sock"
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
      volumes:
        - name: registration-dir
          hostPath:
            path: /var/lib/kubelet/plugins_registry
            type: Directory
        - name: plugin-dir
          hostPath:
            path: /var/lib/kubelet/plugins/csi.vsphere.vmware.com
            type: DirectoryOrCreate
        - name: pods-mount-dir
          hostPath:
            path: /var/lib/kubelet
            type: Directory
        - name: device-dir
          hostPath:
            path: /dev
        - name: blocks-dir
          hostPath:
            path: /sys/block
            type: Directory
        - name: sys-devices-dir
          hostPath:
            path: /sys/devices
            type: Directory
        # SELinux specific volumes
        - name: selinuxfs 
          hostPath:
            path: /sys/fs/selinux
            type: Directory
        - name: selinux-config
          hostPath:
            path: /etc/selinux/config
            type: FileOrCreate
      tolerations:
        - effect: NoExecute
          operator: Exists
        - effect: NoSchedule
          operator: Exists

bottlerocket-os / bottlerocket

fsGroupChangePolicy appears to be ignored #3151