stk0vrfl0w opened this issue 1 year ago
Hi stk0vrfl0w, thanks for the report!
This seems very similar to #2656 which was fixed earlier this year as part of the 1.12.0 release. If you could, would you confirm if the fix that @bcressey describes here works for you?
Thanks for the pointer, @rpkelly! If I understand the solution correctly, we'd need to migrate the AWS EBS CSI driver from an EKS managed add-on to either (a) deploying it via a kustomized Helm chart or (b) patching it after upgrades, because the chart does not (yet) support adding custom mount points.
Hey @stk0vrfl0w! I think you're correct: patching the AWS EBS CSI driver is the most accessible option at the moment. I cut https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1544, which will hopefully add proper support to the driver so you don't need to patch it. It would be good to confirm that this SELinux-specific fix solves your problem. It might not be the complete fix, but I expect this patch to reduce your attach time significantly, so I wanted you to try it first. There may also be a recursive chown involved, but the SELinux relabel is known to cause the symptoms you're experiencing when volumes contain lots of files.
I'll try out the suggestions as soon as I can. Unfortunately, all of the environments where this affects us are production, so I can't perform any testing there. I'm working on getting approval to set up a testing environment.
I was finally able to set up a test environment, but the SELinux-specific fix doesn't appear to have addressed the issue.
As an added wrinkle -- even if the SELinux fix did work, our environments use a small EKS managed node group whose AMIs aren't running SELinux in enforcing mode. On those hosts, the fix prevents the CSI driver from running properly because the expected directories/files are missing.
Hi @stk0vrfl0w,
Can you please describe in more detail what steps you took to apply the SELinux skip relabeling fix? The description in https://github.com/bottlerocket-os/bottlerocket/issues/2656#issuecomment-1408912457 might not be super clear.
Without the fix, I was able to observe the long pod start-up times with an EBS volume containing a few million files. After applying the SELinux mount fix, the pod start-up time went from 10 minutes (for 2 million files) down to 3 seconds.
Two million files on the attached volume:

```
bash-4.4$ df -hi ./
Filesystem    Inodes IUsed IFree IUse% Mounted on
/dev/nvme3n1    2.1M  2.1M     0  100% /data
```
Before SELinux mounts in the EBS node agent pod:

```yaml
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-07-12T23:42:47Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-07-12T23:51:50Z"
    status: "True"
    type: Ready
```

After SELinux mounts in the EBS node agent pod:

```yaml
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-07-13T19:13:47Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-07-13T19:13:50Z"
    status: "True"
    type: Ready
```
Here are the steps I took to get the EBS node agent to skip SELinux relabeling. The latest EBS CSI driver release lets you customize the volume mounts in the CSI node pods via the Helm chart.

First, create a `values.yaml` file based on https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/values.yaml and edit `volumes` and `volumeMounts` under the `node` section:
```yaml
# Add additional volume mounts on the node pods with node.volumes and node.volumeMounts
volumes:
  - hostPath:
      path: /sys/fs/selinux
      type: Directory
    name: selinuxfs
  - hostPath:
      path: /etc/selinux/config
      type: FileOrCreate
    name: selinux-config
  # Add additional volumes to be mounted onto the node pods:
  # - name: custom-dir
  #   hostPath:
  #     path: /path/to/dir
  #     type: Directory
volumeMounts:
  # SELinux specific volume mounts
  - mountPath: /sys/fs/selinux
    name: selinuxfs
  - mountPath: /etc/selinux/config
    name: selinux-config
```
Next, upgrade the `aws-ebs-csi-driver` Helm chart:

```shell
helm upgrade --install aws-ebs-csi-driver \
  --namespace kube-system \
  aws-ebs-csi-driver/aws-ebs-csi-driver -f values.yaml
```
Then describe the `ebs-csi-node` daemonset and ensure the mounts are there:

```
$ kubectl describe ds/ebs-csi-node -n kube-system | grep selinux -A 5
      /etc/selinux/config from selinux-config (rw)
      /sys/fs/selinux from selinuxfs (rw)
      /var/lib/kubelet from kubelet-dir (rw)
   node-driver-registrar:
    Image:      public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar:v2.8.0-eks-1-27-7
    Port:       <none>
    Host Port:  <none>
--
   selinuxfs:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/selinux
    HostPathType:  Directory
   selinux-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/selinux/config
    HostPathType:  FileOrCreate
Priority Class Name:  system-node-critical
Events:
  Type    Reason  Age  From  Message
  ----    ------  ---  ----  -------
```
Finally, use a StorageClass that sets the SELinux context as a mount option, so the relabel can be skipped:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  fsType: ext4
  type: gp3
mountOptions:
  - context="system_u:object_r:local_t:s0"
```
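For reference, a PersistentVolumeClaim consuming this class might look like the following sketch; the claim name and size are illustrative, not taken from the thread. With `volumeBindingMode: WaitForFirstConsumer`, the volume is provisioned only once the first pod using the claim is scheduled.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins-data        # illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc  # the class defined above
  resources:
    requests:
      storage: 100Gi        # illustrative size
```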
Please try that out and let us know if it works for you.
Regarding your concern here:
> As an added wrinkle -- even if the SELinux fix did work, our environments use a small EKS managed node group whose AMIs aren't running SELinux in enforcing mode. On those hosts, the fix prevents the CSI driver from running properly because the expected directories/files are missing.
A possible solution here would be to apply taints or labels to your nodes so that you can run a Bottlerocket-specific EBS CSI node daemonset alongside a non-Bottlerocket one.
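One way to sketch that split, assuming you apply a label of your own choosing to each node group (the `os-variant` label below is hypothetical):

```yaml
# Fragment of a Bottlerocket-specific ebs-csi-node daemonset pod template;
# "os-variant: bottlerocket" is a hypothetical label you would attach to
# the Bottlerocket node group (e.g. via kubectl label or the node group config).
spec:
  template:
    spec:
      nodeSelector:
        os-variant: bottlerocket
      # ...containers with the SELinux volumes/volumeMounts from the steps above...
```

A second daemonset without the SELinux mounts, selecting the other label value, would cover the non-Bottlerocket nodes.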
I just wanted to follow up on this issue, since I heard offline that folks might be using this to solve slow attachment for volumes with lots of tiny files. @etungsten has an excellent write-up on how to get this to work, but I've noticed that in some situations the `mountOptions` might need to be specified on the persistent volume rather than (or in addition to) the storage class. If the storage class alone doesn't work for you, adding it to the persistent volume appears to solve the problem. @stk0vrfl0w, do you know if these steps solved your original concern?
It looks like the mountOptions are copied from the StorageClass to the PV at creation time. So if the PV already exists, you need to update the PV manually; new PVs will pick up the option automatically.
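For an existing PV, one option is to edit it in place (e.g. with `kubectl edit pv <name>`) and add the option under `spec`; the fragment below is a sketch, with the context value matching the StorageClass example earlier in the thread:

```yaml
# Fragment of an already-provisioned PersistentVolume after adding the
# SELinux context mount option.
spec:
  mountOptions:
    - context="system_u:object_r:local_t:s0"
```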
@etungsten Thank you! The same applies to the vsphere-csi-driver in combination with EKS Anywhere.
```yaml
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: vsphere-csi-node
  namespace: vmware-system-csi
spec:
  selector:
    matchLabels:
      app: vsphere-csi-node
  updateStrategy:
    type: "RollingUpdate"
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: vsphere-csi-node
        role: vsphere-csi
    spec:
      priorityClassName: system-node-critical
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: vsphere-csi-node
      hostNetwork: true
      dnsPolicy: "ClusterFirstWithHostNet"
      containers:
        - name: node-driver-registrar
          image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.8.0
          args:
            - "--v=5"
            - "--csi-address=$(ADDRESS)"
            - "--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)"
          env:
            - name: ADDRESS
              value: /csi/csi.sock
            - name: DRIVER_REG_SOCK_PATH
              value: /var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
            - name: registration-dir
              mountPath: /registration
            # SELinux specific volume mounts
            - mountPath: /sys/fs/selinux
              name: selinuxfs
            - mountPath: /etc/selinux/config
              name: selinux-config
          livenessProbe:
            exec:
              command:
                - /csi-node-driver-registrar
                - --kubelet-registration-path=/var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock
                - --mode=kubelet-registration-probe
            initialDelaySeconds: 3
        - name: vsphere-csi-node
          image: gcr.io/cloud-provider-vsphere/csi/release/driver:v3.1.2
          args:
            - "--fss-name=internal-feature-states.csi.vsphere.vmware.com"
            - "--fss-namespace=$(CSI_NAMESPACE)"
          imagePullPolicy: "Always"
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: CSI_ENDPOINT
              value: unix:///csi/csi.sock
            - name: MAX_VOLUMES_PER_NODE
              # Maximum number of volumes that the controller can publish to the
              # node. If unset or zero, Kubernetes decides how many volumes can
              # be published by the controller to the node.
              value: "59"
            - name: X_CSI_MODE
              value: "node"
            - name: X_CSI_SPEC_REQ_VALIDATION
              value: "false"
            - name: X_CSI_SPEC_DISABLE_LEN_CHECK
              value: "true"
            - name: LOGGER_LEVEL
              value: "PRODUCTION" # Options: DEVELOPMENT, PRODUCTION
            - name: GODEBUG
              value: x509sha1=1
            - name: CSI_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: NODEGETINFO_WATCH_TIMEOUT_MINUTES
              value: "1"
          securityContext:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN"]
            allowPrivilegeEscalation: true
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
            - name: pods-mount-dir
              mountPath: /var/lib/kubelet
              # needed so that any mounts setup inside this container are
              # propagated back to the host machine.
              mountPropagation: "Bidirectional"
            - name: device-dir
              mountPath: /dev
            - name: blocks-dir
              mountPath: /sys/block
            - name: sys-devices-dir
              mountPath: /sys/devices
          ports:
            - name: healthz
              containerPort: 9808
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /healthz
              port: healthz
            initialDelaySeconds: 10
            timeoutSeconds: 5
            periodSeconds: 5
            failureThreshold: 3
        - name: liveness-probe
          image: registry.k8s.io/sig-storage/livenessprobe:v2.10.0
          args:
            - "--v=4"
            - "--csi-address=/csi/csi.sock"
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
      volumes:
        - name: registration-dir
          hostPath:
            path: /var/lib/kubelet/plugins_registry
            type: Directory
        - name: plugin-dir
          hostPath:
            path: /var/lib/kubelet/plugins/csi.vsphere.vmware.com
            type: DirectoryOrCreate
        - name: pods-mount-dir
          hostPath:
            path: /var/lib/kubelet
            type: Directory
        - name: device-dir
          hostPath:
            path: /dev
        - name: blocks-dir
          hostPath:
            path: /sys/block
            type: Directory
        - name: sys-devices-dir
          hostPath:
            path: /sys/devices
            type: Directory
        # SELinux specific volumes
        - name: selinuxfs
          hostPath:
            path: /sys/fs/selinux
            type: Directory
        - name: selinux-config
          hostPath:
            path: /etc/selinux/config
            type: FileOrCreate
      tolerations:
        - effect: NoExecute
          operator: Exists
        - effect: NoSchedule
          operator: Exists
```
Image I'm using: ami-0f99e88195df133a4 -- EKS 1.23

What I expected to happen: When setting the following securityContext for a pod, I expected that a recursive chown operation on the attached EBS volume would not happen. Both the AL2 and Ubuntu EKS-tuned AMIs appear to respect the setting properly.

What actually happened: All of a pod's containers remain stuck in the initializing state until a recursive chown on the attached persistent volume completes, at which point the containers begin their initialization. For volumes with a large number of files, this can take a significant amount of time.

As an example, we run Jenkins on several clusters, which stores state as lots of small files on disk. Our average deployment has anywhere between 10 and 20 million files on disk, which takes 45 minutes or more to complete the chown process when using the Bottlerocket AMIs chosen by Karpenter. As stated before, switching the AMI to Ubuntu or AL2 respects the setting properly, and the volume is available for use after ~10 seconds.

How to reproduce the problem: Create a PVC that's attached to a Deployment or StatefulSet and generate a million small files on the volume. Forcing a restart of the Deployment or StatefulSet should result in a delay of several minutes while waiting for the chown to complete.
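The file-generation part of the reproduction can be sketched as a small script. The directory layout, `TARGET_DIR`, and `COUNT` defaults are illustrative; on a real cluster you would run this inside a pod against the PVC mount path and raise the count into the millions.

```shell
#!/bin/sh
# Populate a directory with many small empty files to reproduce the
# slow relabel/chown symptom. Defaults are small; raise COUNT to 1000000+
# to see multi-minute attach delays on an affected volume.
TARGET_DIR="${TARGET_DIR:-./data}"
COUNT="${COUNT:-1000}"

mkdir -p "$TARGET_DIR"
i=0
while [ "$i" -lt "$COUNT" ]; do
  # Spread files across subdirectories to keep directory sizes manageable.
  d="$TARGET_DIR/$((i / 1000))"
  mkdir -p "$d"
  : > "$d/file-$i"
  i=$((i + 1))
done
echo "created $COUNT files under $TARGET_DIR"
```

After populating the volume, restarting the workload that mounts it should show the delayed `Initialized` -> `Ready` transition described above.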