SELinux causing high boot time for eks worker node

sameerjain1995 commented 1 year ago

Environment: production

AWS Region: ap-south-1
Instance Type(s): m5a.2xlarge
EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): 1.24
Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.24
AMI Version: v1.24.10-eks-48e63af
Kernel (e.g. uname -a): 5.10.165-143.735.amzn2.x86_64
When we enable SELinux on our custom AMI, the worker node takes 6-7 minutes to reach the Ready state and for pods to start scheduling on it. When we disable SELinux, the node is ready in 1-2 minutes. Could you please explain why SELinux causes this delay and how we can reduce it?

nafhn commented 1 year ago

I'm seeing the same symptoms using a nearly identical setup.

Something I noticed looking through system logs on the node is that cluster networking seems to come up quite slowly for the cluster with SELinux enabled.

cartermckinnon commented 1 year ago

Interesting. Do you see the same behavior on other distros, like the EKS Ubuntu AMI?

wvidana commented 4 months ago

I think this is still happening. We recently moved our images to the CIS hardened ones (which are based on the EKS-optimized AL2 image): https://aws.amazon.com/marketplace/pp/prodview-kfjezhuetoa3e

Our cloud-init times went from 20s on the EKS-optimized image to 300s on the CIS image. We even tried warm pools so that the EBS volume is already initialized, but that only brought it down to 250s. One thing we noticed is that the EKS-optimized images have SELinux disabled and the CIS images have it enabled.

Any ideas on how to speed up startup? Most of the time is lost in the /etc/eks/bootstrap.sh script, specifically at the end, where it creates files and symlinks. (I'm only pasting the last part of the logs, since the first part took less than 1 minute and this part took 4.)

2024-04-26T01:10:00+0000 [private-dns-name] INFO: retrieved PrivateDnsName: ip-10-30-52-31.ec2.internal
+ echo ip-10-30-52-31.ec2.internal
+ exit 0
‘/etc/eks/containerd/containerd-config.toml’ -> ‘/etc/containerd/config.toml’
‘/etc/eks/containerd/sandbox-image.service’ -> ‘/etc/systemd/system/sandbox-image.service’
Created symlink from /etc/systemd/system/multi-user.target.wants/containerd.service to /usr/lib/systemd/system/containerd.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/sandbox-image.service to /etc/systemd/system/sandbox-image.service.
‘/etc/eks/containerd/kubelet-containerd.service’ -> ‘/etc/systemd/system/kubelet.service’
Created symlink from /etc/systemd/system/multi-user.target.wants/kubelet.service to /etc/systemd/system/kubelet.service.
2024-04-26T01:14:10+0000 [eks-bootstrap] INFO: complete!
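
To pin down where the time goes in a log like the one above, here is a minimal Python sketch that finds the largest gap between consecutive timestamped lines. The function name `largest_gap` and the `sample` text are illustrative only (the sample reuses the two timestamped lines from this log), and it assumes every line of interest starts with a timestamp in the same `YYYY-MM-DDTHH:MM:SS+0000` form.

```python
import re
from datetime import datetime

# Leading timestamp as it appears in the bootstrap log, e.g. 2024-04-26T01:10:00+0000
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{4})\s")


def largest_gap(log_text):
    """Return (seconds, earlier_line, later_line) for the biggest gap
    between consecutive timestamped lines, or None if fewer than two exist."""
    stamped = []
    for line in log_text.splitlines():
        match = TIMESTAMP.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S%z")
            stamped.append((when, line))

    best = None
    for (t0, line0), (t1, line1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, line0, line1)
    return best


# Illustrative sample built from the two timestamped lines in the log above.
sample = """\
2024-04-26T01:10:00+0000 [private-dns-name] INFO: retrieved PrivateDnsName: ip-10-30-52-31.ec2.internal
+ exit 0
2024-04-26T01:14:10+0000 [eks-bootstrap] INFO: complete!
"""

gap_seconds, before, after = largest_gap(sample)
print(gap_seconds)  # 250.0
```

Lines without a timestamp (the `+ echo` trace and the symlink messages) are skipped, so the gap only tells you which timestamped pair brackets the slow part, not which individual command was slow.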

cartermckinnon commented 4 months ago

I think this should be improved by https://github.com/awslabs/amazon-eks-ami/pull/1773. There's some kind of issue with sudo during cloud-init when SELinux is enabled that causes sudo to take >20 seconds per invocation.
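
If you want to check whether a node is hitting the slow sudo behavior described above, one rough approach is to time a few invocations of a trivial command. This is only a sketch: `time_command` is a hypothetical helper, and using `sudo -n true` as the probe is an assumption, not something taken from the PR. Since the slowdown is reported during cloud-init, timings taken from an interactive shell after boot may not reproduce it.

```python
import subprocess
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(
            cmd,
            check=False,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return durations


# Example usage (assumed probe command; -n keeps sudo from prompting for a password):
#
#   for i, seconds in enumerate(time_command(["sudo", "-n", "true"]), start=1):
#       print(f"run {i}: {seconds:.2f}s")
```

Comparing the numbers on a node with SELinux enabled against one with it disabled should show whether each sudo call is anywhere near the >20 seconds mentioned.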

cartermckinnon commented 3 weeks ago

Going to close this, #1773 seemed to do the trick 👍

awslabs / amazon-eks-ami

SELinux causing high boot time for eks worker node #1394