aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [request]: Speed up cluster autoscale by automatically creating new EKS Optimized AMIs with security patches #1712

Open ssanders1449 opened 2 years ago

ssanders1449 commented 2 years ago

Tell us about your request

Downloading and installing security patches can add 45 seconds to the boot time when the Cluster Autoscaler adds new nodes. This affects responsiveness to traffic spikes. Therefore, the request is twofold:

1) Add a process that automatically releases new sub-versions of EKS Optimized AMIs containing security patches whenever a security patch is released that would normally be installed during boot of the original AMI.

2) Add an SSM API that receives an AMI ID and returns the ID of a new AMI that is exactly the same as the original but includes all security patches.

Which service(s) is this request for?

EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

One of the factors that affects the responsiveness of the Cluster Autoscaler is how long new nodes take to initialize. Part of node bootup is checking for, downloading, and installing security patches. If all security patches are already included in the AMI, bootup time is reduced by as much as 45 seconds.

Even though new AMIs are released approximately every 10 days, it is not good enough to just always use the latest recommended AMI for two reasons:

1) New AMI versions contain changes that are not just security related and therefore it is dangerous to automatically take them into production systems without testing (see https://github.com/aws/containers-roadmap/issues/319). However, it should be perfectly safe to take a new AMI that is identical to the original one except for the security patches, since even the original AMI will anyway download/install these same security patches during initial boot.

2) Security patches are released more often than recommended AMIs. In a recent test, I took an AMI that was 6 days old, and when the node booted, it downloaded and installed 4 security patches. When I created a custom AMI that included these 4 patches, node bootup time was reduced by 45 seconds. See the attached excerpts from cloud-init-output.log for the recommended AMI versus the custom AMI.
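For anyone who wants to repeat the comparison, here is a minimal sketch of how the two logs could be compared. It assumes both logs end with cloud-init's default final message, which reports `Up <seconds> seconds`; if the final message has been customized, the pattern will not match. The function names are hypothetical.

```python
import re
from typing import Optional

# cloud-init's default final message ends with "Up <seconds> seconds".
_UP_RE = re.compile(r"Up (\d+(?:\.\d+)?) seconds")


def boot_seconds(log_text: str) -> Optional[float]:
    """Return the last 'Up N seconds' value found in a cloud-init log, or None."""
    matches = _UP_RE.findall(log_text)
    return float(matches[-1]) if matches else None


def boot_time_saved(recommended_log: str, custom_log: str) -> Optional[float]:
    """Seconds saved by the custom AMI, or None if either log lacks the message."""
    slow = boot_seconds(recommended_log)
    fast = boot_seconds(custom_log)
    if slow is None or fast is None:
        return None
    return slow - fast
```

Reading the contents of recommended-cloud-init-output.log and custom-cloud-init-output.log and passing them to `boot_time_saved` would give the difference in reported uptime between the two boots.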

So the request is to:

1) Create an automated procedure that will create new AMIs with a naming convention that includes the original AMI name plus a patch version suffix. For example: amazon-eks-node-1.21-v20220406-p1, amazon-eks-node-1.21-v20220406-p2, etc.

2) Add an SSM API where we can pass in the original AMI name (e.g. amazon-eks-node-1.21-v20220406) and get back the ID/Name of the latest patched AMI which is based on the original AMI. This can be used to automate updating of AutoScaling Groups with the patched AMI ID.
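To make the proposed lookup concrete, here is a rough sketch of the selection logic, assuming the `-pN` suffix convention suggested above. No such AMIs or API exist today; `latest_patched_ami` is a hypothetical helper, and the list of names would have to come from wherever the patched AMIs were published.

```python
import re
from typing import Iterable, Optional


def latest_patched_ami(base_name: str, ami_names: Iterable[str]) -> Optional[str]:
    """Pick the name with the highest -pN suffix that is based on base_name.

    Assumes the hypothetical convention '<base_name>-p<N>' proposed in this issue.
    Returns None when no patched AMI for that base exists.
    """
    pattern = re.compile(re.escape(base_name) + r"-p(\d+)")
    best_name, best_patch = None, -1
    for name in ami_names:
        match = pattern.fullmatch(name)
        # Compare patch numbers numerically so that p10 sorts after p2.
        if match and int(match.group(1)) > best_patch:
            best_name, best_patch = name, int(match.group(1))
    return best_name
```

The patch number is compared as an integer rather than as text, so a later patch such as `-p10` is preferred over `-p2`.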

Are you currently working around this issue?

We are considering using the techniques described in https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk-ami-patching.html to create a Lambda that will periodically check for security updates, generate new custom AMIs, and patch the ASG. However, we are probably not the only ones who can benefit from this.
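As a rough illustration of the last step of that workaround, the sketch below points a launch template at a newly built AMI. It assumes the ASG uses a launch template and tracks its default version; the function name and the IDs are hypothetical. The `ec2` argument is expected to be a boto3 EC2 client (`boto3.client("ec2")`), passed in so the function can be exercised with a stub.

```python
def roll_launch_template_to_ami(ec2, launch_template_id: str, ami_id: str) -> int:
    """Create a launch template version that uses ami_id and make it the default.

    ec2: a boto3 EC2 client, or any object exposing the same two methods.
    Returns the new launch template version number.
    """
    # Copy the latest version and override only the image ID.
    response = ec2.create_launch_template_version(
        LaunchTemplateId=launch_template_id,
        SourceVersion="$Latest",
        LaunchTemplateData={"ImageId": ami_id},
    )
    version = response["LaunchTemplateVersion"]["VersionNumber"]

    # Make the new version the default so an ASG tracking $Default picks it up.
    ec2.modify_launch_template(
        LaunchTemplateId=launch_template_id,
        DefaultVersion=str(version),
    )
    return version
```

Only nodes launched after the change would use the patched AMI; existing nodes are left alone, which fits the goal here of faster scale-out rather than in-place patching.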

Attachments

custom-cloud-init-output.log recommended-cloud-init-output.log

stevehipwell commented 2 years ago

@ssanders1449 why are you downloading patches on an optimised AMI where the combination of packages have been tested to make sure they work together? This doesn't seem to have any advantages and lots of downsides; AWS already releases new optimised AMIs when there are vulnerabilities.

Do you have any example where the response was too slow for an exploitable vulnerability?

ssanders1449 commented 2 years ago

> @ssanders1449 why are you downloading patches on an optimised AMI where the combination of packages have been tested to make sure they work together? This doesn't seem to have any advantages and lots of downsides; AWS already releases new optimised AMIs when there are vulnerabilities.
>
> Do you have any example where the response was too slow for an exploitable vulnerability?

I am not explicitly downloading patches; the Optimized AMI itself is downloading them. This is because of the following line in /etc/cloud/cloud.cfg

Removing this line from cloud.cfg significantly improves launch time. My suggestion is to remove this line from cloud.cfg and, instead of installing the patches at launch time, include them in updated AMIs.

stevehipwell commented 2 years ago

Thanks for the clarification @ssanders1449, this does seem like something that we should have control over. For Bottlerocket you can control if your AMI is upgraded or if you want to replace it.

Do you happen to have any numbers for packages installed from day 0 of an AL2 AMI release to when the next AMI is released?