Easiest/most straightforward way to "cache" some additional, custom Docker images into the AMI Build?

armenr commented 1 year ago

Question:

Is there an easy/simple way to feed a list of additional container images to the AMI builder or builder scripts?

Use-case:

Using something like Keda, we want to rapidly scale "worker" pods up and down, based on some message being published to a queue somewhere in redis or rabbit.

The constraint on our use-case:

We need those pods to scale up as quickly as possible, and our worker images are BIG
Unfortunately, there's no way around the size of the images: there's no practical path to making them smaller, because the codebase is legacy
A latency of ~30 seconds in queued job processing is acceptable, but with an 8GB Docker image to pull, it takes ~30s for the node to be ready to serve workloads, plus ~2-3 minutes to pull the massive worker image and create the container.

The solution I'm contemplating:

Using the EKS-AMI builder, as it is, totally vanilla, but changing only 2 things:

Increase the size of the root volume on the AMI (to accommodate the space needed by our gigantic images)
Pass an additional list of those images to the builder so that it can pull and cache them in this stretch of the builder script: https://github.com/awslabs/amazon-eks-ami/blob/e39d71f6832221409cd9990ad85e870f6d621698/scripts/install-worker.sh#L435

Bonus (kind of a feature request)

It would be really cool to give users/consumers of this repo some simple way to pass in a text file, or a comma-separated list, of additional images for the "Cache Images" section of the install-worker.sh script to pull 😬
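To make the request concrete, here is a rough sketch of how a comma-separated list could be consumed. The names `split_image_list`, `cache_additional_images`, and `PULL_CMD` are hypothetical and are not taken from install-worker.sh. By default the sketch only prints what it would pull; the real pull command (possibly something like `ctr --namespace k8s.io images pull`, plus registry auth for private repos) is an assumption left to the caller.

```shell
# Hypothetical sketch: none of these names exist in install-worker.sh.

# Split a comma-separated list into one image reference per line,
# trimming surrounding whitespace and dropping empty entries.
split_image_list() {
  printf '%s\n' "$1" \
    | tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Run PULL_CMD for each image in the list.
# PULL_CMD is an assumed override; when unset, this is a dry run that
# only prints "would-pull <image>".
cache_additional_images() {
  split_image_list "$1" | while IFS= read -r image; do
    ${PULL_CMD:-echo would-pull} "$image"
  done
}
```

Usage would be along the lines of `cache_additional_images "my-registry/worker:1.2, my-registry/sidecar:3.4"`, where both image references are placeholders.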

Environment:

AWS Region: NA
Instance Type(s): NA
EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): NA
Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): NA
AMI Version: NA
Kernel (e.g. uname -a): NA
Release information (run cat /etc/eks/release on a node): NA

Chili-Man commented 1 year ago

Hey @armenr, we ran into this issue as well. We actually ended up baking custom AMIs with container images pre-cached on them, but the performance gains were not realized because of how root EBS volumes are lazy loaded. Even though we booted new EC2 instances with the pre-cached container images, the parts of the root volume containing those images still had to be downloaded on demand, which took effectively the same amount of time (sometimes longer) as downloading the container image from scratch. See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initialize.html
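For reference, the linked page describes initializing a volume by reading its blocks once. A minimal sketch of the same idea scoped to a single directory follows; `warm_directory` is a hypothetical helper, and the directory where the container runtime keeps its image layers is an assumption that depends on the AMI. This only moves the download earlier in the boot; it is not established that it beats pulling from the registry.

```shell
# Hypothetical helper: read every regular file under a directory once so
# that its blocks are fetched onto the volume, then print how many files
# were read. The data itself is discarded.
warm_directory() {
  dir="$1"
  find "$dir" -type f -exec cat {} + > /dev/null
  find "$dir" -type f | wc -l | tr -d ' '
}
```

It would be called with whatever path holds the pre-cached layers, for example `warm_directory /var/lib/containerd` if that is where the runtime stores them on your AMI.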

armenr commented 1 year ago

@Chili-Man ^^ I went down this rabbit-hole 3 weeks ago, after posting here. You are right, and thanks for chiming in. :)

Just like you mentioned, I ran into the same issue. In some cases, certain images were pulled from ECR faster than the node could start them from the cached copies.

I was thinking about this: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-fast-snapshot-restore.html

^^ But it's too complicated and I'd like to keep whatever little hair there is left on my head.

I think, in this case, we're kinda stuck.

armenr commented 1 year ago

@Chili-Man - you think there's a way to put tarballs of images on EFS (or JuiceFS), and then mount that as a volume into the nodes where the pods run, and have that be where the EKS nodes look for/cache/store/use their container images from? 😈

bryantbiggs commented 1 year ago

I think what you might be looking for is https://github.com/awslabs/soci-snapshotter - at least, that's one possible option that does not require baking images into the AMI

bryantbiggs commented 1 year ago

xref:

armenr commented 1 year ago

@bryantbiggs - Thank you for sharing this. Sorry for sounding like a child, but could you explain how it solves the issue or fits in?

IF I've understood correctly - we customize the EKS AMI (ourselves, until this is implemented into the default image), ensure that the plugin is being used by containerd, and then launch nodes and see them pull and start images much faster, magically, since they're kinda "streaming" lazily from the ECR repo, AT time of creation?
