aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[Fargate/ECS] [Image caching]: provide image caching for Fargate. #696

Open · matthewcummings opened this issue 4 years ago

matthewcummings commented 4 years ago

EDIT: as @ronkorving mentioned, image caching is available for EC2-backed ECS. I've updated this request to be specifically for Fargate.

What do you want us to build? I've deployed scheduled Fargate tasks and been clobbered with high data transfer fees pulling down the image from ECR. Additionally, configuring a VPC endpoint for ECR is not for the faint of heart. The doc is a bit confusing.

It would be a big improvement if there were a resource (network/host) local to the instance where my containers run that could be used to load my Docker images.

Which service(s) is this request for? Fargate and ECR.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? I don't want to be charged for pulling a Docker image every time my scheduled Fargate task runs. On that note the VPC endpoint doc should be better too.

Are you currently working around this issue? This was for a personal project; instead, I deployed an EC2 instance running a cron job, which is not my preference. I would prefer to use Docker and the ECS/Fargate ecosystem.
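For context on the endpoint setup: pulls that leave the VPC through a NAT gateway are what generate these data transfer fees, and keeping them inside the VPC takes three endpoints. A minimal sketch in TypeScript with aws-cdk-lib v2 (construct IDs and VPC sizing are illustrative, not from this thread):

```ts
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

export class EcrEndpointsStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });

    // ECR API endpoint (auth tokens, repository calls).
    vpc.addInterfaceEndpoint('EcrApi', {
      service: ec2.InterfaceVpcEndpointAwsService.ECR,
    });
    // Docker registry endpoint (image manifests).
    vpc.addInterfaceEndpoint('EcrDocker', {
      service: ec2.InterfaceVpcEndpointAwsService.ECR_DOCKER,
    });
    // ECR stores image layers in S3, so pulls also need an S3 gateway endpoint.
    vpc.addGatewayEndpoint('S3', {
      service: ec2.GatewayVpcEndpointAwsService.S3,
    });
    // Needed if tasks ship logs to CloudWatch via the awslogs driver.
    vpc.addInterfaceEndpoint('Logs', {
      service: ec2.InterfaceVpcEndpointAwsService.CLOUDWATCH_LOGS,
    });
  }
}
```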

jtoberon commented 4 years ago

@matthewcummings can you clarify which doc you're talking about ("The doc is horrific")? Can you also clarify which regions your Fargate tasks and your ECR images are in?

matthewcummings commented 4 years ago

@jtoberon can we have these kinds of things in every region? I generally use us-east-1 and us-west-2 these days.

matthewcummings commented 4 years ago

The doc seems better now: https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html. It has been updated, from what I can see.

However, it still feels like a leaky abstraction. I'd argue that I shouldn't need to know/think about S3 here. Nowhere else in the ECS/EKS/ECR ecosystem do we really see mention of S3.

It would be great if the S3 details could be "abstracted away".

jtoberon commented 4 years ago

Regarding regions, I'm really asking whether you're doing cross-region pulls.

You're right: this is a leaky abstraction. The client (e.g. docker) doesn't care, but from a networking perspective you need to poke a hole to S3 right now.

Regarding making all of this easier, we plan to build cross-region replication, and we plan to simplify the registry URL so that you don't have to think as much about which region you're pulling from. https://github.com/aws/containers-roadmap/issues/140 has more details and some discussion.
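The cross-region replication mentioned here later shipped as an ECR registry feature. A minimal sketch using the CDK L1 construct for the AWS::ECR::ReplicationConfiguration resource (the destination region is illustrative):

```ts
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ecr from 'aws-cdk-lib/aws-ecr';

export class EcrReplicationStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Registry-wide rule: replicate every pushed image to us-west-2 so tasks
    // there pull a regional copy instead of doing a cross-region pull.
    new ecr.CfnReplicationConfiguration(this, 'Replication', {
      replicationConfiguration: {
        rules: [
          { destinations: [{ region: 'us-west-2', registryId: this.account }] },
        ],
      },
    });
  }
}
```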

matthewcummings commented 4 years ago

Ha ha, thanks. Excuse my snarkiness... I'm not doing cross-region pulls right now, but that is something I may need to do. Thank you!

matthewcummings commented 4 years ago

@jtoberon your call on whether this should be a separate request or folded into the other one.

ronkorving commented 4 years ago

Wait, aren't you really asking for ECS_IMAGE_PULL_BEHAVIOR control?

This was added (it seems) to ECS EC2 in 2018: https://aws.amazon.com/about-aws/whats-new/2018/05/amazon-ecs-adds-options-to-speed-up-container-launch-times/

Agent config docs.

I get the impression Fargate does not give control over that, and does not have it set to prefer-cached or once. This is what we really need, isn't it?
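On EC2-backed ECS this is a one-line agent setting in /etc/ecs/ecs.config on the container instance. A minimal sketch of wiring it up, in TypeScript with aws-cdk-lib v2 (instance type and sizing are placeholders):

```ts
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';

export class CachedPullStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });
    const cluster = new ecs.Cluster(this, 'Cluster', { vpc });

    // EC2 capacity backing the cluster; this knob does not exist on Fargate.
    const asg = new autoscaling.AutoScalingGroup(this, 'Asg', {
      vpc,
      instanceType: new ec2.InstanceType('t3.medium'),
      machineImage: ecs.EcsOptimizedImage.amazonLinux2(),
    });

    // Tell the ECS agent to reuse a locally cached image when one is present.
    asg.addUserData(
      'echo "ECS_IMAGE_PULL_BEHAVIOR=prefer-cached" >> /etc/ecs/ecs.config',
    );

    cluster.addAsgCapacityProvider(
      new ecs.AsgCapacityProvider(this, 'Capacity', { autoScalingGroup: asg }),
    );
  }
}
```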

matthewcummings commented 4 years ago

@ronkorving yes, that's exactly what I've requested. I wasn't aware of the ECS/EC2 feature... thanks for pointing me to it. However, a Fargate option would be great. I'm going to update the request.

koxon commented 4 years ago

Much needed indeed, this caching option for Fargate.

rametta commented 4 years ago

I would like to upvote this feature too. I'm using Fargate at work and our images are ~1GB; it takes a very long time to start the task because it has to re-download the image from ECR every time. If there were some way to cache the image, as is possible for ECS on EC2, it would be extremely beneficial.

andrestone commented 4 years ago

How's this evolving?

There are many use cases where what you need is just a Lambda with unrestricted access to a kernel/filesystem. Fargate with cached/hot images fits this use case perfectly.

fitzn commented 4 years ago

@jtoberon @samuelkarp I realize that this is a more involved feature to build than it was on ECS with EC2, since the instances underneath change across AWS accounts, but are you able to provide any timeline on if and when image caching would be available in Fargate? Lambda eventually fixed the same cold-start issue with its short-term cache. This request is for the direct analog in Fargate.

Our use case: we run containers on-demand when our customers initiate an action and connect them to the container that we spin up. So, it's a real-time use case. Right now, we run these containers on ECS with EC2 and the launch times are perfectly acceptable (~1-3 seconds) because we cache the image on the EC2 box with PULL_BEHAVIOR.

We'd really like to move to Fargate but our testing shows our Fargate containers spend ~70 seconds in the PENDING state before moving to the RUNNING state. ECR reports our container at just under 900MB. Both ECR and the ECS cluster are in the same region, us-east-1.

We have to make some investments in the area soon so I am trying to get a sense for how much we should invest into optimizing our current EC2-based setup because we absolutely want to move to Fargate as soon as this cold start issue is resolved. As always, thank you for your communication.

Brother-Andy commented 4 years ago

I wish Fargate could have some sort of caching. Due to a missing environment variable, my task kept failing all weekend, and every restart meant a new image was downloaded from Docker Hub. In the end I faced horrible traffic usage, since Fargate was deployed within a private VPC. Of course there are endpoints (Fargate requires both ECR and S3 endpoints, as I understand it), but some sort of caching would still be a much cheaper and more predictable option.

pgarbe commented 4 years ago

@Brother-Andy For this use case, I built cdk-ecr-sync, which syncs specific images from Docker Hub to ECR. It doesn't solve the caching part but might reduce your bill.

pyx7b commented 4 years ago

Ditto on the feature. We use containers to spin up cyber ranges for students. Usage can fluctuate from zero to thousands; Fargate is the best solution for ease of management, but launch time is a challenge even with ECR. Caching is a much-needed feature.

narzero commented 4 years ago

+1

klatu201 commented 4 years ago

+1

rouralberto commented 4 years ago

Same here: I need to run multiple Fargate tasks cross-region, and it takes around a minute to pull the image. Once pulled, the task takes only 4 seconds to run. This completely stops us from using Fargate.

nmqanh commented 4 years ago

We had the same problem: the Fargate task should take only 10 seconds to run, but it takes about a minute to pull the image :(

congthang1 commented 4 years ago

Is it possible to use an EFS file system to store the image and have the task run that image? Or is that just the same problem of pulling the image, this time from EFS to the host running the container?

amunhoz commented 4 years ago

Azure is solving this problem on their platform: https://stevelasker.blog/2019/10/29/azure-container-registry-teleportation/

nakulpathak3 commented 3 years ago

+1. We run a very large number of tasks with a 1GB image. This would significantly speed up our deploys and would be a super helpful feature. We're considering moving to EC2 due to Fargate deployment slowness, and this is one of the factors.

MattBred commented 3 years ago

Currently using the GitLab Runner Fargate driver, which is great, except for the spin-up time: ~1-2 minutes for our image (>1GB) because it has to pull it from ECR for every job. Not super great.

Would really like to see some sort of image caching.

alicancakil commented 3 years ago

I have 1GB containers with no way of reducing their size. They take a very long time to start up on Fargate.

We really need a caching feature.

SunnyGurnani commented 3 years ago

+1 on this, we really need this feature.

ronkorving commented 3 years ago

The amount of time wasted by this not being a thing is no doubt staggering, and it continues to grow as AWS does not address this.

AWS, we could really do with some communication here. I thought that was the point of this repo.

djerraballi commented 3 years ago

This is one of those weird cases where we are paying for poor performance: bandwidth usage plus a 3-minute image pull on every restart/deploy.

mlanner-aws commented 3 years ago

We have work in progress on image pull performance, in particular for images stored in ECR. In the meantime, our metrics and performance testing are showing more consistent image pull performance with platform version 1.4 compared to platform version 1.3, especially at p90 and above.

When it comes to image caching specifically, could you expand a little on what you would like to see? For example, how would you like to control which images should be cached?

ronkorving commented 3 years ago

@mlanner-aws Personally, I just want to see quick boot-up times in Fargate (which are currently overshadowed by image pull time). I don't have much desire to control that, though that may be different for other people on this thread. I just want it to be fast by default.

nakulpathak3 commented 3 years ago

TL;DR: the ability to set ECS_IMAGE_PULL_BEHAVIOR to prefer-cached in Fargate. Right now it is effectively fixed at always by a design limitation of Fargate, and we want a workaround.

> When it comes to image caching specifically, could you expand a little on what you would like to see?

@mlanner-aws, the expectation I had in mind was that we essentially get EC2-like caching, where there is perhaps some common cache that Fargate tasks already have access to, so that when they download an image from ECR or elsewhere, they only download the Docker layers that have changed since the previous image.

> For example, how would you like to control which images should be cached?

I think any image a task uses (or at least the largest, to begin with) would get the above functionality, where a pull creates a local cache of the image for future pulls. If the image gets completely invalidated by a very early Docker layer change and takes a long time, that's expected and would be the same on EC2 as well.

As @amunhoz pointed out above, Azure has been able to implement this (https://stevelasker.blog/2019/10/29/azure-container-registry-teleportation/).

fbove commented 3 years ago

In our case, it is pretty similar to what @fitzn described:

> Our use case: we run containers on-demand when our customers initiate an action and connect them to the container that we spin up. So, it's a real-time use case. Right now, we run these containers on ECS with EC2 and the launch times are perfectly acceptable (~1-3 seconds) because we cache the image on the EC2 box with PULL_BEHAVIOR.

> We'd really like to move to Fargate but our testing shows our Fargate containers spend ~70 seconds in the PENDING state before moving to the RUNNING state. ECR reports our container at just under 900MB. Both ECR and the ECS cluster are in the same region, us-east-1.

As a workaround, instead of using Fargate we currently launch an EC2 instance from an AMI that has Docker installed and our Docker image already baked in. EC2 startup is a bit faster than waiting for a Fargate container to start, because no image download is needed.

mreferre commented 3 years ago

@nakulpathak3 the problem is more complex than setting ECS_IMAGE_PULL_BEHAVIOR because, as you noted, the instances backing the tasks are recycled with the task, so caching won't apply here. Altering this behavior would have deep ramifications for how Fargate works. We are exploring decoupling the lifecycle of the instances backing the tasks from the storage they use to host the images in order to achieve this, but there are some mechanics that need to be considered for that to work properly. We hear you loud and clear, and we would like to solve this asap.

crnkovic commented 3 years ago

Seeing as many people have the same issue of images taking too long to pull, let's talk about the strategies you use to reduce this time, at least until AWS addresses it.

I've tried to slim down the image as much as I can, but are there any networking tips I can use to make the download faster? I have a very basic VPC with a single public subnet, no inbound security group rules, and an attached internet gateway. I don't want any inbound access, only outbound, so I found this setup to be okay for me.

I've also tried storing the image in ECR instead of Docker Hub, but this did not reduce the download time for me. It takes around 55 seconds to pull the image (PENDING -> RUNNING) from Docker Hub, which reports a 300 MB compressed size.

What are some tricks you guys use to reduce the download time?

darrenweiner commented 3 years ago

Fargate caching would really round out the Fargate offering: the Fargate capacity provider is amazing, but the lack of caching really cuts into the responsiveness of the CP, and the significantly increased pipeline deployment times are a disincentive for a number of my clients to fully adopt Fargate.

TopherGopher commented 3 years ago

Big +1 to this: we heavily use Fargate, and it's a somewhat embarrassing experience to wait two minutes for a five-second script to run. We've tried Lambda, but due to memory (and time) constraints, we were unable to stick with it.

We would LOVE any caching that could be done.

kristianpaul commented 3 years ago

Caching is definitely something I would like to have and to be able to choose, but not manage. I'd love to use it in Fargate instead of figuring out how Fargate compares with ECS on EC2, or having to deploy a custom container solution.

brightshine1111 commented 3 years ago

This issue is made even more salient by Docker implementing the Hub pull rate limits for anonymous and free-tier users. That alone pretty much makes caching with Fargate essential now.

markmelville commented 3 years ago

> When it comes to image caching specifically, could you expand a little on what you would like to see? For example, how would you like to control which images should be cached?

@mlanner-aws Sure. We can't expect it to cache every image out there, right? I imagine it would be configurable per cluster. First of all, any images from currently running tasks should be in the cache and ready to use if the service auto-scales. Next, allow the cluster to specify an ECR repo to be watched for new pushes, and eagerly cache them: if a new image is pushed, it's likely there will soon be a request to launch a task with that image.
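Purely to illustrate the shape of this proposal, a hypothetical per-cluster cache configuration sketched in TypeScript. None of these fields exist in ECS today; every name below is invented:

```ts
// Hypothetical API: nothing in this block exists in ECS today.
interface ClusterImageCacheConfig {
  cacheRunningTaskImages: boolean; // keep images of running tasks warm for auto-scaling
  watchRepositories: string[];     // ECR repositories to watch for new pushes
  prefetchOnPush: boolean;         // eagerly pull a new image as soon as it is pushed
}

export const cacheConfig: ClusterImageCacheConfig = {
  cacheRunningTaskImages: true,
  watchRepositories: ['my-app'],   // placeholder repository name
  prefetchOnPush: true,
};
```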

pgarbe commented 3 years ago

> When it comes to image caching specifically, could you expand a little on what you would like to see? For example, how would you like to control which images should be cached?

@mlanner-aws Actually, I don't want to spend much time configuring that cache. Based on the registered task definitions and recently launched tasks, ECS/Fargate should be smart enough to cache the right images.

rafaljanicki commented 3 years ago

I could even see a parameter on the task definition, similar to required capabilities. But caching would be awesome, especially since Fargate already adds overhead to launch time due to the awsvpc network mode, compared to bridge mode on "classic" ECS.

rjpereira commented 3 years ago

I had a case of a task that never became stable, and I failed to notice it for a month, only to end the month with 15TB of transfers of the exact same image (compounded with TGW and PrivateLink traffic costs due to the network design). While I can understand the origin of the problem with Fargate, at the very minimum, relaunching an image that is being drained should use a cache. Not having to re-download layers that are already running on the cluster (or ran up to x minutes ago) deserves an optimized solution.

TopherGopher commented 3 years ago

Regarding the behavior we would like to see for how caching is handled, I LOVE the idea @markmelville specified, where any ECR repository could be configured to be watched cluster-wide from ECS. Whether that's an automatic watch based on recently used images or a manual configuration, it would rock. One note: we update the same tags over and over (e.g. dev and released), so we would want ECR actions to be the trigger that informs ECS to update the local cache (as opposed to ECS long-polling the repository).

Under the covers, I could see creating an FSx for Lustre filesystem, storing the Docker cache there, and then attaching it dynamically to whatever node(s) are running your task.

pbassut commented 3 years ago

To reduce my bill, I set up a container to run several times a day and then shut down, saving vCPU and memory costs, but now I pay data transfer costs in return. I suppose I'll switch to ECS on EC2 so I can set ECS_IMAGE_PULL_BEHAVIOR, until we have at least a workaround.

matthewhegarty commented 3 years ago

I see now that AWS Batch supports Fargate, which is really promising. However, in testing, each job takes ~60s to start, so presumably it pulls the image each time and therefore has the same limitations described in this thread. I also wonder: if I send hundreds of jobs, will they each pull the image every time?

Can anyone on the AWS team confirm this? I cannot find much documentation describing in depth how this works, so any guidance would be much appreciated.

ngodinhnhien commented 3 years ago

Huge +1 to this. We have 4 images of 1.5GB each (and they promise to get bigger). Currently we have >50 tasks using these images every day, and it costs a pretty penny.

So, really hoping for caching.

subodhiitr commented 3 years ago

+1 to this. This causes a lot of delay in executing tasks, and if we have many tasks that need to execute sequentially, the time adds up.

amunhoz commented 3 years ago

Just remember that Azure is trying to solve this... and AWS is falling behind: https://channel9.msdn.com/Shows/Azure-Friday/How-to-expedite-container-startup-with-Project-Teleport-and-Azure-Container-Registry

You can apply for the preview in this link: https://aka.ms/azfr/588/02

ghost commented 3 years ago

This would be a great enhancement to the platform. Currently, running a 4 CPU / 30GB container for a month is cheaper than running a scheduled task every 5 minutes for a couple of seconds.
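To put rough numbers on that comparison (using illustrative us-east-1 list prices, which vary by region and over time): a 4 vCPU / 30GB Fargate task running continuously costs about 4 × $0.04048 + 30 × $0.004445 ≈ $0.30 per hour, roughly $215 per month. A task scheduled every 5 minutes runs 288 times a day; if each run pulls a ~0.9GB image through a NAT gateway at ~$0.045/GB of processing, that alone is about 288 × 0.9GB × 30 days × $0.045 ≈ $350 per month, before any compute charges.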

williamcodes commented 3 years ago

@jtoberon any chance of this getting implemented? We're hesitant to adopt Fargate at all because of the slow boot due to a Docker pull with no cache.

jtoberon commented 3 years ago

@omieomye ^^^