aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] [request]: Improve Cloudformation deploy of ECS service #538

Open ghost opened 5 years ago

ghost commented 5 years ago

Tell us about your request Deploying an ECS service with CloudFormation is very slow today and very limited in terms of strategy options. It would be great to be able to configure more: how should the new task definition be rolled out? What are the requirements for a task to count as Running and okay to move on from? What should the timeouts be? When updating multiple services at once, the messages from CloudFormation seem to imply that they are all being updated simultaneously, but when you look at what is actually happening it looks more like they are only updated one at a time. Either way, it would be great to be able to configure the desired behaviour here.
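For concreteness, the deployment knobs the service resource exposes today are roughly the ones below. This is only a sketch, the resource names are made up, and the circuit breaker properties are newer additions that may not be available everywhere:

```yaml
# Sketch of the deployment settings currently configurable on an ECS service
# in CloudFormation. Resource names are hypothetical.
MyService:
  Type: AWS::ECS::Service
  Properties:
    Cluster: !Ref MyCluster
    TaskDefinition: !Ref MyTaskDefinition
    DeploymentConfiguration:
      MaximumPercent: 200          # how far above the desired count a rollout may go
      MinimumHealthyPercent: 50    # how far below the desired count it may drop
      DeploymentCircuitBreaker:    # newer: roll back automatically if tasks never stabilize
        Enable: true
        Rollback: true
```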

In terms of speed, today when using ECS services we go to great lengths to avoid having to replace the image in the task definition, because if the image has to be replaced the deploy takes ages. We've found that if only the environment of our task definitions changes, our deploy (to two different ECS clusters, 6 services each) takes around 13 minutes (this includes testing of code, but the majority of the time is spent updating the stacks). If the image needs to be replaced, that time increases to 20 minutes. Since we already think 13 minutes is way too long, 20 is just crazy. We are storing the images in ECR in the same region as the cluster, so this shouldn't need to take this long.

It would also be great if, like with Auto Scaling Groups, you didn't need to specify the DesiredCount, so you could scale it dynamically from somewhere else and not have to worry about deploys resetting the count to whatever is written in the template (see the sketch below).
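Something along these lines is what I mean: the service leaves DesiredCount out entirely and Application Auto Scaling owns the count. Names and capacities here are made up:

```yaml
# Sketch: let Application Auto Scaling own the count instead of the template.
# DesiredCount is deliberately omitted from the service. Names are hypothetical.
MyService:
  Type: AWS::ECS::Service
  Properties:
    Cluster: !Ref MyCluster
    TaskDefinition: !Ref MyTaskDefinition
    # no DesiredCount here, so stack updates don't reset the live count

MyScalableTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: ecs
    ScalableDimension: ecs:service:DesiredCount
    ResourceId: !Sub service/${MyCluster}/${MyService.Name}
    MinCapacity: 2
    MaxCapacity: 10
    # RoleARN omitted; the ECS service-linked role is typically used
```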

Which service(s) is this request for? Fargate, ECS, ECR

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? We would like a simple, fast, and configurable way of deploying without having to abandon our goal of keeping all infrastructure as code. I feel I've described the problem areas in the previous section.

Are you currently working around this issue? As I said before, we are currently going to great lengths to avoid having to replace images in the task definitions. This doesn't really solve all the issues, but it mitigates some of the pain. What we do is keep a Docker image with only our application's dependencies installed, plus a shell script that downloads the actual application from S3 based on an environment variable. That way we only need to update the image when the dependencies change; when only our own code changes, we just update an environment variable.
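Roughly, the task definition ends up carrying only a pointer to the build artifact; the entrypoint script baked into the image downloads and starts it. A sketch of the relevant container definition (all names, images, and paths are made up):

```yaml
# Sketch of the workaround: the image holds only dependencies plus an entrypoint
# script that fetches the application from S3 at startup. On a normal deploy
# only the environment variable changes, not the image. Names are hypothetical.
MyTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: my-service
    ContainerDefinitions:
      - Name: app
        Image: !Sub "${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/my-deps-image:stable"
        Environment:
          - Name: APP_ARTIFACT_S3_URI
            Value: !Ref AppArtifactS3Uri   # e.g. s3://my-bucket/builds/1234.tar.gz
```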

Additional context We deploy each service to production several times a day, so it would really be great if the deploys could be sped up. We've found that tweaking the health check settings can speed up deploys immensely when deploying ECS services behind ALBs. It's still not fast, but more configurability like that, plus general speedups, is what we are looking for. We've also done the basics of setting a maximum percent of 200 and a minimum healthy percent of 50, but it still isn't very fast. It definitely feels like just running the CLI commands yourself is faster (and then you also don't have to wait for all the tasks to be updated; incidentally, being able to opt out of that wait would be a great option). The health check tweaks we mean are sketched below.
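The values here are illustrative, not recommendations; tune them for your own workloads (names are made up):

```yaml
# Illustrative health-check settings that shorten the time until new targets
# count as healthy, plus the grace period on the service. Names are hypothetical.
MyTargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    Port: 8080
    Protocol: HTTP
    VpcId: !Ref MyVpc
    HealthCheckIntervalSeconds: 10   # check more often so targets turn healthy sooner
    HealthyThresholdCount: 2         # fewer consecutive successes required

MyService:
  Type: AWS::ECS::Service
  Properties:
    Cluster: !Ref MyCluster
    TaskDefinition: !Ref MyTaskDefinition
    HealthCheckGracePeriodSeconds: 30   # don't kill tasks that are still starting up
    DeploymentConfiguration:
      MaximumPercent: 200
      MinimumHealthyPercent: 50
    LoadBalancers:
      - ContainerName: app
        ContainerPort: 8080
        TargetGroupArn: !Ref MyTargetGroup
```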

whereisaaron commented 5 years ago

@ulfunnel if your Services are tied to load balancers, check that the drain time (deregistration delay) is as short as is reasonable, so old Tasks can end quickly when you are updating the image. Something like the snippet below.
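A sketch of what that looks like on the target group; the default is 300 seconds, and the value here is just an example (names are made up):

```yaml
# Sketch: shorten the deregistration delay (connection drain) so old tasks
# stop quickly during a rolling update. Resource names are hypothetical.
MyTargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    Port: 8080
    Protocol: HTTP
    VpcId: !Ref MyVpc
    TargetGroupAttributes:
      - Key: deregistration_delay.timeout_seconds
        Value: "10"
```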

I agree that compared to EKS, ECS/Fargate Service updates are deadly slow, and even slower with CloudFormation.

ghost commented 5 years ago

@whereisaaron Yes, I know about the load balancer tweaks. Even though they give massive speedups, the updates are still very slow.

mikelhamer commented 3 years ago

Bump...

pathcl commented 1 year ago

bump