aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS/Fargate] request: Improve Fargate Node Startup Time #649

Open lilley2412 opened 4 years ago

lilley2412 commented 4 years ago


Tell us about your request Improve the startup time of Fargate nodes on EKS

Which service(s) is this request for? EKS Fargate

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? I haven't done extensive benchmarks yet, but anecdotal "kicking the tires" of Fargate for EKS shows 30-45 seconds for a node to be created and registered; because it's node-per-pod, I then have to wait for image pull and container start time, so in total it's taking over a minute to start a new pod.

This is problematic for obvious reasons. For some use cases, like HPA-scaled deployments, it's not a show-stopper and I'm OK with the startup time. For others, like a CI cluster for GitLab, the startup time is painful; each CI job spawns a new pod, which takes "forever".

Are you currently working around this issue? Currently just eating the startup time.
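
For reference, rough numbers like the ones above can be pulled straight from a pod's conditions. Here's a sketch using the Python kubernetes client (the pod name and namespace are placeholders, and treating PodScheduled as the end of Fargate node provisioning is an assumption on my part, not an official metric):

```python
# Rough sketch: derive EKS/Fargate pod startup latency from pod conditions
# using the Python kubernetes client. Pod name and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="my-fargate-pod", namespace="default")

created = pod.metadata.creation_timestamp
conditions = {c.type: c.last_transition_time for c in (pod.status.conditions or [])}

# Assumption: on EKS/Fargate the pod is only bound (PodScheduled=True) once its
# dedicated Fargate node has been provisioned and registered, so this delta
# roughly captures the node-provisioning part of the delay.
scheduled = conditions.get("PodScheduled")
ready = conditions.get("Ready")

if scheduled:
    print(f"created -> scheduled (node provisioning): {(scheduled - created).total_seconds():.0f}s")
if ready:
    print(f"created -> ready (incl. image pull + start): {(ready - created).total_seconds():.0f}s")
```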

unthought commented 4 years ago

Same use case. We'd love to migrate some GitHub Actions workloads to EKS/Fargate for ease of integration, but the current pod boot time is a showstopper. In our case it's even more pronounced because we'd use Argo Workflows, which launches a pod per workflow step. GitHub Actions VMs launch incredibly fast; we even had to abandon AWS CodeBuild because of its VM launch delay, especially on medium and large instances.

jgoeres commented 4 years ago

Same here - we are trying to put Spark jobs on EKS/Fargate. For long-running Spark jobs this is not a big deal, but to streamline our stack we also run many shorter Spark jobs, for which this is effectively unacceptable.

booleanbetrayal commented 4 years ago

Is this an issue with AWS not having the appropriate EC2 capacity ready and warmed ahead of time for node spooling? Certainly feels like it could be an EC2 initialization happening behind the scenes during each and every pod deploy.

mreferre commented 4 years ago

@booleanbetrayal this is not about having EC2 instances up and running beforehand. There is a lot of mechanics happening behind the scenes that adds up (e.g. connecting the ENI to the customer VPC, etc.). Also, Kubernetes doesn't really have a "serverless" mode of operation, so the instance we use needs to be initialized with the Kubernetes client code, it needs to virtually connect to the cluster and show up as a fully initialized worker node, and only then can the pod be scheduled on it. So while we try to make the user experience as "serverless-ly" as possible, the mechanics behind the scenes are more complex than taking a running EC2 instance and deploying the container onto it. We do appreciate that for a long-running task this initialization isn't a big deal, but for short-running tasks it adds up. We are working hard to reduce the time it takes for these steps to execute and/or to remove some of these steps (where possible).

booleanbetrayal commented 4 years ago

Thanks for the clarification @mreferre !

spicysomtam commented 4 years ago

It seems very inefficient. 1 pod = 1 Fargate node? I was just testing this and it takes 60s in eu-west-1 to find a Fargate node to run a pod on before it starts running. If you are used to normal pod spin-up on k8s, it's a second or less, depending on whether the node has the image cached. Would it not be better to implement a virtual kubelet for a node, which could then spin up lots of pods and thus run much quicker? Or am I missing something and doing something terribly wrong? I did get caught out by the Fargate private subnets needing a NAT route out to the internet to pull images. I'll continue my testing to see if I can get Fargate running faster!

mreferre commented 4 years ago

@spicysomtam I think 50/60 seconds is just about right and you won't be able to reduce it substantially (it also depends on the image size, but even tiny images won't take less than 45 seconds because of everything above). We are working to reduce this timing over time. If all you need is fast single-pod startup time, then EKS managed node groups are a great solution. Fargate's value isn't in single-pod start time but rather in the additional security that running pods in dedicated OS kernels brings, and the fact that you no longer have to manage/scale/life-cycle nodes (you can read more here). A couple of years ago we did look into the Virtual Kubelet project to run Fargate pods, but 1) it wouldn't have changed pod startup time, given that all the Virtual Kubelet does is proxy requests to a backend (Fargate in this case), so the timing experience would have been similar, and 2) the Virtual Kubelet is a (heavy) fork of the Kubelet and we did not want to go down that path.

spicysomtam commented 4 years ago

I did testing with node groups and Cluster Autoscaler (CA) and the pod spin-up times are what I am used to with k8s (most of my experience is with OpenShift and Rancher). The only slight issue, which isn't AWS-related, is the spin-up time for new nodes via the CA when pods are stuck in Pending, but you can work around that with placeholder pods at a lower pod priority; k8s kills these and replaces them with your pending real pods, and in the background a new node is spun up (sketched below).
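
Roughly, the placeholder-pod / over-provisioning pattern looks like this (sketched with the Python kubernetes client; the names, sizes, and replica count are placeholders, not a prescription):

```python
# Sketch of the placeholder-pod / over-provisioning pattern for Cluster Autoscaler
# on EC2 node groups. Names, sizes, and replica count are illustrative only.
from kubernetes import client, config

config.load_kube_config()

# 1. A negative-priority class so that real workloads (default priority 0)
#    always preempt the placeholders.
client.SchedulingV1Api().create_priority_class(
    client.V1PriorityClass(
        api_version="scheduling.k8s.io/v1",
        kind="PriorityClass",
        metadata=client.V1ObjectMeta(name="overprovisioning"),
        value=-1,
        global_default=False,
        description="Placeholder pods that reserve spare node capacity",
    )
)

# 2. A deployment of pause pods that merely reserve CPU/memory. When a real pod
#    goes Pending, the scheduler evicts a placeholder to make room, and Cluster
#    Autoscaler brings up a replacement node in the background.
pause = client.V1Container(
    name="pause",
    image="registry.k8s.io/pause:3.9",
    resources=client.V1ResourceRequirements(requests={"cpu": "500m", "memory": "512Mi"}),
)
deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="overprovisioning"),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels={"app": "overprovisioning"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "overprovisioning"}),
            spec=client.V1PodSpec(priority_class_name="overprovisioning", containers=[pause]),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

Note that this only helps on node-group-backed capacity; as pointed out further down the thread, Fargate overrides pod priorities, so placeholder pods don't buy you anything there.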

mreferre commented 4 years ago

Sure. If you want to over-index on optimizing pod startup times and you are not constrained by the additional cost of idle resources, that is the right thing to do. With Fargate we aim to remove a lot of the undifferentiated heavy lifting associated with managing the underlying infrastructure but, depending on your use case and objectives, it may not (yet) be the right service for you. We are working on this.

nyouna commented 3 years ago

We are experiencing the same issue with Fargate on ECS... It doesn't seem to be exclusively related to EKS.

nyouna commented 3 years ago

@mikestef9 can you please add the ECS label?

nalbury commented 3 years ago

Just tried using pod priority to work around this for GitLab runners. Unfortunately, it looks like Fargate overrides all pod priority settings with system-node-critical, so unless I'm missing something, this isn't a viable workaround.

Understandable, the idea of spare/idle capacity in a serverless environment doesn't really make sense in hindsight.

Not blaming Fargate here, it's still a great tool for other workloads, just hoping to save other folks some time.

mreferre commented 3 years ago

Just tried using pod priority to work around this for GitLab runners. Unfortunately, it looks like Fargate overrides all pod priority settings with system-node-critical, so unless I'm missing something, this isn't a viable workaround.

Understandable, the idea of spare/idle capacity in a serverless environment doesn't really make sense in hindsight.

Not blaming Fargate here, it's still a great tool for other workloads, just hoping to save other folks some time.

Yes. Pod priorities make sense when multiple pods need to compete for the same node resources (the priority is used to disambiguate which pods are more important than others and hence which should "win"). In the context of Fargate this is not a problem because the "node" is just a "second class citizen" (or right-sized dedicated capacity used to fulfill each pod's requests in a 1:1 setup).

almson commented 3 years ago

Fargate is a bad solution. The pod-per-vm concept kills the point of containers.

In my opinion the ideal solution would be something closer to cluster-autoscaler which ran in the control plane (so, didn't have to be installed and worked when there's no nodes) and launched instances directly so that one didn't have the added complexity of node groups. I think that besides lower user-facing complexity, getting rid of ASGs would probably simplify the implementation logic a great deal too.

mreferre commented 3 years ago

Fargate is a bad solution. The pod-per-vm concept kills the point of containers.

In my opinion the ideal solution would be something closer to cluster-autoscaler which ran in the control plane (so, didn't have to be installed and worked when there's no nodes) and launched instances directly so that one didn't have the added complexity of node groups. I think that besides lower user-facing complexity, getting rid of ASGs would probably simplify the implementation logic a great deal too.

I think there is a need for both. To protect against (relatively frequent) container escapes, the notion of a dedicated VM/microVM per pod is gaining A LOT of traction in the industry (especially in its highly regulated segments). Then, for those situations where this is not required and/or there is a desire for a more classic "cluster of multi-tenant nodes", everything you said makes a lot of sense and there is work being done to address it.

almson commented 3 years ago

Fargate is a bad solution. The pod-per-vm concept kills the point of containers. In my opinion the ideal solution would be something closer to cluster-autoscaler which ran in the control plane (so, didn't have to be installed and worked when there's no nodes) and launched instances directly so that one didn't have the added complexity of node groups. I think that besides lower user-facing complexity, getting rid of ASGs would probably simplify the implementation logic a great deal too.

I think there is a need for both. To protect against (relatively frequent) container escapes, the notion of a dedicated VM/microVM per pod is gaining A LOT of traction in the industry (especially in its highly regulated segments). Then, for those situations where this is not required and/or there is a desire for a more classic "cluster of multi-tenant nodes", everything you said makes a lot of sense and there is work being done to address it.

The description of Fargate makes clear its aim:

AWS Fargate is a technology that provides on-demand, right-sized compute capacity for containers. With AWS Fargate, you don't have to provision, configure, or scale groups of virtual machines on your own to run containers.

If what you're interested in is isolation, then firecracker-containerd is a better approach that's agnostic to how your nodes are scaled. Fargate is not a microVM. It's a full VM that runs the kubelet, etc., and has more attack surface than a microVM hosting a pod. It's also less efficient, takes longer to launch, etc.

mreferre commented 3 years ago

If what you're interested in is isolation, then firecracker-containerd is a better approach that's agnostic to how your nodes are scaled. Fargate is not a microVM. It's a full VM that runs the kubelet, etc., and has more attack surface than a microVM hosting a pod. It's also less efficient, takes longer to launch, etc.

Yes. The Kubelet is part of the VM/microVM and that is what makes the cluster the security boundary as described in the docs. Using a microVM to shield the pod only (leaving the kubelet outside) is an alternative, but the key point here is that it won't solve the problem that you'd need to provision a node on the fly and connect it to the cluster when the user deploys the pod (if we want to stick to the "pay per pod" model).

almson commented 3 years ago

the key point here is that it won't solve the problem that you'd need to provision a node on the fly and connect it to the cluster when the user deploys the pod (if we want to stick to the "pay per pod" model).

Who wants to stick to the pay-per-pod model? A typical user would lose a lot of money on Fargate because each node has to be sized to a pod's resource limit (ie, overprovisioned for peak load), while a multi-tenant node is sized to the sum of the pods' resource requests (which is typically a lot smaller) and each pod can "burst" as much as it wants. You also lose money on the coarse granularity of Fargate resource requests.
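
As a toy illustration of that sizing argument (all numbers below are made up; only the ratio matters):

```python
# Toy numbers only - the point is the ratio, not the absolute values.
pods = 20
request_vcpu = 0.25   # steady-state need per pod
peak_vcpu = 1.0       # what each pod needs at its burst peak

# Pod-per-node (Fargate-style): each pod must be sized for its own peak,
# because there is no shared headroom on the node to burst into.
per_pod_nodes_vcpu = pods * peak_vcpu            # 20 vCPU provisioned

# Shared nodes: the scheduler packs pods by their requests, and bursts borrow
# from the node's common headroom (25% slack here, arbitrarily chosen).
shared_nodes_vcpu = pods * request_vcpu * 1.25   # 6.25 vCPU provisioned

print(f"pod-per-node provisioning: {per_pod_nodes_vcpu:.2f} vCPU")
print(f"shared-node provisioning:  {shared_nodes_vcpu:.2f} vCPU")
```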

Saeger commented 3 years ago

the key point here is that it won't solve the problem that you'd need to provision a node on the fly and connect it to the cluster when the user deploys the pod (if we want to stick to the "pay per pod" model).

Who wants to stick to the pay-per-pod model? A typical user would lose a lot of money on Fargate because each node has to be sized to a pod's resource limit (ie, overprovisioned for peak load), while a multi-tenant node is sized to the sum of the pods' resource requests (which is typically a lot smaller) and each pod can "burst" as much as it wants. You also lose money on the coarse granularity of Fargate resource requests.

Doesn't Fargate sound more efficient than provisioning your own resources via autoscaler since you can't always predict workloads?

almson commented 3 years ago

@Saeger You don't need to predict workloads... you're autoscaling...

mreferre commented 3 years ago

@almson it's not just about the pay-per-pod model. Flexible resource usage is an area where standard nodes can help, but people like Fargate because of everything they do not need to think about when deploying a pod (and surely Fargate could do better here). IMO we are all so used to deploying a cluster of EC2 instances to launch pods/tasks that we lose sight of the fact that, to launch EC2 instances, you don't need to manage a rack of physical servers.

Saeger commented 3 years ago

@Saeger You don't need to predict workloads... you're autoscaling...

I think this is easier said than done. Autoscaling tends to end up over-provisioning resources. But maybe my workloads face too many corner cases for the discussion here. I see pros and cons in both approaches tbh.

mreferre commented 3 years ago

I also want to be clear that I am not trying to be dismissive/defensive here. @almson you are bringing a ton of good feedback and perspective to the table. I agree with @Saeger that the world has nuances and there are different needs that need to be tackled with different approaches.

forresthopkinsa commented 2 years ago

I've been wanting to use ECS + Fargate for a scale-to-zero web app, but with a 60-90s startup time I'm realizing this approach is probably infeasible.

Edit - Reading around a little more, it seems that I was operating on incorrect assumptions to begin with: #1017

project0 commented 2 years ago

Two and a half years later and this is still an issue. Fargate with EKS could have such great potential, but the slow scaling capability is a real bummer.

spicysomtam commented 2 years ago

Two and a half years later and this is still an issue. Fargate with EKS could have such great potential, but the slow scaling capability is a real bummer.

That is because Fargate uses an EC2 VM, which is slow, and not a container. AWS is VM-centric. Try GCP GKE; it's much better than EKS.

project0 commented 2 years ago

That is because Fargate uses an EC2 VM, which is slow, and not a container. AWS is VM-centric.

This is not entirely true. AWS developed Firecracker, a microVM hypervisor. Technically nothing is stopping them from starting Fargate pods within seconds. Even EC2 instances start much faster. I can only imagine some weird or bad service in between that delays scheduling.

AffiTheCreator commented 2 years ago

I'm working on a product migration to AWS, and I'm using Fargate. The Docker image is around 5 GB, and it takes more than 3 minutes to get the container running. I don't get anywhere close to the reported 45s (that sounds like a dream).

Is it possible to run containers on EC2 instead of Fargate and not have to deal with the startup delay?

mreferre commented 2 years ago

I'm working on a product migration to AWS, and I'm using Fargate. The Docker image is around 5 GB, and it takes more than 3 minutes to get the container running. I don't get anywhere close to the reported 45s (that sounds like a dream).

Is it possible to run containers on EC2 instead of Fargate and not have to deal with the startup delay?

The 45 seconds is how long it takes (more or less) to prepare the infrastructure and start the pull of the image. 5 GB is an immense image and it's not surprising that it takes that long. You could reduce that by configuring larger pods (because they will land on larger instances with more CPU/network capacity), but there is a cost associated with that, and you probably don't want to spend more than what you'd need at run-time "just" to speed up the start time. If your start time is very much skewed towards the image size, you may want to keep an eye on other work we are doing in this area (e.g. here: https://github.com/aws/containers-roadmap/issues/696).

To answer your question, yes, absolutely you can use EC2 (typically an EKS managed node group) to deploy your pods. As long as your nodes don't churn too much (scaling in and scaling out), they can cache images, drastically lowering start times.
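
For completeness, a generic (not AWS-specific) way to keep a large image warm on every node in a node group is a pre-puller DaemonSet, roughly like the sketch below (the image name and namespace are placeholders, and the init container assumes the image has a shell; DaemonSets don't run on Fargate, so this only applies to EC2 nodes):

```python
# Sketch of a generic image pre-puller DaemonSet: every EC2 node pulls the large
# image once, so pods that land there later skip the pull. The image name and
# namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()

big_image = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/video-processor:latest"  # placeholder

prepull = client.V1DaemonSet(
    api_version="apps/v1",
    kind="DaemonSet",
    metadata=client.V1ObjectMeta(name="image-prepull"),
    spec=client.V1DaemonSetSpec(
        selector=client.V1LabelSelector(match_labels={"app": "image-prepull"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "image-prepull"}),
            spec=client.V1PodSpec(
                # The init container exists only to force the image pull; it exits immediately.
                init_containers=[
                    client.V1Container(name="pull", image=big_image, command=["sh", "-c", "true"])
                ],
                # A tiny long-lived container keeps the DaemonSet pod alive.
                containers=[
                    client.V1Container(name="pause", image="registry.k8s.io/pause:3.9")
                ],
            ),
        ),
    ),
)
client.AppsV1Api().create_namespaced_daemon_set(namespace="default", body=prepull)
```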

AffiTheCreator commented 2 years ago

The product works with real-time video input, so a short startup time is a must since we need to be able to process the video on demand. Because we work with video, the pod settings are already one of the beefiest configurations AWS offers - 4 vCPU and 8 GB RAM.

I took a look at #696 and I might implement some ideas, thank you for the link.

Regarding the EC2 answer, we have implemented a sort of container recycling cycle within the product, but there is a tradeoff between time and the cost associated with the infrastructure. There is only so long we can keep the container running before it becomes more expensive than launching a new task.
"As long as your nodes don't churn too much (scaling in and scaling out)" - the problem is that churning is exactly what the product is supposed to do.

The caching solution would indeed solve our issues; do you have a timeline for when we should expect this feature? I ask because I might need to change the infrastructure if it's going to take a long time.

mreferre commented 2 years ago

I am intrigued by your use case @AffiTheCreator. We don't have a specific timeline we can share for solutions around the ask in https://github.com/aws/containers-roadmap/issues/696, but we will probably introduce a number of mitigations over time to reduce task start time (think of caching the image as a mechanism, not the goal). Please subscribe to issue 696 for updates on this front.

mreferre commented 2 years ago

@AffiTheCreator can you reach out to me privately? I wanted to ask a few things. You can reach me on Twitter (https://twitter.com/mreferre) or via email at the same account (at amazon dot com). Thanks!

AffiTheCreator commented 2 years ago

I have emailed you at the address registered in your GitHub account (massimo@it20.info).

@mreferre

sanderjochems-whyellow commented 1 year ago

Were there any improvements in the last year?

jcputney commented 9 months ago

@sanderjochems-whyellow sure doesn't feel like it, still seeing 45-60 second start times. Not a big deal for me since I'm just autoscaling a Selenium Grid, but I can see how others would have a really hard time with that much startup lag.

handt-dev commented 4 months ago

Hi everyone, is there any solution for this issue after 5 years?