aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [request]: Ability to configure pod-eviction-timeout #159

Open ChrisCooney opened 5 years ago

ChrisCooney commented 5 years ago

Tell us about your request
I would like to be able to change configuration values for components like the kube-controller-manager. This enables greater customisation of the cluster for specific, bespoke needs. It would also go a long way toward making the cluster more resilient and self-healing.

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

At present, we have a cluster managed by EKS. The default pod-eviction-timeout is five minutes, meaning we can lose an instance and the control plane won't reschedule its pods for five minutes. Five-minute outages for things like our payment systems are simply unacceptable; the cost impact would be severe. And at present, to the best of my knowledge, the control plane is not configurable at all.

What we would like is to provide configuration parameters via the AWS API or within a Kubernetes resource such as a ConfigMap. Either would mean that, when we bring up new EKS clusters, we can automate the configuration of values like pod-eviction-timeout.

Are you currently working around this issue? No, to the best of my knowledge, it isn't something that EKS presently supports.
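
For context, on a self-managed control plane this value is a kube-controller-manager flag. A minimal sketch of how it would be set in a kubeadm-style static pod manifest (file path, image version, and values illustrative); this is exactly the layer EKS doesn't expose:

```yaml
# /etc/kubernetes/manifests/kube-controller-manager.yaml (self-managed clusters only)
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
    - name: kube-controller-manager
      image: k8s.gcr.io/kube-controller-manager:v1.14.0  # version illustrative
      command:
        - kube-controller-manager
        - --pod-eviction-timeout=30s  # override of the 5m default
```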

tabern commented 5 years ago

Thanks for submitting this, Chris. At present, the 5-minute timeout is the Kubernetes default. We're evaluating additional configuration parameters for the control plane and have added this to the list of parameters we are researching for per-cluster customization.

ChrisCooney commented 5 years ago

Hi @tabern , thanks for the response. Yes, I'm aware of the Kubernetes default. A large portion of those running K8s in production have actively tweaked these values and I worry this would be a barrier to EKS supporting some of our more critical applications.

Glad to hear this is being evaluated and look forward to seeing where it goes.

tabern commented 5 years ago

@ChrisCooney sounds good. We're going to look into this. I've updated the title of your request to specifically address this ask so we can track it.

BrianChristie commented 5 years ago

To add another use case: We also wish to be able to adjust pod-eviction-timeout, specifically to facilitate the use of Spot Instances. In the case that an instance is terminated without the running Pods being properly evicted, we want a short timeout before those Pods are rescheduled elsewhere.

Thanks!

dawidmalina commented 5 years ago

Ideally we should also be able to tune:

--node-monitor-period
--node-monitor-grace-period
geerlingguy commented 5 years ago

I would also very much like to have control over HPA scaling delays since there's no other way to do it:

--horizontal-pod-autoscaler-downscale-delay
--horizontal-pod-autoscaler-upscale-delay
whereisaaron commented 5 years ago

@BrianChristie BTW, if you like, you can monitor for spot instance termination notices and evict the pods cleanly before the node is terminated.

savar commented 5 years ago

Also --horizontal-pod-autoscaler-cpu-initialization-period and --horizontal-pod-autoscaler-downscale-stabilization. When one of our HPAs fails badly, a second HPA can only scale on CPU utilization, and because measured utilization tops out at almost twice the target, each run can at most double the replica count (https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details). So with 16 pods running we only grow to 32, then it takes 5 minutes before scaling to 64, and another 5 minutes to reach 128. If the HPA that is failing had 800 pods and drops to 300, it takes ages to cover the missing 500 pods.
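
For reference, the scale-up rule from the algorithm details linked above; with utilization capped near twice the target, each sync period can at most double the replica count:

$$\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil$$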

echoboomer commented 5 years ago

Are there plans to allow passing in any number of parameters from something like https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/ (specifically --terminated-pod-gc-threshold), or is the plan to only allow customizing certain parameters?

eladitzhakian commented 5 years ago

Could also use the ability to modify

--horizontal-pod-autoscaler-use-rest-clients

since I'm having problems with the HPA and metrics-server and can't view or configure it.

mebuzz commented 5 years ago

Looks like more and more people adopting k8s on EKS urgently need these customizations, specifically the ones already mentioned:
--horizontal-pod-autoscaler-downscale-delay, --horizontal-pod-autoscaler-upscale-delay, and --pod-eviction-timeout

We're unable to meet our worker-node patching requirements (draining helps a little, but not enough to comply).

ghost commented 5 years ago

Actually, five minutes is sometimes too long to wait for pods on failed nodes to be deleted. --pod-eviction-timeout should be configurable on EKS too.

chillybug commented 4 years ago

I really need to set the one below!
--horizontal-pod-autoscaler-upscale-delay

gillbee commented 4 years ago

Any updates? We're also looking for the ability to configure these values.

PaulMaddox commented 4 years ago

As an interim workaround, instead of using --pod-eviction-timeout, can you use Taint Based Evictions to set this on a per-pod basis? This is supported in EKS clusters running 1.13+.

There's an example in this issue: https://github.com/kubernetes/kubernetes/issues/74651
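
For anyone trying this, a minimal sketch of the per-pod tolerations involved (the pod name, image, and the 30s value are illustrative, not recommendations):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app  # illustrative name
spec:
  containers:
    - name: app
      image: example.com/app:latest  # illustrative image
  tolerations:
    # Evict this pod 30s after its node is marked not-ready or unreachable,
    # instead of waiting for the cluster-wide 5-minute default.
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 30
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 30
```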

echoboomer commented 4 years ago

Not sure if this works for everybody or everything but I recently noticed this in the AWS EKS node AMI:

https://github.com/awslabs/amazon-eks-ami/blob/master/files/kubelet.service#L14

Notice the use of $KUBELET_ARGS $KUBELET_EXTRA_ARGS here. We were able to pass in my original requirement of --terminated-pod-gc-threshold this way, but I'm not entirely certain that a) AWS honors things placed here or b) this works given the managed control plane abstraction.

ChrisCooney commented 4 years ago

> Not sure if this works for everybody or everything but I recently noticed this in the AWS EKS node AMI:
>
> https://github.com/awslabs/amazon-eks-ami/blob/master/files/kubelet.service#L14
>
> Notice the use of $KUBELET_ARGS $KUBELET_EXTRA_ARGS here. We were able to pass in my original requirement of --terminated-pod-gc-threshold this way, but I'm not entirely certain that a) AWS honors things placed here or b) this works given the managed control plane abstraction.

Yeah, this means you can configure the kubelet on the node. Alas, it doesn't allow us to configure the Kubernetes control plane.

shivarajai commented 4 years ago

Can you allow the flags below to be modified for the kube-controller-manager, so we can manage the cool-down delay aside from the default 5 minutes?
--horizontal-pod-autoscaler-downscale-delay --horizontal-pod-autoscaler-upscale-delay

jicowan commented 4 years ago

You could use this instead: https://blog.postmates.com/configurable-horizontal-pod-autoscaler-81f48779abfc

starchx commented 4 years ago

Add:

--terminated-pod-gc-threshold

calebwoofenden commented 4 years ago

Jumping in to request that --horizontal-pod-autoscaler-initial-readiness-delay also be added. We are running an HPA in our EKS clusters and are unable to fully configure it how we would like.

I'm not sure why kube chose to have all of these HPA-related configs go on the controller manager instead of being configured on the HPA resource itself, but that's another story.

mikestef9 commented 4 years ago

Note that 1.18 adds support for configurable scaling behavior:

https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-configurable-scaling-behavior

So this will be possible once EKS supports 1.18
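
Once available, a minimal sketch of what that looks like (object names and numbers illustrative; the behavior stanza replaces several of the controller-manager flags requested above on a per-HPA basis):

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa  # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app  # illustrative target
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      # per-HPA analogue of --horizontal-pod-autoscaler-downscale-stabilization
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
```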

danijelk commented 3 years ago

Still, with 1.18 it doesn't seem to work:

```
error validating data: ValidationError(HorizontalPodAutoscaler.spec): unknown field "behavior" in io.k8s.api.autoscaling.v2beta1.HorizontalPodAutoscalerSpec;
```

```
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T18:49:28Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.8-eks-7c9bda", GitCommit:"7c9bda52c425d0d56d7b93f1377a826b4132c05c", GitTreeState:"clean", BuildDate:"2020-08-28T23:04:33Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
```
toricls commented 3 years ago

@danijelk try v2beta2 for it.

danijelk commented 3 years ago

@toricls Ah, I didn't see I was on v2beta1. k8s accepted it now, thanks.

aniruddhch commented 3 years ago

Is there a way to set --terminated-pod-gc-threshold on the kube-controller-manager with EKS? A solution was suggested earlier about specifying the parameters in the AMI. Is that the recommended way to do it for now? Although that would mean having a custom AMI that needs to be updated every time there is a new AMI version for EKS.

tabern commented 3 years ago

Closing this as setting these flags is supported in K8s v1.18 and higher.

jerry123je commented 3 years ago

@tabern, I understand that hpa.v2beta2 adds the ability to configure behavior, which resolves part of these requests. However, I'm curious: how can we set pod-eviction-timeout on k8s v1.18+ without modifying the kube-controller-manager?

EdwinPhilip commented 3 years ago

We need the horizontal-pod-autoscaler-initial-readiness-delay flag to be configurable in EKS, but that's not possible so far. Any info on how to configure it for EKS?

lmgnid commented 3 years ago

Not sure why this ticket is closed and marked "Shipped". How can we set "pod-eviction-timeout"?

mibaboo commented 3 years ago

I too require horizontal-pod-autoscaler-initial-readiness-delay on EKS, and the configurable scaling behavior does not cover this.

emmeowzing commented 2 years ago

It doesn't look like I can modify --horizontal-pod-autoscaler-sync-period either.

yongzhang commented 2 years ago

also need to customize pod-eviction-timeout

sjortiz commented 2 years ago

Needing this urgently :)

marcusthelin commented 2 years ago

No status on this??

TaiSHiNet commented 2 years ago

For everyone who's following this, see #1544

dwgillies-bluescape commented 2 years ago

+1 for the ability to set --terminated-pod-gc-threshold. Evicted pods are piling up in our dev clusters, and the default limit of 12,500 evicted pods before garbage collection begins is way too high! We would like to reduce it to 100!

michaelmohamed commented 2 years ago

Is there an update on this? I really need the ability to set terminated-pod-gc-threshold to use EKS.

PrettySolution commented 2 years ago

I'd like to set terminated-pod-gc-threshold to use EKS

aaronmell commented 2 years ago

FYI, we thought we needed to increase horizontal-pod-autoscaler-initial-readiness-delay to solve an issue with autoscaling being too aggressive after rolling out new pods, causing scaling to max out.

Our issue was actually the custom metric we were scaling on. We were doing something like `sum(rate(container_cpu_cfs_throttled_seconds_total[1m]))`. The problem is that we collect metrics every 30s, and container_cpu_cfs_throttled_seconds_total doesn't increase in a linear fashion; it tends to increase in spurts.

We changed the rate window from 1m to 2m, and that smoothed things out quite a bit and fixed our issue with aggressive scale-ups.

This SO post has some good information about rate in Prometheus:

https://stackoverflow.com/questions/38915018/prometheus-rate-functions-and-interval-selections

mtcode commented 1 year ago

--horizontal-pod-autoscaler-tolerance is another flag that is only customizable via controller-manager flags; the v2beta2 API does not allow configuring it.

The default is 10%, but I have use cases where the value should be lower, making the HPA more sensitive and responsive to changes.

sftim commented 1 year ago

Does the kube-controller-manager still support a --pod-eviction-timeout argument? The docs imply it was removed in v1.24.0, and the changelog implies it'll be removed in v1.27.

daynekoroman commented 9 months ago

The default pod-eviction-timeout of 5m doesn't give pods on spot nodes a chance to shut down gracefully: when a spot node goes down, the pod is still registered as running and ready until the health-check interval elapses, which leads to 502 errors from the ALB.

xzp1990 commented 8 months ago

Hi team, 5 minutes is too long for node issues. We hope the service team can allow users to change the settings below:
--node-status-update-frequency
--node-monitor-period
--node-monitor-grace-period
--pod-eviction-timeout

des1redState commented 3 weeks ago

Really gonna need to set --horizontal-pod-autoscaler-initial-readiness-delay, pretty please.