Autoscaler failed to scale up for spot nodepool due to remediator.aks.microsoft.com/unschedulable taint

nclaeys commented 2 years ago

What happened:

We encountered already 3 times the issue that the autoscaler failed to scale up from 0 for one ofour spot nodepools because it thinks the nodepool had the taint remediator.aks.microsoft.com/unschedulable on it. This is problematic since the nodepool cannot be used for new pods as they do not tolerate this taint and thus our cluster does not recover from it automatically.

There seems to be an issue that when the last node of a nodepool is tainted and removed, that the cluster autoscaler in-memory state of that nodepool keeps this remediator taint. The nodepool itself does not have this taint on it, only the autoscaler mentions it.

What you expected to happen:

If the last spot node is preemted, it is fair to add the taint and drain the node. However I expect the autoscaler to launch a new node for the same spot nodepool when new pods are created for that given nodepool. Now the autoscaler is stuck until you manually force the scale up of the nodepool after which the autoscaler starts working again.

How to reproduce it (as minimally and precisely as possible):

I have no good reproduction scenario, we encountered it a couple of times already on our staging environment.

Anything else we need to know?:

Environment:

Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.6", GitCommit:"07959215dd83b4ae6317b33c824f845abd578642", GitTreeState:"clean", BuildDate:"2022-03-30T18:28:25Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
Size of cluster (how many worker nodes are in the cluster?) Typically only 2 other nodes are running: 1 from the system nodepool and 1 for another on-demand or spot nodepool
General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.) We run spark as well as generic container batch jobs
Others: I think it is related to what is described in the following Q/A: https://docs.microsoft.com/en-us/answers/questions/791916/cannot-schedule-pods-in-aks-gpu-node-remediatoraks.html

ghost commented 2 years ago

Hi nclaeys, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such: 1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster. 2) Please abide by the AKS repo Guidelines and Code of Conduct. 3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics? 4) Make sure your subscribed to the AKS Release Notes to keep up to date with all that's new on AKS. 5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue. 6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

ghost commented 2 years ago

Triage required from @Azure/aks-pm

nclaeys commented 2 years ago

Figured out an easy way to reproduce the issue even without the need for a scale up from 0.

Reproduction scenario

taint all nodes for a given nodepool with remediator.aks.microsoft.com/unschedulable=somevalue:NoSchedule
delete a pod that should be executed on the selected nodepool
Autoscaler will not scale up: Error: cluster-autoscaler pod didn't trigger scale-up: 3 node(s) had taint {remediator.aks.microsoft.com/unschedulable: somevalue}, that the pod didn't tolerate

You can also taint all the nodes in your cluster to reproduce the scenario.

Cluster autoscaler The cluster-autoscaler has a solution for this problem through the ignore-taint argument, which is what we use on AWS to ignore spot interruptions.

Solution to resolve the issue for AKS

Add the ignore-taint: remediator.aks.microsoft.com/unschedulable to the aks autoscaler configuration when deploying it in the control plane. This ensures that the autoscaler works correctly with the taints created by AKS
Allow users to configure extra arguments to the cluster-autoscaler in order to make it tolerant to custom taints as well

OvervCW commented 2 years ago

@nclaeys We also just had to deal with this issue. Do you think it's only reproducible with spot pools that scale from 0?

nclaeys commented 2 years ago

@OvervCW I initially thought so but now I know better. It can be reproduced by tainting all nodes of the nodepool and then trying to scale up for that nodepool. We had a similar issue on AWS and that is how I tested it out and figured out the ignore-taint fixed the issue.

ghost commented 2 years ago

Action required from @Azure/aks-pm

ghost commented 2 years ago

Action required from @Azure/aks-pm

gandhipr commented 2 years ago

@nclaeys If you taint all the nodes in the nodepool but one, scale-up will be successful. But if all the nodes are tainted - scale-up will fail. This is by design that if all the nodes in a nodepool are tainted (not allowing more pods on existing nodes) - that means we don't want to schedule more pods in that nodepool. I tested a scenario where nodepool with min-size = 0, cordoned all the nodes in that nodepool (nodes were tainted and drained). CA deleted these nodes after they were unneeded for a while. When new pods were now created, nodepool was scaled-up and pods were scheduled. Were you testing with the same setup?

Could you please let me know the --min-size parameter for the spot nodepool you specified? I might be in a better position to help.

Also, could you please elaborate on the use-case for having ignore-taints parameter for both - ignoring aks managed taints and user added taints?

nclaeys commented 2 years ago

@gandhipr

Test scenario? Just cordoning the node is not enough, that works correctly but it does not reproduce the actual problem. Issues only arise when you taint the nodes externally, not by the cluster autoscaler itself.

All the nodes are tainted - scale-up will fail by design? What is the reasoning that you do not scale up when all nodes are tainted with the remediator taint? This is exactly the issue we are encountering. The last node from our nodepool is deleted and had the remediator taint on it. From that moment the nodepool can never scale up automatically and we need to manually intervene to get the issue resolved. I do not see a reason why you would not allow a scale up anymore in this case, could you elaborate? Same applies when all nodes would have the remediator taint, I do not see why you would not want to scale up. For the record, the issue only occurs when scaling up from 0 but to me it looks like the same scenario.

This is exactly the reason why the cluster autoscaler allows the ignore-taint options since for certain taints you still want to scale up since the taint for example does not impact new nodes coming up. This applies in my opinion for the remediator taint and in our case also for the spot-interruption taint on AWS set by the aws-node-termination-handler.

Nodepools used? In our cluster we manage multiple nodepools with different sku's of nodes such that we can decide to run large jobs on the large nodes and smaller jobs on smaller nodes. All nodepools have a min-size of 0 since we run batch workloads and do not want nodes running unnecessary. The issue happens almost exclusively with the nodepool with the largest sku, since that one is not used all the time.

Use cases for ignore-taint option? At the moment my main concern is to be able to handle with aks managed taints as these break our cluster at the moment. It might be useful to allow for user added taints, for example on AWS we use the node-termination handler which adds a custom taint to the node, which can cause the same issue which is why we add it to the ignore-taints parameter.

nclaeys commented 2 years ago

Is there any update on this? Since it takes this long to make progress on this ticket, our current workaround is to run the cluster-autoscaler ourselves. This way we can ignore the remediator taints. For others having the same issue, you can take a look at: autoscaler-helm-chart I ignore the following taints (we are using the aks-node-termination handler):

"aks-node-termination-handler/preempt"
"aks-node-termination-handler/freeze"
"aks-node-termination-handler/terminate"
"aks-node-termination-handler/reboot"
"aks-node-termination-handler/redeploy"
"remediator.aks.microsoft.com/unschedulable"

JeromeSoussens commented 1 year ago

Hi all,

We had the same issue but on regular nodes when scaling down from 1 to 0 node and when the scaled-down node has a taint preventing pod scheduling on it : our last node had a taint put by a deamonset to indicate a NFS mount on the host is not available preventing pod scheduling on it.

This taint seems to be kept in memory in the auto-scaler.

The situation has been resolved after a cluster autoscaler restart.

Is there any update on this issue @pavneeta ?

duhow commented 1 year ago

This is affecting us as well. 3 days ago we received several Health Event Activated with CPU Pressure / The AKS cluster does not have enough CPU to run all workloads. , which I suspect this is "Spot preemption".

After this got resolved (Spot instances available for request), my AKS cluster still won't perform autoscaling with error:

Normal   NotTriggerScaleUp  48s (x35 over 41h)     cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had untolerated taint {sku: gpu}, 4 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {remediator.aks.microsoft.com/unschedulable: }, 1 max node group size reached

There are resources available, but won't trigger scale up of the nodepool due to:

2 node(s) had untolerated taint {remediator.aks.microsoft.com/unschedulable: }

🔧 Workaround is to manually scale the nodepool (add instances), and later autoscaling works again.

I cannot find any other information details of this issue or how to fix it properly, but this is still ongoing.

gandhipr commented 1 year ago

I see 2 issues - custom taints and remediator taint added by aks

custom taints can be ignored by prefixing the custom taints with IgnoreTaintPrefix = "ignore-taint.cluster-autoscaler.kubernetes.io/" as mentioned here - https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/utils/taints/taints.go#L47
I'll look into the remediator taint issue and get back.

gpanagiotidis commented 1 year ago

We are facing this issue as well.

comtalyst commented 1 year ago

Sorry to keep you waiting. The issue should be fixed since v20230430 which should have finished rolling out. Cluster Autoscaler should now scales up even when all nodes are tainted with that taint.

Azure / AKS

Autoscaler failed to scale up for spot nodepool due to remediator.aks.microsoft.com/unschedulable taint #2976