Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[Features] Please provide functionality through API to force drain node pool #2090

Open t3mi opened 3 years ago

t3mi commented 3 years ago

What happened: When a node pool is small, or carries taints so that pods configured with a pod disruption budget cannot be rescheduled onto a bigger node pool, the following error occurs during node pool removal:

{
  "code":  "PodDrainFailure",
  "message":  "Node 'aks-vmss-17400673-vmss000001' failed to be drained with error: 'Drain did not complete pods [<pod-name>] within 10m0s'" 
}

What you expected to happen: Node pool successfully removed.

How to reproduce it (as minimally and precisely as possible): Deploy a cluster with an additional, tainted node pool. Deploy an application with a pod disruption budget into that node pool, then try to remove the node pool (see the sketch below).
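
A minimal sketch of this reproduction, assuming an existing cluster (all resource names, taints, and images below are illustrative):

```bash
# Add a tainted user node pool to an existing cluster
# (cluster/resource-group/pool names are placeholders).
az aks nodepool add \
  --resource-group myRG \
  --cluster-name myCluster \
  --name tainted \
  --node-count 1 \
  --node-taints "app=critical:NoSchedule"

# Deploy a pod that tolerates the taint, plus a PDB that forbids any disruption.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pdb-demo
spec:
  replicas: 1
  selector:
    matchLabels: {app: pdb-demo}
  template:
    metadata:
      labels: {app: pdb-demo}
    spec:
      tolerations:
      - {key: app, operator: Equal, value: critical, effect: NoSchedule}
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-demo
spec:
  minAvailable: 1
  selector:
    matchLabels: {app: pdb-demo}
EOF

# Deleting the node pool now fails with PodDrainFailure: the pod cannot be
# evicted (PDB) and has nowhere else to go (the taint exists only on this pool).
az aks nodepool delete --resource-group myRG --cluster-name myCluster --name tainted
```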

Anything else we need to know?: For reference, kubectl drain has a --disable-eviction flag to force-drain a node.
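
For reference, the kubectl-side equivalent of such a force drain looks roughly like this (node name taken from the error above; use with care, since it bypasses PodDisruptionBudgets):

```bash
# --disable-eviction deletes pods directly instead of going through the
# eviction API, so PodDisruptionBudgets are not consulted.
kubectl drain aks-vmss-17400673-vmss000001 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --disable-eviction
```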

Environment:

ghost commented 3 years ago

Hi t3mi, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:
1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

ghost commented 3 years ago

Triage required from @Azure/aks-pm

ghost commented 3 years ago

Action required from @Azure/aks-pm

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

miwithro commented 3 years ago

@paulgmiller can your team assist?

ghost commented 3 years ago

Triage required from @Azure/aks-pm

ghost commented 3 years ago

Action required from @Azure/aks-pm

miwithro commented 3 years ago

@marwanad can you assist?

paulgmiller commented 3 years ago

@t3mi @miwithro @marwanad Sorry I missed this. We are considering this option for a specific node pool (I jokingly refer to it as the "--yolo" flag). We're a little concerned that we're giving people a gun to shoot themselves in the foot. Is there a reason you prefer a flag over removing/modifying the PodDisruptionBudget in question? The worry is that disable-eviction ignores every deployment, while with a PodDisruptionBudget you can be more selective about important services.
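
For context, the selective alternative suggested here could look something like the following sketch (namespace and PDB names are placeholders):

```bash
# Find the budgets that are currently blocking disruptions.
kubectl get poddisruptionbudgets --all-namespaces

# Relax only the offending budget...
kubectl patch pdb my-pdb -n my-namespace --type merge -p '{"spec":{"minAvailable":0}}'

# ...or remove it entirely before deleting the node pool.
kubectl delete pdb my-pdb -n my-namespace
```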

ghost commented 3 years ago

Triage required from @Azure/aks-pm

vapomalov commented 3 years ago

@paulgmiller The OP needs it in this case for destroying a cluster with Terraform, I think; hence the ticket reference. At least in our case, we want to replace one node pool with another via Terraform.

ghost commented 3 years ago

Action required from @Azure/aks-pm

t3mi commented 3 years ago

@paulgmiller We use Terraform during deployment and in our CI, so we hit these errors when tearing down clusters with node pools. Rather than adding an extra step to gather and remove all PodDisruptionBudget objects and slowing down CI tests, we would prefer an additional flag to force-remove node pools without caring what is running inside them.
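
The extra teardown step being described would amount to something like this (a sketch; it removes every PDB in the cluster and is only reasonable for disposable CI clusters):

```bash
# Remove all PodDisruptionBudgets before destroying node pools / the cluster.
kubectl delete poddisruptionbudgets --all --all-namespaces
```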

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

vapomalov commented 3 years ago

We also opened an issue with Microsoft Azure support, TrackingID #2103170050002008. Unfortunately, while I was on leave it was closed as "not an issue". This feature is implemented by every big cloud provider. I am currently reopening the issue.

palma21 commented 3 years ago

We will consider this feature for nodepool delete to start with.

aks nodepool delete --force
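
Spelled out, the proposed call might look like this (hypothetical syntax at this stage; the flag name was still under discussion, and the resource names are placeholders):

```bash
# Proposed: delete the node pool even when pods cannot be evicted within their PDBs.
az aks nodepool delete \
  --resource-group myRG \
  --cluster-name myCluster \
  --name mypool \
  --force
```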

alvinli222 commented 2 years ago

We have done some initial ideation on this feature and will be working to release it in the coming months.

digihunch commented 2 years ago

I had to remember to remove PodDisruptionBudgets with ALLOWED DISRUPTIONS = 0 before destroying the node group. The requested feature would be very helpful.
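
One way to spot such budgets ahead of a teardown might be the following (a sketch, assuming jq is available; not part of any AKS tooling):

```bash
# List PDBs whose status currently allows zero voluntary disruptions.
kubectl get pdb --all-namespaces -o json \
  | jq -r '.items[] | select(.status.disruptionsAllowed == 0)
           | "\(.metadata.namespace)/\(.metadata.name)"'
```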

CpuID commented 2 years ago

@palma @qpetraroia @justindavies any ETA on when aks nodepool delete --force will be ready...? :)

miwithro commented 1 year ago

@kaarthis can you look into this one?

kaarthis commented 1 year ago

Yes, this is part of the discussion on the node pool API. I can look into this and report back.

aslafy-z commented 1 year ago

Any news on this one? This issue is becoming blocking for us. We want to be able to destroy our dev clusters whatever state they are in.

The documentation states that no drain is done on node pool removal; however, it seems the eviction controller is called in some way. See https://learn.microsoft.com/en-us/azure/aks/resize-node-pool?tabs=azure-cli#remove-the-existing-node-pool

Any progress @kaarthis @palma @qpetraroia @justindavies @alvinli222?

MattJeanes commented 11 months ago

We're also running into this via Terraform; an option to force delete makes sense to me.

Code="KubernetesAPICallFailed" Message="Drain node akswin00000q failed when evicting pod rendering-deployment-84cbc76c47-kccm9. Eviction failed with Too many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: API call to Kubernetes API Server failed."

In our particular case we're trying to delete the whole cluster, but Terraform deletes the individual node pools first. We're working around this by deleting the entire cluster directly, which appears to avoid the problem, but it is still something I'd love to see improved.
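
The workaround described above corresponds to deleting at the cluster level rather than per node pool (resource names are placeholders):

```bash
# Delete the managed cluster in one call instead of removing node pools first.
az aks delete --resource-group myRG --name myCluster --yes --no-wait
```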

oferchen commented 11 months ago

+1

alvinli222 commented 10 months ago

Hi everyone, we pushed this feature into Public Preview a while back, but some bugs were found. We are actively working on it now and hope to re-release it to Public Preview ASAP, with an API property flag and a corresponding CLI command. I have communicated this update to the Terraform team as well.
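
For anyone tracking this, the preview surface is expected to look roughly like the sketch below; the exact flag name shown here is an assumption based on the preview at the time, so check the current AKS documentation before relying on it:

```bash
# Assumed preview syntax (requires the aks-preview CLI extension); verify against
# current docs, as flag names can change between preview releases.
az aks nodepool delete \
  --resource-group myRG \
  --cluster-name myCluster \
  --name mypool \
  --ignore-pod-disruption-budget=true
```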

landerss1 commented 5 months ago

Hi! Any update on this? Without the ability to force drain nodes, the existing feature of being able to start/stop nodepools is not very useful.