Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] Enabling AKS Cost Analysis fails because draining node fails #4306

Closed mortenjoenby closed 2 weeks ago

mortenjoenby commented 4 months ago

Describe the bug

I am trying to enable Cost Analysis on a dev AKS cluster, following the documentation. I see these events in AKS:

`Events:
Type Reason Age From Message
Normal RegisteredNode 55m node-controller Node aks-steps01-35369385-vmss00000m event: Registered Node aks-steps01-35369385-vmss00000m in Controller
Normal RegisteredNode 11m node-controller Node aks-steps01-35369385-vmss00000m event: Registered Node aks-steps01-35369385-vmss00000m in Controller
Normal Drain 7m20s upgrader Draining node: aks-steps01-35369385-vmss00000m
Normal NodeNotSchedulable 7m11s kubelet Node aks-steps01-35369385-vmss00000m status is now: NodeNotSchedulable
Warning Drain 4m39s (x33 over 7m19s) upgrader Eviction blocked by Too many Requests (usually a pdb): testk8sdev1-dev01-backend-mh-sts-0
Warning Drain 4m39s (x33 over 7m19s) upgrader Eviction blocked by Too many Requests (usually a pdb): testk8sdev1-dev01-backend-fg-sts-0
Warning Drain 2m18s (x61 over 7m19s) upgrader Eviction blocked by Too many Requests (usually a pdb): testk8sdev1-dev01-backend-bg-sts-0`
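To see at a glance which pods are holding up the drain, the event text above can be filtered for the "Eviction blocked" warnings. The following is only a sketch: `blocked_pods` is a hypothetical helper name, and it assumes the event messages keep the exact wording shown in this issue.

```shell
# Hypothetical helper: read node events on stdin and print each pod
# whose eviction was blocked, one per line, without duplicates.
blocked_pods() {
  grep -o 'Eviction blocked by Too many Requests (usually a pdb): [^ ]*' \
    | awk '{ print $NF }' \
    | sort -u
}

# Example usage (assumes kubectl access to the cluster):
#   kubectl describe node aks-steps01-35369385-vmss00000m | blocked_pods
```

Because `grep -o` emits every match separately, this works whether the events arrive one per line or run together on a single line.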

I don't really understand whether the node is being drained, and if so, why. If you really are trying to drain all nodes when enabling this, this MUST be highlighted in capital RED letters! We can't afford to have workloads shut down because of this.

To Reproduce

Steps to reproduce the behavior: I am running `az aks update --resource-group aks-we-dev01-2402 --name aks-we-dev01-2402 --enable-cost-analysis`

It errors out after 30 minutes. (UpgradeFailed) Drain node aks-steps01-35369385-vmss000007 failed when evicting pod pacacust02-ccdg01-backend-sts-0. Eviction failed with Too many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: API call to Kubernetes API Server failed.

Code: UpgradeFailed Message: Drain node aks-steps01-35369385-vmss000007 failed when evicting pod pacacust02-ccdg01-backend-sts-0. Eviction failed with Too many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: API call to Kubernetes API Server failed.
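The error points at Pod Disruption Budgets, and the Kubernetes eviction API does answer with 429 Too Many Requests when a PDB currently allows zero disruptions. A minimal sketch for listing such PDBs follows. `zero_disruption_pdbs` is a hypothetical helper name, and the input format (namespace, name, `status.disruptionsAllowed`, space-separated) is an assumption produced by the `kubectl` call in the usage comment. It does not explain why enabling Cost Analysis triggered a drain in the first place; it only shows which budgets would block any drain.

```shell
# Hypothetical helper: read "namespace name disruptionsAllowed" lines on
# stdin and print "namespace/name" for every PDB that allows 0 disruptions.
zero_disruption_pdbs() {
  awk '$3 == "0" { print $1 "/" $2 }'
}

# Example usage (assumes kubectl access to the cluster):
#   kubectl get pdb --all-namespaces \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.status.disruptionsAllowed}{"\n"}{end}' \
#     | zero_disruption_pdbs
```

A plain `kubectl get pdb --all-namespaces` shows the same information in its ALLOWED DISRUPTIONS column.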

Expected behavior

I expect Cost Analysis to be enabled pretty quickly without service interruption.

Environment (please complete the following information):

az version

    {
      "azure-cli": "2.60.0",
      "azure-cli-core": "2.60.0",
      "azure-cli-telemetry": "1.1.0",
      "extensions": {
        "account": "0.2.5",
        "aks-preview": "4.0.0b4",
        "costmanagement": "0.2.1",
        "reservation": "0.3.0",
        "storage-blob-preview": "0.7.2",
        "storage-preview": "0.8.4",
        "vm-repair": "0.5.3"
      }
    }

JoeyC-Dev commented 4 months ago

I strongly suspect that this is not caused by this feature. Upgrading nodes should never be part of enabling Cost Analysis. I can neither reproduce this nor find any of my nodes being upgraded or re-imaged. (screenshot attached)

I believe you should submit a support ticket, as this is probably being triggered by something else.

mortenjoenby commented 4 months ago

Hi @JoeyC-Dev. OK, fine. I never get the output you see, and it never says "/ Running ...". It fails after about 20 minutes. I will create a support ticket.

AllenWen-at-Azure commented 1 month ago

Hi @mortenjoenby, did you file a support ticket and get help from a Microsoft support engineer? Does the issue still exist in your cluster?

microsoft-github-policy-service[bot] commented 2 weeks ago

This issue will now be closed because it hasn't had any activity for 7 days after stale. mortenjoenby feel free to comment again on the next 7 days to reopen or open a new issue after that time if you still have a question/issue or suggestion.