Azure / azure-cli

Azure Command-Line Interface
MIT License

az aks command invoke: does not work if user nodes have taints #25336

Open jetnet opened 1 year ago

jetnet commented 1 year ago

Describe the bug

Command Name: `az aks command invoke -n $AKS_NAME -c "kubectl cluster-info"`

Errors:

(KubernetesOperationError) Failed to run command due to cluster perf issue, container command-0be71db980254f398cdecce07419fbed in aks-command namespace did not start within 30s on your cluster, retry may helps. If issue persist, you may need to tune your cluster with better performance (larger node/paid tier).
Code: KubernetesOperationError
Message: Failed to run command due to cluster perf issue, container command-0be71db980254f398cdecce07419fbed in aks-command namespace did not start within 30s on your cluster, retry may helps. If issue persist, you may need to tune your cluster with better performance (larger node/paid tier).

Event Message:

0/3 nodes are available: 1 node(s) had untolerated taint {agentpool: user}, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.

To Reproduce:

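Steps to reproduce the behavior. Note that argument values have been redacted, as they may contain sensitive information.

- Create a user nodepool with a taint `"agentpool=user:NoSchedule"`
- Try to execute the command:
  - `az aks command invoke -n NAME -c "kubectl cluster-info"`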

Expected Behavior

aks command invoke should be able to start on system nodes with the default taint: CriticalAddonsOnly=true

Environment Summary

Linux-5.15.79.1-microsoft-standard-WSL2-x86_64-with, Alpine Linux v3.17
Python 3.10.9
Installer: PIP

azure-cli 2.44.1

Extensions:
account 0.2.5

Dependencies:
msal 1.20.0
azure-mgmt-resource 21.1.0b1

Additional Context

yonzhan commented 1 year ago

route to CXP team

PramodValavala-MSFT commented 1 year ago

@jetnet The underlying REST API for this command schedules a pod without any tolerations by default. Ideally, non-critical workloads should not be deployed on a system node, because such workloads could starve critical components of resources.

That being said, it would be best to create a feature request to add support for adding tolerations to unblock similar situations.

Since the Azure CLI itself doesn't have control over this, there is nothing that can be done on the CLI side; the command should gain this capability once the underlying REST API supports it.
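For readers following along, the scheduling behavior described in this comment can be illustrated with a minimal sketch of Kubernetes taint/toleration matching. This is not the AKS or REST API implementation. The node names are hypothetical, and the `NoSchedule` effect on the `CriticalAddonsOnly=true` taint is an assumption (the event message reports the taint as untolerated but does not state its effect). The sketch only shows why a pod with no tolerations fits none of the three nodes in the event message, and what a `CriticalAddonsOnly` toleration would change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule" or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator "Exists" tolerates every key
    operator: str = "Equal"  # "Equal" (default) or "Exists"
    value: str = ""
    effect: str = ""         # empty effect tolerates every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches the given taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    # Operator "Equal": both key and value must match.
    return tol.key == taint.key and tol.value == taint.value


def schedulable_nodes(nodes, tolerations):
    """List nodes whose scheduling-blocking taints are all tolerated."""
    result = []
    for name, taints in nodes.items():
        # PreferNoSchedule is only a soft preference, so it never blocks.
        blocking = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
        if all(any(tolerates(tol, t) for tol in tolerations) for t in blocking):
            result.append(name)
    return result


# Hypothetical node names; taint keys and values are taken from the event
# message. The NoSchedule effect on CriticalAddonsOnly is an assumption.
NODES = {
    "system-0": [Taint("CriticalAddonsOnly", "true", "NoSchedule")],
    "system-1": [Taint("CriticalAddonsOnly", "true", "NoSchedule")],
    "user-0": [Taint("agentpool", "user", "NoSchedule")],
}

# A pod without tolerations, as the command pod is described above.
print(schedulable_nodes(NODES, []))  # -> []

# The same pod with a toleration for the system-node taint.
system_toleration = Toleration(key="CriticalAddonsOnly", operator="Exists")
print(schedulable_nodes(NODES, [system_toleration]))  # -> ['system-0', 'system-1']
```

With no tolerations the result is empty, which corresponds to the "0/3 nodes are available" event. With the single `CriticalAddonsOnly` toleration, only the system nodes become eligible and the tainted user node stays excluded.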

ghost commented 1 year ago

Hi @jetnet. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

jetnet commented 1 year ago

@PramodValavala-MSFT, really appreciate your clarification. Should I create a feature request or are you going to do that? Thanks!

ghost commented 1 year ago

Hi @jetnet, since you haven’t asked that we “/unresolve” the issue, we’ll close this out. If you believe further discussion is needed, please add a comment “/unresolve” to reopen the issue.

jetnet commented 1 year ago

/unresolve

I think it's an issue with the current implementation and NOT a feature request. Look, you cannot run `az aks command invoke` if your AKS user nodes have a taint. It's not OK. Please re-open. Thanks!

ghost commented 1 year ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/aks-pm.

Issue Details
## Describe the bug

**Command Name** `az aks command invoke -n $AKS_NAME -c "kubectl cluster-info"`

**Errors:**

```
(KubernetesOperationError) Failed to run command due to cluster perf issue, container command-0be71db980254f398cdecce07419fbed in aks-command namespace did not start within 30s on your cluster, retry may helps. If issue persist, you may need to tune your cluster with better performance (larger node/paid tier).
Code: KubernetesOperationError
Message: Failed to run command due to cluster perf issue, container command-0be71db980254f398cdecce07419fbed in aks-command namespace did not start within 30s on your cluster, retry may helps. If issue persist, you may need to tune your cluster with better performance (larger node/paid tier).
```

Event Message:

```
0/3 nodes are available: 1 node(s) had untolerated taint {agentpool: user}, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
```

## To Reproduce:

Steps to reproduce the behavior. Note that argument values have been redacted, as they may contain sensitive information.

- create a user nodepool with a taint `"agentpool=user:NoSchedule"`
- try to execute command:
  - `az aks command invoke -n NAME -c "kubectl cluster-info"`

## Expected Behavior

`aks command invoke` should be able to start on system nodes with the default taint: `CriticalAddonsOnly=true`

## Environment Summary

```
Linux-5.15.79.1-microsoft-standard-WSL2-x86_64-with, Alpine Linux v3.17
Python 3.10.9
Installer: PIP

azure-cli 2.44.1

Extensions:
account 0.2.5

Dependencies:
msal 1.20.0
azure-mgmt-resource 21.1.0b1
```

## Additional Context
Author: jetnet
Assignees: -
Labels: `Service Attention`, `question`, `AKS`, `customer-reported`, `Service`, `needs-team-attention`, `Auto-Assign`
Milestone: -

PramodValavala-MSFT commented 1 year ago

@jetnet Apologies for the delay on this one! Since this requires a service-side change, I will reassign this case to the team that owns the service and share the feedback with them internally.

mjnovice commented 5 months ago

@PramodValavala-MSFT any updates on this?