Azure / azure-cli

Azure Command-Line Interface
MIT License
4.02k stars 2.99k forks source link

AKS Polling fails after 15 min #26654

Open SebblerX3 opened 1 year ago

SebblerX3 commented 1 year ago

Describe the bug

When starting an AKS via az-cli, we get an error if the start-up takes too long (approx after 15 min). If the operation is faster, the CLI works as expected. It is noticeable that the CLI consumes approx. 100-150MiB of memory after 15 min and is not always released correctly afterwards.

Related command

az aks start --name foo --resource-group bar

Errors

There are none, the process exits with code 137.

Issue script & Debug output

Unfortunately the log is too long for the issue. Full log is available here

Expected behavior

The CLI should not fail but wait for the operation to complete.

Environment Summary

azure-cli 2.49.0

core 2.49.0 telemetry 1.0.8

Extensions: azure-devops 0.25.0

Dependencies: msal 1.20.0 azure-mgmt-resource 22.0.0

Python location '/usr/bin/python3.9' Extensions directory '/opt/azcliextensions'

Python (Linux) 3.9.16 (main, Dec 21 2022, 10:57:18) [GCC 8.5.0 20210514 (Red Hat 8.5.0-17)]

Legal docs and information: aka.ms/AzureCliLegal

Unable to check if your CLI is up-to-date. Check your internet connection.

Additional context

The az-cli is invoked within a podman-container on a RHEL 8 machiene as part of an Azure DevOps pipeline.

yonzhan commented 1 year ago

Thank you for opening this issue, we will look into it.

ghost commented 1 year ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/aks-pm.

Issue Details
### Describe the bug When starting an AKS via az-cli, we get an error if the start-up takes too long (approx after 15 min). If the operation is faster, the CLI works as expected. It is noticeable that the CLI consumes approx. 100-150MiB of memory after 15 min and is not always released correctly afterwards. ### Related command az aks start --name foo --resource-group bar ### Errors There are none, the process exits with code 137. ### Issue script & Debug output Unfortunately the log is too long for the issue. Full log is available [here](https://gist.github.com/SebblerX3/e7d031f052bff9481ab0eab77ce6d0be) ### Expected behavior The CLI should not fail but wait for the operation to complete. ### Environment Summary azure-cli 2.49.0 core 2.49.0 telemetry 1.0.8 Extensions: azure-devops 0.25.0 Dependencies: msal 1.20.0 azure-mgmt-resource 22.0.0 Python location '/usr/bin/python3.9' Extensions directory '/opt/azcliextensions' Python (Linux) 3.9.16 (main, Dec 21 2022, 10:57:18) [GCC 8.5.0 20210514 (Red Hat 8.5.0-17)] Legal docs and information: aka.ms/AzureCliLegal Unable to check if your CLI is up-to-date. Check your internet connection. ### Additional context The az-cli is invoked within a podman-container on a RHEL 8 machiene as part of an Azure DevOps pipeline.
Author: SebblerX3
Assignees: -
Labels: `bug`, `Service Attention`, `AKS`, `customer-reported`, `Auto-Assign`
Milestone: -
navba-MSFT commented 1 year ago

Adding Service team to look into this.

SebblerX3 commented 1 year ago

Any updates?

FumingZhang commented 1 year ago

Hey @SebblerX3, sorry for the late reply. As far as I know, aks command does not set the polling timeout and it seems the command is killed by OS.

2023-06-08T04:00:12.2291711Z ##[error]Bash exited with code '137'.

For operation b7232b26-c517-4efe-84c4-5663da026ae0, it was triggered on 2023-06-08. The log on the server side is not retained for such a long time. I am sorry that I cannot check what caused the cluster to not complete the start operation in a short period of time.