Open kaarthis opened 2 years ago
This is being closed in favor of "Upgrade operation validations real time, alerts, error messaging and Doc improvements / investments".
What type of issues would you like us to flag before running upgrades? How would you use upgrades, especially auto upgrades, with these pre-validations? Would you like these validations to be automated - if so, how? How useful would these validations be to you if they were inside the upgrade command vs. outside as a preview?
Kaarthi, will this also include showing linter errors as pre-validation and any autohealing actions?
What type of issues would you like us to flag before running upgrades?
a.) Kubernetes resources / API versions that are no longer compatible after the upgrade
b.) Kubernetes resources / API versions that become deprecated after the upgrade
c.) Provide an extension concept, e.g. running a custom k8s task which can return success or failure

How would you use upgrades, especially auto upgrades, with these pre-validations? Would you like these validations to be automated - if so, how?
Pre-validations shall be executed each time, but in the case of auto upgrade there shall be a configurable option NOT to upgrade automatically if a.) is the case, and in any case it shall be possible to send a notification via a dedicated audit log event for a.) and b.)

How useful would these validations be to you if they were inside the upgrade command vs. outside as a preview?
Both shall be offered. The preview might be executed periodically to prompt people to do maintenance, so it never actually happens that an upgrade is blocked because of case a.)
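For checks a.) and b.), a rough approximation that already works from outside the upgrade operation today is to combine the AKS upgrade listing with a deprecated-API scan. The resource group and cluster names below are placeholders, and kubent (kube-no-trouble) is a third-party tool whose flags vary by version - this is only a sketch, not the requested built-in validation:

```shell
# Which Kubernetes versions could this cluster move to?
az aks get-upgrades --resource-group my-rg --name my-aks --output table

# Scan the live cluster for objects served from API versions that are deprecated
# or removed in upcoming Kubernetes releases (community tool, run against current context).
kubent
```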
Is this related to the issues we're experiencing with upgrading node pools, where the operation errors out before the upgrade process even starts because of Pods placed on other node pools that have PDBs that don't allow disruptions?
I.e. in nodepool1 we have a stateful singleton that runs on 1 Pod and has a PDB that doesn't allow disruptions. When trying to upgrade only nodepool2 through the Azure Portal or Azure CLI, we receive an error message that it can't upgrade due to PDBs not allowing disruptions.
In my opinion the PDB check should only be done against the actual node pool in scope for the upgrade, not all PDBs in the cluster.
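For anyone hitting this, a quick way to see which PDBs currently allow zero disruptions and where their pods actually run. Namespace, pod and node names below are illustrative:

```shell
# PDBs and their currently allowed disruptions; anything showing 0 will block a drain.
kubectl get pdb --all-namespaces

# For a suspect pod, find the node it runs on and that node's pool.
kubectl get pod -n my-namespace my-singleton-0 -o wide
kubectl get node aks-nodepool1-12345678-vmss000000 -o jsonpath='{.metadata.labels.agentpool}'
```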
That's sad: no action, and now even stale. Why is there no update, even though product management claims to be working on this?
> Is this related to the issues we're experiencing with upgrading node pools [...]? In my opinion the PDB check should only be done against the actual node pool in scope for the upgrade, not all PDBs in the cluster.
Did you find a solution for this?
We are running into the same kind of issues on our clusters, in our case with Kafka and the Strimzi Drain Cleaner, which needs a PDB with maxUnavailable: 0 (https://github.com/strimzi/drain-cleaner#see-it-in-action).
Not sure when the AKS pre-checks were added, but a few months ago an upgrade would always start without doing any pre-checks on PDBs. Now the upgrade process doesn't even start, and there is no clear status available anywhere about it.
"In progress" means in development. In my opinion it's not deployed yet, so that should not be the cause of your issue. You should create an issue; they may have control plane logs explaining why an upgrade does not start.
We work around this by "upgrading" manually: deploying new node pools with the latest version and draining the old ones. It's a semi-manual process, but in short we:

- look up the latest supported version from `azurerm_kubernetes_service_versions`
- create a new node pool for that version (the AKS node pool name is a hash of a human-readable name and the version number, i.e. `substr(sha256("elastic-1.24.9"),0,10)`)
- label it with `nodepool.our.org/name: elastic`, `nodepool.our.org/version: 1.24.9` and `nodepool.our.org/latest: true`, where `nodepool.our.org/latest` marks the current pool
- move pods from the old nodes to the new nodes

This migration of pods onto the latest nodes is semi-manual and usually takes 1-2 weeks to fully complete before we let Terraform destroy the old node pools.
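For comparison, the same rotation done by hand with the Azure CLI and kubectl looks roughly like the sketch below. Resource group, cluster, pool names and versions are placeholders, and the Terraform flow above automates the create/label part:

```shell
# Create a replacement pool on the target version.
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name elasticnew \
  --node-count 3 \
  --kubernetes-version 1.24.9 \
  --labels nodepool.our.org/name=elastic nodepool.our.org/latest=true

# Stop scheduling onto the old pool, then drain its nodes.
kubectl cordon -l agentpool=elasticold
kubectl drain -l agentpool=elasticold --ignore-daemonsets --delete-emptydir-data

# Once the workloads have settled on the new pool, remove the old one.
az aks nodepool delete --resource-group my-rg --cluster-name my-aks --name elasticold
```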
> "In progress" means in development. In my opinion it's not deployed yet, so that should not be the cause of your issue. You should create an issue; they may have control plane logs explaining why an upgrade does not start.
Respectfully, your opinion is incorrect. The manual upgrade (CLI and Azure Portal) errors out immediately when you have a PDB with maxUnavailable 0, and Azure diagnostics indicates that the PDB is the issue.
If the PDB is removed, the upgrade starts without errors and completes. This indicates that there already are pre-upgrade checks, even though this issue has the state 'in progress' on this mostly ignored and rarely updated GitHub repo.
In my experience, you can enable the upgrade channel on a cluster based on VMAS (availability sets), but then it doesn't run reliably and support tells you it's not supported. It's even more than a "soft" pre-validation, as it should already be prevented when you try to enable the upgrade channel via the API.
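For reference, this is about the cluster auto-upgrade channel, which is enabled roughly like this (resource names are placeholders); the point above is that the API currently accepts this on VMAS-backed clusters even though it then misbehaves:

```shell
# Enable the cluster auto-upgrade channel (none/patch/stable/rapid/node-image).
az aks update --resource-group my-rg --name my-aks --auto-upgrade-channel stable
```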
This currently affects effectively all AKS clusters running workloads in multiple node pools: a node image upgrade is blocked when some pod in a totally different, unrelated node pool has max disruptions allowed set to 0.
Just to clarify: the meaning of pre-check / validations here is something outside the upgrade operation - a staging operation or "dry run" that surfaces all the checks / audits for possible errors. It is meant to provide a simulation and snapshot before running the upgrade and watching it fail there.
How would you like to run these pre-upgrade checks prior to the upgrade? Is there a preference for Portal, CLI or API?
cli
prometheus metrics available in the cluster
@desek @svonliebenstein and others in this thread - starting from the 2023-07-01 stable API and the 2023-07-02-preview API, we added a forceUpgrade option which can also bypass the frontend validation on PDBs.
Please note that this only forces the validation to pass; the upgrade could still fail in the backend when trying to drain a node due to a PDB.
| Name | Type | Description |
| --- | --- | --- |
| forceUpgrade | boolean | Whether to force upgrade the cluster. Note that this option instructs the upgrade operation to bypass upgrade protections such as checking for deprecated API usage. Enable this option only with caution. |
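For CLI users, one way to set this is sketched below. The flags come from newer Azure CLI / aks-preview releases, so treat the exact names as something to confirm with `az aks update --help`; resource names, the timestamp and the version are placeholders:

```shell
# Allow upgrades to bypass the frontend protections until the given time.
az aks update \
  --resource-group my-rg \
  --name my-aks \
  --enable-force-upgrade \
  --upgrade-override-until 2023-10-01T13:00:00Z

# Then run the upgrade as usual; it can still fail later if a node drain is
# genuinely blocked by a PDB.
az aks upgrade --resource-group my-rg --name my-aks --kubernetes-version 1.27.3
```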
This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.
This issue will now be closed because it hasn't had any activity for 7 days after going stale. kaarthis, feel free to comment again in the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.
Pre-check / pre-validations of possible problems before running an upgrade, e.g. PDB issues, IP exhaustion, quota, SP expiry issues, etc. This is not on the operation itself but meant to be done outside of it.
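Until such a pre-check exists natively, part of that list can be approximated out-of-band with a small script like the sketch below. Resource names and the region are placeholders, and SP expiry and subnet IP headroom checks are left out for brevity:

```shell
#!/usr/bin/env bash
# Rough pre-flight checks before an AKS upgrade; illustrative only.
set -euo pipefail

RG=my-rg
CLUSTER=my-aks
LOCATION=westeurope

# 1. Which versions can the cluster upgrade to?
az aks get-upgrades --resource-group "$RG" --name "$CLUSTER" --output table

# 2. PDBs that currently allow zero disruptions - these will block node drains.
#    (Column position based on current kubectl table output; adjust if it changes.)
kubectl get pdb --all-namespaces | awk 'NR == 1 || $(NF-1) == 0'

# 3. Regional compute quota headroom for the surge nodes created during the upgrade.
az vm list-usage --location "$LOCATION" --output table | grep -i vcpu
```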