Open kaarthis opened 2 years ago
This is being closed in favor of "Upgrade operation validations real time, alerts, error messaging and Doc improvements / investments".
What type of issues would you like us to flag before running upgrades? How would you use upgrades, especially auto upgrades, with these pre-validations? Would you like these validations to be automated - if so, how? How useful would these validations be to you if they were inside the upgrade command vs. outside as a preview?
Kaarthi, will this also include showing linter errors as pre-validation and any autohealing actions?
What type of issues would you like us to flag before running upgrades?
a.) Kubernetes resources / API versions that are no longer compatible after the upgrade
b.) Kubernetes resources / API versions that become deprecated after the upgrade
c.) Provide an extension concept, e.g. running a custom k8s task which can return success or failure

How would you use upgrades, especially auto upgrades, with these pre-validations? Would you like these validations to be automated - if so, how?
Pre-validations shall be executed each time, but in the case of auto upgrade there shall be a configurable option NOT to upgrade automatically if a.) is the case, and in any case it shall be possible to send a notification via a dedicated audit log event for a.) and b.)

How useful would these validations be to you if they were inside the upgrade command vs. outside as a preview?
Both shall be offered. The preview might be executed periodically to prompt people to do maintenance, so it never actually happens that an upgrade is blocked because of case a.)
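For checks a.) and b.), a rough approximation that already works from outside the upgrade operation today is to combine the AKS upgrade listing with a deprecated-API scan. The resource group and cluster names below are placeholders, and kubent (kube-no-trouble) is a third-party tool whose flags vary by version - this is only a sketch, not the requested built-in validation:

```shell
# Which Kubernetes versions could this cluster move to?
az aks get-upgrades --resource-group my-rg --name my-aks --output table

# Scan the live cluster for objects served from API versions that are deprecated
# or removed in upcoming Kubernetes releases (community tool, run against current context).
kubent
```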
Is this related to the issues we're experiencing with upgrading node pools, where the operation errors out before the upgrade process even starts because of Pods placed on other node pools that have PDBs that don't allow disruptions?
I.e. in nodepool1 we have a stateful singleton that runs on 1 Pod and has a PDB that doesn't allow disruptions. When trying to upgrade only nodepool2 through the Azure Portal or Azure CLI, we receive an error message that it can't upgrade due to PDBs not allowing disruptions.
In my opinion the PDB check should only be done against the actual node pool in scope for the upgrade, not all PDBs in the cluster.
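For anyone hitting this, a quick way to see which PDBs currently allow zero disruptions and where their pods actually run. Namespace, pod and node names below are illustrative:

```shell
# PDBs and their currently allowed disruptions; anything showing 0 will block a drain.
kubectl get pdb --all-namespaces

# For a suspect pod, find the node it runs on and that node's pool.
kubectl get pod -n my-namespace my-singleton-0 -o wide
kubectl get node aks-nodepool1-12345678-vmss000000 -o jsonpath='{.metadata.labels.agentpool}'
```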
That's sad: no action, and now even stale. Why is there no update, even though product management claims to be working on this?
> Is this related to the issues we're experiencing with upgrading node pools [...]? In my opinion the PDB check should only be done against the actual node pool in scope for the upgrade, not all PDBs in the cluster.
Did you find a solution for this?
We are running into the same kind of issues on our clusters, in our case with Kafka and the Strimzi Drain Cleaner, which needs a PDB with maxUnavailable: 0 (https://github.com/strimzi/drain-cleaner#see-it-in-action).
Not sure when the AKS pre-checks were added, but a few months ago an upgrade would always start without doing any pre-checks on PDBs. Now the upgrade process doesn't even start, and there is no clear status available anywhere about it.
"In progress" means in development. In my opinion it's not deployed yet, so that should not be the cause of your issue. You should create an issue; they may have control plane logs explaining why an upgrade does not start.
We work around this by "upgrading" manually: deploying new node pools with the latest version and draining the old ones. It's a semi-manual process, but in short we:

- look up the latest supported version from `azurerm_kubernetes_service_versions`
- create a new node pool for that version (the AKS node pool name is a hash of a human-readable name and the version number, i.e. `substr(sha256("elastic-1.24.9"),0,10)`)
- label it with `nodepool.our.org/name: elastic`, `nodepool.our.org/version: 1.24.9` and `nodepool.our.org/latest: true`, where `nodepool.our.org/latest` marks the current pool
- move pods from the old nodes to the new nodes

This migration of pods onto the latest nodes is semi-manual and usually takes 1-2 weeks to fully complete before we let Terraform destroy the old node pools.
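For comparison, the same rotation done by hand with the Azure CLI and kubectl looks roughly like the sketch below. Resource group, cluster, pool names and versions are placeholders, and the Terraform flow above automates the create/label part:

```shell
# Create a replacement pool on the target version.
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name elasticnew \
  --node-count 3 \
  --kubernetes-version 1.24.9 \
  --labels nodepool.our.org/name=elastic nodepool.our.org/latest=true

# Stop scheduling onto the old pool, then drain its nodes.
kubectl cordon -l agentpool=elasticold
kubectl drain -l agentpool=elasticold --ignore-daemonsets --delete-emptydir-data

# Once the workloads have settled on the new pool, remove the old one.
az aks nodepool delete --resource-group my-rg --cluster-name my-aks --name elasticold
```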
> "In progress" means in development. In my opinion it's not deployed yet, so that should not be the cause of your issue. You should create an issue; they may have control plane logs explaining why an upgrade does not start.
Respectfully, your opinion is incorrect. The manual upgrade (CLI and Azure Portal) errors out immediately when you have a PDB with maxUnavailable 0, and Azure diagnostics indicates that the PDB is the issue.
If the PDB is removed, the upgrade starts without errors and completes. This indicates that there already are pre-upgrade checks, even though this issue has the state 'in progress' on this mostly ignored and rarely updated GitHub repo.
In my experience, you can enable the upgrade channel on a cluster based on VMAS (availability sets), but then it doesn't run reliably and support tells you it's not supported. It's even more than a "soft" pre-validation, as it should already be prevented when you try to enable the upgrade channel via the API.
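For reference, this is about the cluster auto-upgrade channel, which is enabled roughly like this (resource names are placeholders); the point above is that the API currently accepts this on VMAS-backed clusters even though it then misbehaves:

```shell
# Enable the cluster auto-upgrade channel (none/patch/stable/rapid/node-image).
az aks update --resource-group my-rg --name my-aks --auto-upgrade-channel stable
```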
This currently affects effectively all AKS clusters running workloads in multiple node pools: a node image upgrade is blocked when some pod in a totally different, unrelated node pool has max disruptions allowed set to 0.
Just to clarify: the meaning of pre-check / validations here is something outside the upgrade operation - a staging operation or "dry run" that surfaces all the checks / audits for possible errors. It is meant to provide a simulation and snapshot before running the upgrade and watching it fail there.
How would you like to run these pre-upgrade checks prior to the upgrade? Is there a preference for Portal, CLI or API?
cli
prometheus metrics available in the cluster
@desek @svonliebenstein and others in this thread - starting from the 2023-07-01 stable API and the 2023-07-02-preview API, we added a forceUpgrade option which can also bypass the frontend validation on PDBs.
Please note that this only forces the validation to pass; the upgrade could still fail in the backend when trying to drain a node due to a PDB.
| Name | Type | Description |
| --- | --- | --- |
| forceUpgrade | boolean | Whether to force upgrade the cluster. Note that this option instructs the upgrade operation to bypass upgrade protections such as checking for deprecated API usage. Enable this option only with caution. |
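For CLI users, one way to set this is sketched below. The flags come from newer Azure CLI / aks-preview releases, so treat the exact names as something to confirm with `az aks update --help`; resource names, the timestamp and the version are placeholders:

```shell
# Allow upgrades to bypass the frontend protections until the given time.
az aks update \
  --resource-group my-rg \
  --name my-aks \
  --enable-force-upgrade \
  --upgrade-override-until 2023-10-01T13:00:00Z

# Then run the upgrade as usual; it can still fail later if a node drain is
# genuinely blocked by a PDB.
az aks upgrade --resource-group my-rg --name my-aks --kubernetes-version 1.27.3
```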
This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.
This issue will now be closed because it hasn't had any activity for 7 days after going stale. kaarthis, feel free to comment again in the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.
Pre-check / pre-validations of possible problems before running an upgrade, e.g. PDB issues, IP exhaustion, quota, SP expiry issues, etc. This is not on the operation itself but meant to be done outside of it.
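Until such a pre-check exists natively, part of that list can be approximated out-of-band with a small script like the sketch below. Resource names and the region are placeholders, and SP expiry and subnet IP headroom checks are left out for brevity:

```shell
#!/usr/bin/env bash
# Rough pre-flight checks before an AKS upgrade; illustrative only.
set -euo pipefail

RG=my-rg
CLUSTER=my-aks
LOCATION=westeurope

# 1. Which versions can the cluster upgrade to?
az aks get-upgrades --resource-group "$RG" --name "$CLUSTER" --output table

# 2. PDBs that currently allow zero disruptions - these will block node drains.
#    (Column position based on current kubectl table output; adjust if it changes.)
kubectl get pdb --all-namespaces | awk 'NR == 1 || $(NF-1) == 0'

# 3. Regional compute quota headroom for the surge nodes created during the upgrade.
az vm list-usage --location "$LOCATION" --output table | grep -i vcpu
```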