Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] VPA Preview Features disappear after Cluster Upgrade #3424

Closed pinkfloydx33 closed 1 year ago

pinkfloydx33 commented 1 year ago

Describe the bug A couple months ago I registered the VPA preview feature on our subscription and enabled the VPA on four clusters. Everything worked fine until recently when we performed some patch upgrades on our clusters. At this time, the VPAs were automatically uninstalled.

I tried reinstalling the add-on via the Azure CLI with az aks update --enable-vpa. The operation completes successfully; however, the VPA CRDs and controllers are not installed, despite the returned JSON indicating the add-on is now enabled.

To Reproduce az aks update -g resourcegroup -n clustername --enable-vpa

Expected behavior The VerticalPodAutoscaler CRDs and the three controllers in the kube-system namespace are successfully installed.

Environment (please complete the following information):

Additional context The operation takes varying degrees of time... sometimes as "short" as 8 minutes, other times as long as 30+. During that time, no pods are added to the cluster nor do any new Helm releases indicate they are attempting to install. The operation finally completes "successfully" and returns this JSON (snipped to relevant portions):

{
  // .. snip
  "workloadAutoScalerProfile": {
    "keda": null,
    "verticalPodAutoscaler": {
      "controlledValues": "RequestsAndLimits",
      "enabled": true,
      "updateMode": "Off"
    }
  }
}
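
The mismatch described above (the API reports the add-on as enabled while nothing is installed) can be checked mechanically. The following is a minimal sketch, not an official diagnostic: the function names are hypothetical, the query path is taken from the JSON shown above, and the CRD name verticalpodautoscalers.autoscaling.k8s.io is the upstream VPA CRD, which is assumed to be what the add-on installs.

```shell
#!/bin/sh
# Sketch: compare what the AKS API reports with what exists on the cluster.
# Assumes az is logged in and kubectl points at the cluster in question.

vpa_reported_enabled() {
  # Usage: vpa_reported_enabled <resource-group> <cluster-name>
  # Prints the "enabled" flag from the profile, lowercased.
  az aks show -g "$1" -n "$2" \
    --query "workloadAutoScalerProfile.verticalPodAutoscaler.enabled" -o tsv |
    tr '[:upper:]' '[:lower:]'
}

vpa_crd_present() {
  # Succeeds only if the upstream VPA CRD exists (assumed CRD name).
  kubectl get crd verticalpodautoscalers.autoscaling.k8s.io >/dev/null 2>&1
}

check_vpa_drift() {
  # Usage: check_vpa_drift <resource-group> <cluster-name>
  local reported
  reported="$(vpa_reported_enabled "$1" "$2")"
  if [ "$reported" = "true" ] && ! vpa_crd_present; then
    echo "MISMATCH: API reports VPA enabled, but the CRD is missing"
    return 1
  fi
  echo "OK: reported=${reported:-unknown}"
}
```

Calling check_vpa_drift resourcegroup clustername after the update finishes would flag the situation in this report, where the returned profile says enabled but the CRDs never appear.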

I have tried the following:

Notes: I have tried unregistering and re-registering the preview feature itself. A couple things struck me as odd during this process. First of all, while the preview feature shows up via the CLI it no longer appears in the Azure Portal (subscriptions->preview features). Second, re-registering the feature happens instantly whereas it originally took several minutes. (Unregistering the feature does take a few minutes).

I tried reporting this to our enterprise Azure Support but they gave me the brush off since it's a preview feature. While I understand you likely can't help me resolve the issue directly, I figured you'd want the bug report as well as any other details. Please let me know what I can do to help debug / trace the problem and I'll provide all the details I can.

giesslg commented 1 year ago

Same issue here; we are using the KEDA (preview) add-on. It suddenly disappeared, and we have not been able to get it working again.

pinkfloydx33 commented 1 year ago

IMO this is rather unfortunate. While I realize it's a preview feature and use-at-your-own-risk, I don't think it should just magically disappear of its own accord. Debugging seems pretty much impossible, as the Azure CLI and REST APIs all return success status codes and indicate the feature is installed when in actuality it isn't.

If the answer is "hey sorry, we're dropping this" then that's cool, but I think the docs should indicate it's no longer a feature. Otherwise it would be great to at least get a token "we're looking into it" which would be light years better than the response from Azure support... who wouldn't even assist in reporting the bug in the feature (after all, isn't that in part what a preview is for?).

If there were a pre-packaged/easily installed version of the VPA we'd not even bother with the add-on (which was the main appeal)

pinkfloydx33 commented 1 year ago

So... I had previously checked all resources to see if there were any "vpa" related items (Secrets, ConfigMaps, Deployments) lingering in my cluster. After looking at the vpa-down.sh script, I realized that I hadn't checked for the mutating webhook configuration, which to my surprise still existed. I deleted it manually and then re-ran az aks update --enable-vpa, and the VPA controllers are now installed. However, I don't know if this is going to survive another cluster upgrade, so I'm holding off on calling this "solved".

@giesslg for your case you might wanna look for any lingering keda-related items, particularly cluster-scoped resources (for example the CRDs themselves) and see if cleaning them up helps.
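
A search for lingering cluster-scoped items like the one described above could look roughly like this. It is a sketch only: find_leftovers is a hypothetical helper, and the three resource kinds it scans (mutating webhooks, validating webhooks, CRDs) are a guess at where leftovers are most likely to hide, not an exhaustive list.

```shell
#!/bin/sh
# Sketch: list cluster-scoped objects whose name matches a pattern.
# Assumes kubectl points at the affected cluster.

find_leftovers() {
  # Usage: find_leftovers <extended-regex>
  # Prints matching objects as kind/name; returns non-zero if none match.
  kubectl get \
    mutatingwebhookconfigurations,validatingwebhookconfigurations,customresourcedefinitions \
    -o name 2>/dev/null | grep -i -E -- "$1"
}
```

For the VPA case, find_leftovers 'vpa|verticalpodautoscaler' covers both the webhook and the CRD naming; for the KEDA add-on, find_leftovers 'keda' would be the equivalent starting point.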

robbiezhang commented 1 year ago

@pinkfloydx33, thank you for reporting the issue. do you still have the repro? If yes, can you provide your cluster info, and provide the result of helm -n kube-system ls command?

pinkfloydx33 commented 1 year ago

I resolved the immediate problem across all clusters by removing the leftover mutating webhook configuration. So no, I can't repro at this time. I assume, however, that the next time we do a patch upgrade everything is going to disappear again, which is why I didn't close the ticket. Unfortunately all my clusters are on the latest version, so I'd have to wait until AKS publishes another.

That said, if you want I can create a new cluster using an older AKS version, install the vpa add-on and then upgrade the cluster. I assume that should reproduce (at least on my sub) since that's what happened the four other times. I can get it set up tomorrow.

huizhifan commented 1 year ago

I tried these steps to repro but the vpa was good.

  1. Create a 1.24.6 cluster.
  2. run az aks update -g {group_name} -n {cluster_name} --enable-vpa
  3. Check the vpa components' existence
  4. run az aks upgrade -g {group_name} -n {cluster_name} -k 1.25.4
  5. Check the vpa components' existence
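
The five steps above can be strung together as a script. This is a rough sketch under stated assumptions: the function names are hypothetical, the versions 1.24.6 and 1.25.4 are the ones quoted in the steps and may no longer be offered, --enable-vpa is assumed to be available in the installed CLI, and "checking the components" is approximated by listing kube-system deployments with "vpa" in the name.

```shell
#!/bin/sh
# Sketch of the repro: create, enable VPA, check, upgrade, check again.
# Assumes az is logged in and the VPA preview feature is registered.

list_vpa_components() {
  # Prints VPA-looking deployments in kube-system, or a placeholder.
  kubectl get deployments -n kube-system -o name | grep -i vpa ||
    echo "(none found)"
}

repro_vpa_upgrade() {
  # Usage: repro_vpa_upgrade <resource-group> <cluster-name>
  local rg="$1" name="$2"
  az aks create -g "$rg" -n "$name" --kubernetes-version 1.24.6 \
    --generate-ssh-keys || return 1
  az aks update -g "$rg" -n "$name" --enable-vpa || return 1
  az aks get-credentials -g "$rg" -n "$name" --overwrite-existing || return 1
  echo "before upgrade:"
  list_vpa_components
  az aks upgrade -g "$rg" -n "$name" -k 1.25.4 --yes || return 1
  echo "after upgrade:"
  list_vpa_components
}
```

If the components are listed before the upgrade and "(none found)" is printed after it, the disappearance has been reproduced.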

Can you still provide your cluster info?

pinkfloydx33 commented 1 year ago

I am creating a new cluster now from mobile and will follow the same steps you just outlined (which is what I was proposing) when I'm back at my desk (tonight/tomorrow) to verify whether or not it happens again.

I can provide info, but what exactly do you want? Subscription Id and cluster name? Or something else?

huizhifan commented 1 year ago

Subscription ID and cluster name should be good enough for us to investigate the logs. Thanks!

pinkfloydx33 commented 1 year ago

Can I provide that via email? I'd rather not post that publicly.

huizhifan commented 1 year ago

zhifanhui@microsoft.com

huizhifan commented 1 year ago

After further investigation, we are able to reproduce the issue.

  1. The VPA goes missing after an upgrade because our current implementation doesn't allow a partial PUT. The upgrade request doesn't include the VPA-related fields, which causes the VPA object to be removed. We are working on a fix for this. In the future, the VPA will only be removed if the request contains the VPA object with enabled set to false.
  2. We also found the root cause of the enable-vpa command not working. In the early release, the vpa-webhook was generated by the operator (the admission-controller container). Later we decided to move the webhook generation to our side. VPA components deployed by AKS go through the helm controller and carry the related annotation. However, if you enabled the VPA before our new release, the webhook already exists in your cluster but doesn't have the annotation. Our current configuration can't overwrite the vpa-webhook, so the VPA deployment is terminated. Thanks for bringing this to our attention. We are discussing the fix for this issue.

The current recommended action is to remove the MutatingWebhookConfiguration vpa-webhook manually. The VPA components should come back soon afterwards. Steps:

  1. kubectl get mutatingwebhookconfiguration
  2. check the existence of vpa-webhook
  3. kubectl delete mutatingwebhookconfiguration vpa-webhook
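
The three steps above can be wrapped so the delete only runs when the webhook is actually present. A minimal sketch; remove_stale_vpa_webhook is a hypothetical helper name, and it assumes kubectl is pointed at the affected cluster.

```shell
#!/bin/sh
# Sketch: delete the leftover vpa-webhook only if it exists.

remove_stale_vpa_webhook() {
  if kubectl get mutatingwebhookconfiguration vpa-webhook >/dev/null 2>&1; then
    kubectl delete mutatingwebhookconfiguration vpa-webhook
  else
    echo "vpa-webhook not found; nothing to delete"
  fi
}
```

Running remove_stale_vpa_webhook twice is harmless: the second call simply reports that there is nothing left to delete.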

ghost commented 1 year ago

Action required from @Azure/aks-pm

justindavies commented 1 year ago

@robbiezhang is this resolved now?

huizhifan commented 1 year ago

@justindavies It has been resolved.