Closed pinkfloydx33 closed 1 year ago
Same issue here, we are using the keda (preview) add-on. It suddenly disappeared, we have not been able to get it working again.
IMO this is rather unfortunate. While I realize it's a preview feature and use-at-your-own-risk, I don't think it should just magically disappear of its own accord. Debugging seems pretty impossible, as the Azure CLI and REST APIs all return success status codes and indicate the feature is installed when in actuality it isn't.
If the answer is "hey sorry, we're dropping this" then that's cool, but I think the docs should indicate it's no longer a feature. Otherwise it would be great to at least get a token "we're looking into it" which would be light years better than the response from Azure support... who wouldn't even assist in reporting the bug in the feature (after all, isn't that in part what a preview is for?).
If there were a pre-packaged/easily installed version of the VPA, we'd not even bother with the add-on (which was the main appeal).
So... I had previously checked all resources to see if there were any "vpa"-related items (Secrets, ConfigMaps, Deployments) lingering in my cluster. After looking at the `vpa-down.sh` script, I realized that I hadn't checked for the mutating webhook configuration, which to my surprise still existed. I deleted it manually, re-ran `az aks update --enable-vpa`, and the VPA controllers are now installed. However, I don't know if this is going to survive another cluster upgrade, so I'm holding off on calling this "solved".
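For anyone else hitting this, the workaround above can be sketched as a small script. The resource group and cluster names are placeholders, and the `RUN_VPA_FIX` opt-in guard is my addition so the sketch never touches a cluster unless explicitly asked to:

```shell
# Sketch of the workaround described above. my-resource-group/my-cluster
# are placeholder names; substitute your own. Requires az + kubectl.
cleanup_and_reenable_vpa() {
  rg="$1"; cluster="$2"
  # Remove the orphaned webhook left behind by the old install
  kubectl delete mutatingwebhookconfiguration vpa-webhook
  # Re-enable the add-on so AKS reinstalls the VPA components cleanly
  az aks update -g "$rg" -n "$cluster" --enable-vpa
}

# Opt-in guard (my addition): only run against a real cluster on request.
if [ "${RUN_VPA_FIX:-0}" = "1" ]; then
  cleanup_and_reenable_vpa my-resource-group my-cluster
fi
```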
@giesslg for your case you might wanna look for any lingering keda-related items, particularly cluster-scoped resources (for example the CRDs themselves) and see if cleaning them up helps.
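A hedged sketch of that cleanup hunt — the `keda|vpa` name patterns are my guess at what the leftovers would be called, and the `RUN_CLUSTER_CHECKS` guard is mine so the snippet is inert without a cluster:

```shell
# Helper to spot leftover cluster-scoped resources from a previous
# KEDA/VPA add-on install; the name patterns are assumptions.
filter_addon_leftovers() { grep -Ei 'keda|vpa' || true; }

# Opt-in guard (my addition): only query a real cluster on request.
if [ "${RUN_CLUSTER_CHECKS:-0}" = "1" ] && command -v kubectl >/dev/null; then
  kubectl get crd | filter_addon_leftovers
  kubectl get mutatingwebhookconfiguration,validatingwebhookconfiguration | filter_addon_leftovers
  kubectl get clusterrole,clusterrolebinding | filter_addon_leftovers
fi
```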
@pinkfloydx33, thank you for reporting the issue. Do you still have the repro? If yes, can you provide your cluster info along with the output of the `helm -n kube-system ls` command?
I resolved the immediate problem across all clusters by removing the leftover mutating webhook configuration. So no, I can't repro at this time. I assume, however, that the next time we do a patch upgrade everything is going to disappear again, which is why I didn't close the ticket. Unfortunately all my clusters are on the latest version, so I'd have to wait until AKS publishes another.
That said, if you want I can create a new cluster using an older AKS version, install the vpa add-on and then upgrade the cluster. I assume that should reproduce (at least on my sub) since that's what happened the four other times. I can get it set up tomorrow.
I tried these steps to repro, but the VPA was still fine afterwards.
```
az aks update -g {group_name} -n {cluster_name} --enable-vpa
az aks upgrade -g {group_name} -n {cluster_name} -k 1.25.4
```
Can you still provide your cluster info?
I am creating a new cluster now from mobile and will follow the same steps you just outlined (which is what I was proposing) when I'm back at my desk (tonight/tomorrow) to verify whether or not it happens again.
I can provide info, but what exactly do you want? Subscription Id and cluster name? Or something else?
Subscription ID and cluster name should be good enough for us to investigate the logs. Thanks!
Can I provide that via email? I'd rather not post that publicly.
zhifanhui@microsoft.com
After further investigation, we are able to reproduce the scenario where the `--enable-vpa` command does not work, and we also found the root cause. In the early release, the `vpa-webhook` was generated by the operator (the admission-controller container). Later we decided to move the webhook generation to our side: VPA components deployed by AKS go through the helm controller and carry the related annotation. However, if you enabled the VPA before our new release, the webhook already existed in your cluster without that annotation, so the configuration we have now can't overwrite the `vpa-webhook`, and the VPA deployment is then terminated. Thanks for bringing this to our attention; we are discussing a fix for this issue. The current recommended action is to remove the `MutatingWebhookConfiguration` `vpa-webhook` manually. The VPA components should be back soon.
Steps:
```
kubectl get mutatingwebhookconfiguration
# confirm vpa-webhook appears in the output, then:
kubectl delete mutatingwebhookconfiguration vpa-webhook
```
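Before deleting anything, you can check which situation a given cluster is in by inspecting the webhook's Helm ownership metadata. The `app.kubernetes.io/managed-by: Helm` label (alongside the `meta.helm.sh/release-name` annotation) is Helm's standard ownership marker; its absence would suggest the webhook came from the old operator path. The guard variable is my addition:

```shell
# Classify vpa-webhook as Helm-managed (new path) or operator-created
# (old path) based on Helm's standard managed-by label.
check_helm_ownership() {
  read -r managed_by   # reads the managed-by value from stdin
  if [ "$managed_by" = "Helm" ]; then
    echo "helm-managed"
  else
    echo "operator-created"
  fi
}

# Opt-in guard (my addition): only query a real cluster on request.
if [ "${RUN_CLUSTER_CHECKS:-0}" = "1" ] && command -v kubectl >/dev/null; then
  kubectl get mutatingwebhookconfiguration vpa-webhook \
    -o jsonpath='{.metadata.labels.app\.kubernetes\.io/managed-by}' \
    | check_helm_ownership
fi
```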
Action required from @Azure/aks-pm
@robbiezhang is this resolved now?
@justindavies It has been resolved.
**Describe the bug**
A couple of months ago I registered the VPA preview feature on our subscription and enabled the VPA on four clusters. Everything worked fine until recently, when we performed some patch upgrades on our clusters. At that point, the VPAs were automatically uninstalled.

I tried reinstalling the add-on via the Azure CLI with `az aks update --enable-vpa`. The operation completes successfully; however, the VPA CRDs and controller are not installed, despite the returned JSON indicating the add-on is now enabled.

**To Reproduce**
```
az aks update -g resourcegroup -n clustername --enable-vpa
```
**Expected behavior**
The `VerticalPodAutoscaler` CRDs and the three controllers in the `kube-system` namespace are successfully installed.

**Environment (please complete the following information):**
- `aks-preview` 0.5.122

**Additional context**
The operation takes varying amounts of time... sometimes as "short" as 8 minutes, other times as long as 30+. During that time, no pods are added to the cluster, nor do any new Helm releases indicate they are attempting to install. The operation finally completes "successfully" and returns this JSON (snipped to relevant portions):
I have tried the following:
- `--disable-vpa` followed by `--enable-vpa`, multiple times
- updating the cluster with `workloadAutoScalerProfile` or `workloadAutoScalerProfile.verticalPodAutoscaler` set to `null` before another `--enable-vpa` call

Notes: I have tried unregistering and re-registering the preview feature itself. A couple of things struck me as odd during this process. First, while the preview feature shows up via the CLI, it no longer appears in the Azure Portal (Subscriptions -> Preview features). Second, re-registering the feature happens instantly, whereas it originally took several minutes. (Unregistering the feature does take a few minutes.)
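Since the CLI reports success either way, one way to see what actually landed is to compare what ARM believes (`az aks show`) with what is in the cluster. The group/cluster names are placeholders, the CRD name follows the upstream VPA project, and the guard variable is my addition:

```shell
# Summarize whether any VPA deployments appear in `kubectl get deploy` output.
vpa_present() { grep -qi vpa && echo present || echo missing; }

# Opt-in guard (my addition): only query a real cluster on request.
if [ "${RUN_CLUSTER_CHECKS:-0}" = "1" ] && command -v az >/dev/null; then
  az aks show -g my-resource-group -n my-cluster \
    --query workloadAutoScalerProfile.verticalPodAutoscaler   # ARM-side flag
  kubectl get crd verticalpodautoscalers.autoscaling.k8s.io   # CRD present?
  kubectl -n kube-system get deploy | vpa_present             # controllers?
fi
```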
I tried reporting this to our enterprise Azure Support but they gave me the brush off since it's a preview feature. While I understand you likely can't help me resolve the issue directly, I figured you'd want the bug report as well as any other details. Please let me know what I can do to help debug / trace the problem and I'll provide all the details I can.