👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.
here is the output of the cluster-provision.log
++ date
++ hostname
Creating config file /etc/idmapd.conf with new version
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
Creating config file /etc/default/nfs-common with new version
Adding system user `statd' (UID 113) ...
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
Adding new user `statd' (UID 113) with group `nogroup' ...
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
Not creating home directory `/var/lib/nfs'.
nfs-utils.service is a disabled or a static unit, not starting it.
Setting up zip (3.0-11) ...
Setting up cgroup-lite (1.11) ...
Setting up libipset3:amd64 (6.29-1) ...
Setting up ipset (6.29-1) ...
Setting up socat (1.7.3.1-1) ...
Setting up libnss3-nssdb (2:3.28.4-0ubuntu0.16.04.6) ...
Setting up libnss3:amd64 (2:3.28.4-0ubuntu0.16.04.6) ...
Setting up librados2 (10.2.11-0ubuntu0.16.04.2) ...
Setting up librbd1 (10.2.11-0ubuntu0.16.04.2) ...
Setting up libcephfs1 (10.2.11-0ubuntu0.16.04.2) ...
Setting up python-cephfs (10.2.11-0ubuntu0.16.04.2) ...
Setting up python-rados (10.2.11-0ubuntu0.16.04.2) ...
Setting up python-rbd (10.2.11-0ubuntu0.16.04.2) ...
Setting up libradosstriper1 (10.2.11-0ubuntu0.16.04.2) ...
Setting up librgw2 (10.2.11-0ubuntu0.16.04.2) ...
Setting up ceph-common (10.2.11-0ubuntu0.16.04.2) ...
[/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
Processing triggers for libc-bin (2.23-0ubuntu10) ...
Processing triggers for dbus (1.10.6-1ubuntu3.3) ...
Processing triggers for ureadahead (0.100.0-19) ...
Processing triggers for ca-certificates (20170717~16.04.2) ...
Updating certificates in /etc/ssl/certs...
0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.
Processing triggers for systemd (229-4ubuntu21.4) ...
Bridge chain: PREROUTING, entries: 0, policy: ACCEPT
Bridge chain: OUTPUT, entries: 0, policy: ACCEPT
Bridge chain: POSTROUTING, entries: 0, policy: ACCEPT
@vijaygos are you sure these are logs from an instance with a failure? These logs look like a successful run to me. (Custom script finished successfully.)
@CecileRobertMichon , I can see the VMSS status as "Failed". However, I am not sure how to determine which is the "failed instance". Is there any way to find the failed instance from the portal?
@vijaygos I don't think the portal shows which instance failed unfortunately. We have an issue open to improve extension logs to print the hostname but in the meantime there's no easy way to get the instance ID that I know of. See https://github.com/Azure/aks-engine/issues/1496. If you are able to repro the issue with scaling a few instances at a time that might be the easiest way to know which instance to get the logs from.
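For reference, one way to check per-instance provisioning and extension state from the CLI is sketched below (resource group, VMSS name, and instance ID are placeholders, and the query fields are my assumption about the shape of the az output):
# List per-instance provisioning state for the scale set
az vmss list-instances -g <RESOURCE_GROUP> --name <VMSS_NAME> \
  --query "[].{id:instanceId, state:provisioningState}" -o table
# Inspect the extension statuses reported by a single instance
az vmss get-instance-view -g <RESOURCE_GROUP> --name <VMSS_NAME> --instance-id <ID> \
  --query "extensions[].{name:name, status:statuses[].displayStatus}"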
Apparently you can also get to the extension status in the portal from the instance status:
Ah! Thanks @CecileRobertMichon for pointing me to that. This is rather strange. I have looked at the status of all 31 VMs in our cluster and they all show "ProvisioningState/succeeded". However, the VMSS Overview page shows a Failed status.
We are experiencing the exact same thing. Any updates on what could be wrong?
Also experiencing the same issue after updating ServicePrincipal credentials: 'Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.'
Also getting this issue today, though there have been no changes from the cluster perspective. We were getting a rate limit issue earlier.
Is there a common pattern in terms of cluster operations? Are these new cluster buildouts? Or a result of scale events on existing clusters?
Answering for @sylus - it's an existing cluster. We haven't made changes to the scale set ourselves - the last scale operation was last week. We found the issue when we had teams reporting that pods with disks weren't coming up.
It now seems that it's unable to mount the disks because it can't unmount them from the old nodes.
We seemed to hit a subscription write limit earlier today - though I'm not sure if that's related to this issue or not (if it was retrying too often).
We experience the same issue. Existing cluster, no changes regarding scale sets, but pods can't mount data disks and keep hanging in 'Init'.
Same here; it worked fine last evening, and today no VM is able to attach any disks. The VMSS shows the "Provisioning of VM extension 'vmssCSE' has timed out." failure state (the AKS cluster is in westeurope).
We filed an Azure support ticket and are supposed to get a call back this morning; will post back any info.
I was able to restart our VMSS nodes gracefully by running the following az cli command. Previously, restarting the VMSS node using the GUI also resulted in "Provisioning of VM extension 'vmssCSE' has timed out."
Command:
az vmss update-instances -g {RESOUCE_GROUP} --name {VMSS_AGENTPOOL} --instance-id {INSTANCE_ID}
The cluster status then resolved; all hung disks freed themselves and reattached without an issue.
Unfortunately, in my case, the outcome of the above wasn't any different from the GUI-triggered process:
$ az vmss update-instances -g {group} --name {vmss} --instance-id {id}
Deployment failed. Correlation ID: {id}. Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.
We have the same issue, except it affects both the vmssCSE and RunCommandLinux extensions. We didn't enable or disable any extensions. I tried to run the update command, but it never completes successfully, failing with this:
cli.azure.cli.core.util : Deployment failed. Correlation ID: abeff543-0a78-4c26-aa0f-dc2d01b71222. Provisioning of VM extension 'RunCommandLinux' has timed out. Extension installation may be taking too long, or extension status could not be obtained.
Deployment failed. Correlation ID: abeff543-0a78-4c26-aa0f-dc2d01b71222. Provisioning of VM extension 'RunCommandLinux' has timed out. Extension installation may be taking too long, or extension status could not be obtained.
I tried to reinstall the extension with force and it also failed. This is a critical bug and is affecting our production. How has this not been addressed? We don't need this extension. Nobody needs to run Linux commands via a CLI extension, and this should not be there by default. It's bad design, period, to force this on people.
How many folks on this thread can correlate these failures with active Azure API rate limiting events?
@rsingh612, were you able to determine operational differences between your original failure events and your successful (if manual) az vmss update-instances remediation?
@jackfrancis we saw this:
failed with failed to get azure instance id for node "aks-e16sv3-79416110-vmss00000w" (compute.VirtualMachineScaleSetVMsClient#Get: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code="OperationNotAllowed" Message="The server rejected the request because too many requests have been received for this subscription." Details=[{"code":"TooManyRequests","message":"{\"operationGroup\":\"GetVMScaleSetVM30Min\",\"startTime\":\"2019-10-18T08:32:36.6723441+00:00\",\"endTime\":\"2019-10-18T08:47:36.6723441+00:00\",\"allowedRequestCount\":2500,\"measuredRequestCount\":2843}","target":"GetVMScaleSetVM30Min"}] InnerError={"internalErrorCode":"TooManyRequestsReceived"})
I assumed what happened was that a VM extension auto-updated, leading to the VMSS entering a failed state. While in a failed state, disk attach/detach operations fail but keep retrying. My hypothesis was that, because of the failed state, the disk retries triggered the API limit, but maybe that is an incorrect hypothesis.
@feiskyer @khenidak will cloudprovider backoff v2 help recover from these events?
We are now getting this on a completely different cluster that has been up for months, after resolving the issue in another cluster in a separate subscription. Are we 100% sure there was no rollout change? It seems suspicious with all the people here having the same problem, and over in:
https://github.com/Azure/AKS/issues/1278
Obviously something must have changed; is Microsoft looking into this?
It is almost as if all the extension statuses for the VMSS were lost all of a sudden.
Failed to upgrade virtual machine instance 'k8s-linuxpool1-13820493-vmss_1'. Error: Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.
On the instance itself however it shows this:
vmssCSE (Microsoft.Azure.Extensions.CustomScript, 2.0.7)
Provisioning succeeded
Info
ProvisioningState/succeeded
Can't even enable boot diagnostics, going to try to slowly replace each instance.
We were able to add a node and that process seems to have fixed the scaleset at least for now. ^_^
We are actively working on a resolution for this issue.
Adding a node didn't work for us sadly.
@devigned Any idea on when you might expect a resolution to this and whether or not it will involve upgrading or migrating to a new aks / nodepool?
It seems like most of these issue are related to Azure API request limits being exceeded. It would be extremely helpful if everyone experiencing this issue could ping back with the following:
Thank you all. We will do our best to root cause this and get a fix out as soon as possible.
Number of nodes in your cluster? 30
Number of disks attached to those nodes? Varies, but we have 131 PVCs with 5 or so per node.
Is cluster autoscaler enabled? No, it is disabled.
What operation did you run when this issue was triggered? We didn't do anything; the cluster just stopped mounting PVCs and reported an API limit. We then checked and saw the VMSS was in a failed state due to the VM extensions RunCommandLinux and vmssCSE.
What version of aks-engine? aksEngineVersion: v0.39.2-aks
What version of K8s? 1.14.6
@sharkymcdongles give this a run please:
kubectl logs {{ your master named like: kube-controller-manager-k8s-master-XXXXXXX-0 }} \
-n kube-system | grep azure
Please report back any interesting rate limiting logs you find.
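To narrow that output down to throttling events specifically, a grep along these lines may help (a sketch; the exact log wording varies by Kubernetes version):
kubectl logs {{ your master named like: kube-controller-manager-k8s-master-XXXXXXX-0 }} \
  -n kube-system | grep -iE "429|TooManyRequests|throttl"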
@devigned this is AKS, I don't have access to the masters.
this is the affected vmss in azure:
/subscriptions/63ffcd66-455e-4369-87e6-a4a55983f44e/resourceGroups/MC_prod-eu-mip-aks-rg_euprod_westeurope/providers/Microsoft.Compute/virtualMachineScaleSets/aks-e16sv3-79416110-vmss
and aks:
/subscriptions/63ffcd66-455e-4369-87e6-a4a55983f44e/resourcegroups/prod-eu-mip-aks-rg/providers/Microsoft.ContainerService/managedClusters/euprod
@sharkymcdongles for AKS cluster issues, please open a support ticket and/or open an issue at https://github.com/Azure/AKS/issues.
There is an issue there already, but this is the only place with engineers replying, lol. Also, AKS still uses aks-engine under the hood (hence the name), so this bug is relevant.
One workaround is to manually update the VMSS instance that has the issue:
az vmss update-instances -g <RESOURCE_GROUP_NAME> --name <VMSS_NAME> --instance-id <ID(number)>
Let me know whether that works. I have seen some customers use this approach to work around the issue.
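If many instances are affected, a small loop over the instance IDs can save some clicking. A minimal sketch, assuming bash and the az CLI (resource group and VMSS name are placeholders):
RG=<RESOURCE_GROUP_NAME>
VMSS=<VMSS_NAME>
# Re-run update-instances against every instance in the scale set
for id in $(az vmss list-instances -g "$RG" --name "$VMSS" --query "[].instanceId" -o tsv); do
  echo "updating instance $id"
  az vmss update-instances -g "$RG" --name "$VMSS" --instance-id "$id"
done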
@andyzhangx that was the first thing I tried. It fails with an internal execution error. I then tried deallocating one of the bad nodes but now it won't start with an internal execution error as well.
@jackfrancis
Operational differences:
After failure (service principal update):
After executing upgrade:
Please detach the disk explicitly; it should work:
az vmss disk detach -g MC_kube_xyz_abc --name aks-d8s-27745123-vmss --instance-id 0 --lun 0
where --instance-id 0 refers to VM 0 (it can be 0, 1, 2, ...) and --name aks-d8s-27745123-vmss is the VMSS name.
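If you are not sure which LUN to pass, the data disks attached to each instance can be listed first. A sketch, assuming az vmss list-instances exposes the storageProfile of each instance as shown (the query is illustrative):
az vmss list-instances -g MC_kube_xyz_abc --name aks-d8s-27745123-vmss \
  --query "[].{id:instanceId, disks:storageProfile.dataDisks[].{lun:lun, name:name}}" -o json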
@abarqawi nice idea, but in my case it still fails:
Deployment failed. Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.
However, after a while (hours) the disk might get detached/attached successfully.
Joining the club. Same issues with both the vmssCSE extension and stuck pods waiting for storage.
Same here, problem with vmssCSE timing out.
Same problem here.
The workaround above helped in our case (run update-instances for each instance, and run reimage for instances where update-instances returns an error).
However, after the procedure, the Azure portal still shows the VMSS as "Failed", with the error message saying "Provisioning of VM extension 'vmssCSE' has timed out."
And now our third cluster with VMSS is doing this, in a different subscription, after this happened to two other clusters. Definitely starting to get kind of frustrating, lol.
Can we get some sort of statement from Azure about what is going on? I think this is happening to enough people to warrant some sort of explanation. I am super concerned about this.
Obviously I know the Azure folks are doing their best, just a bit concerned :)
Is anyone on this thread suffering these cluster symptoms where that cluster is the only cluster in a particular subscription? That would make reproducing the error state easier, thus accelerating root cause diagnosis.
@sylus, would you be up for jumping on a call and having a shared debugging session?
@devigned we have gotten our clusters back in a working state for now but super happy to have a chat on the problems we were having and provide any information that will assist you.
We are free from 2-4:30 EDT, or can find a time tomorrow if need :D
Thanks a bunch!
@andreygolev what version of Kubernetes is the cluster (or clusters) running that are suffering rate throttling?
Mine is 1.14.6
Same version here.
@sharkymcdongles If you're still in a sticky spot, you may try the following to gain cloudprovider improvements w/ respect to Azure API calls:
For each master node:
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
Find the line image: k8s.gcr.io/hyperkube-amd64:v1.14.6 and change it to image: k8s.gcr.io/hyperkube-amd64:v1.14.8
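The same edit can be scripted on each master, for example with sed (a sketch; back up the manifest first, and the kubelet should recreate the static pod once the file changes):
# On each master node, as root:
cp /etc/kubernetes/manifests/kube-controller-manager.yaml /root/kube-controller-manager.yaml.bak
sed -i 's|hyperkube-amd64:v1.14.6|hyperkube-amd64:v1.14.8|' \
  /etc/kubernetes/manifests/kube-controller-manager.yaml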
There are several optimizations to VMSS-based Kubernetes resource reconciliations that are in 1.13.12, 1.14.8, 1.15.5, and 1.16.2. Those versions were only just added in an aks-engine release today:
https://github.com/Azure/aks-engine/releases/tag/v0.42.1
For folks who have a working aks-engine upgrade workflow in place, we recommend getting those versions via v0.42.1 of aks-engine. For folks managing clusters in active throttling situations, we don't recommend executing an aks-engine upgrade, as that operation will itself require a non-throttled Azure API subscription in order to succeed. :(
An additional step-wise remediation is to apply a newer version of "exponential backoff" to the cloudprovider config by adding this line to /etc/kubernetes/azure.json (again, on all master VMs):
$ sudo cat /etc/kubernetes/azure.json | grep cloudProviderBackoffMode
"cloudProviderBackoffMode": "v2",
(The above output is from a test cluster running that config.)
The above configuration optimizes the number of calls that are stacked up and continually sent to the Azure APIs in HTTP 429 situations, helping most scenarios to back off more effectively, in order to make Azure API throttling symptoms go away sooner.
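One way to add that key is sketched below, assuming jq is installed on the master; components that read azure.json (kubelet, controller-manager) need to pick up the change, so a kubelet restart and/or a restart of the controller-manager static pod may also be required:
# On each master node:
sudo cp /etc/kubernetes/azure.json /etc/kubernetes/azure.json.bak
sudo cat /etc/kubernetes/azure.json.bak \
  | jq '. + {"cloudProviderBackoffMode": "v2"}' \
  | sudo tee /etc/kubernetes/azure.json > /dev/null
sudo systemctl restart kubelet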
@sylus glad to hear your cluster is back in a working state! Thank you for the kind words and offer for additional info. We think we have narrowed down the issue and we are hoping https://github.com/Azure/aks-engine/issues/1860#issuecomment-544694755 will reduce the API throttling we think is behind these issues.
Furthermore, we have some serious API reductions upstream in https://github.com/kubernetes/cloud-provider-azure. As soon as these improvements stabilize, we'll integrate them into a future aks-engine release.
If you get the vmssCSE timeout issue even after the VMSS update, you can reimage the VMSS instance to mitigate. Please make sure no pods are running on that instance before the reimage (run kubectl cordon <agent-node-name> before reimaging). A CLI sketch of this is included after the portal dialog text below.
Reimage virtual machine instance(s)
This action will reimage the selected instances, restoring them to their initial state. The instances will be restarted, and any local data will be lost. Do you want to reimage the selected instances?
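As a CLI sketch of that workaround (assuming az and kubectl; az vmss reimage should correspond to the portal's Reimage action, though flag names can differ by CLI version, and the names below are placeholders):
# Keep new pods off the node first
kubectl cordon <agent-node-name>
# Reimage the matching VMSS instance (local data on the instance is lost)
az vmss reimage -g <RESOURCE_GROUP> --name <VMSS_NAME> --instance-id <ID>
# Once the node re-registers and looks healthy
kubectl uncordon <agent-node-name>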
Regarding the vmssCSE timeout issue after vmss update: beginning with aks-engine v0.40.0, cluster-provision cannot run twice (related PR: https://github.com/Azure/aks-engine/pull/1719). We should fix this in aks-engine so that vmssCSE can run multiple times after the node is set up; otherwise we can no longer use az vm/vmss update to update VMAS/VMSS instances. @jackfrancis cc @yangl900
What happened: VMSS status is set to 'failed' with error message - "Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained."
As a result of this, the SLB does not allow binding a service (type LoadBalancer) to a public IP resource. The service status is always Pending:
What you expected to happen: No CSE errors; the service should bind to the given public IP resource with no errors.
How to reproduce it (as minimally and precisely as possible): No real steps. Happens at random when the cluster attempts to scale and add a new VM.
Anything else we need to know?: AKS engine version is 0.28.1
Environment:
Kubernetes version (kubectl version): 1.12.2
While I am tempted to say this looks like a duplicate of #802, I would appreciate another look.