Azure / aks-engine

AKS Engine: legacy tool for Kubernetes on Azure (see status)
https://github.com/Azure/aks-engine
MIT License

Provisioning of VM extension 'vmssCSE' has timed out #1860

Closed vijaygos closed 4 years ago

vijaygos commented 5 years ago

What happened: VMSS status is set to 'failed' with error message - "Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained."

As a result, the SLB does not allow binding of a service (type LoadBalancer) to a public IP resource. The service status is always Pending:

NAME           TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)                      AGE
my-service-1   ClusterIP      10.0.52.61    <none>        8125/UDP                     57m
my-service-2   LoadBalancer   10.0.165.41   <pending>     80:31330/TCP,443:30354/TCP   46m
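
To see why the external IP stays pending, the events on the service itself are usually the quickest clue (a minimal sketch; the service name is taken from the listing above):

# the Events section at the end of the output shows the cloud provider errors hit while the SLB binding is retried
kubectl describe service my-service-2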

What you expected to happen: No CSE errors, and the service should bind to a given public IP resource with no errors.

How to reproduce it (as minimally and precisely as possible): No real steps. Happens at random when the cluster attempts to scale and add a new VM.

Anything else we need to know?: AKS Engine version is 0.28.1.

Environment:

While I am tempted to say this looks like a duplicate of #802, I would appreciate another look.

welcome[bot] commented 5 years ago

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

vijaygos commented 5 years ago

Here is the output of cluster-provision.log:

++ date ++ hostname

Creating config file /etc/idmapd.conf with new version
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline

Creating config file /etc/default/nfs-common with new version
Adding system user `statd' (UID 113) ...
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
Adding new user `statd' (UID 113) with group `nogroup' ...
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
Not creating home directory `/var/lib/nfs'.
nfs-utils.service is a disabled or a static unit, not starting it.
Setting up zip (3.0-11) ...
Setting up cgroup-lite (1.11) ...
Setting up libipset3:amd64 (6.29-1) ...
Setting up ipset (6.29-1) ...
Setting up socat (1.7.3.1-1) ...
Setting up libnss3-nssdb (2:3.28.4-0ubuntu0.16.04.6) ...
Setting up libnss3:amd64 (2:3.28.4-0ubuntu0.16.04.6) ...
Setting up librados2 (10.2.11-0ubuntu0.16.04.2) ...
Setting up librbd1 (10.2.11-0ubuntu0.16.04.2) ...
Setting up libcephfs1 (10.2.11-0ubuntu0.16.04.2) ...
Setting up python-cephfs (10.2.11-0ubuntu0.16.04.2) ...
Setting up python-rados (10.2.11-0ubuntu0.16.04.2) ...
Setting up python-rbd (10.2.11-0ubuntu0.16.04.2) ...
Setting up libradosstriper1 (10.2.11-0ubuntu0.16.04.2) ...
Setting up librgw2 (10.2.11-0ubuntu0.16.04.2) ...
Setting up ceph-common (10.2.11-0ubuntu0.16.04.2) ...
[/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
Processing triggers for libc-bin (2.23-0ubuntu10) ...
Processing triggers for dbus (1.10.6-1ubuntu3.3) ...
Processing triggers for ureadahead (0.100.0-19) ...
Processing triggers for ca-certificates (20170717~16.04.2) ...
Updating certificates in /etc/ssl/certs... 0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d... done.
Processing triggers for systemd (229-4ubuntu21.4) ...

Bridge chain: PREROUTING, entries: 0, policy: ACCEPT

Bridge chain: OUTPUT, entries: 0, policy: ACCEPT

Bridge chain: POSTROUTING, entries: 0, policy: ACCEPT

CecileRobertMichon commented 5 years ago

@vijaygos are you sure these are logs from an instance with a failure? These logs look like a successful run to me. (Custom script finished successfully)

vijaygos commented 5 years ago

@CecileRobertMichon , I can see the VMSS status as "Failed". However, I am not sure how to determine which is the "failed instance". Is there any way to find the failed instance from the portal?

CecileRobertMichon commented 5 years ago

@vijaygos I don't think the portal shows which instance failed unfortunately. We have an issue open to improve extension logs to print the hostname but in the meantime there's no easy way to get the instance ID that I know of. See https://github.com/Azure/aks-engine/issues/1496. If you are able to repro the issue with scaling a few instances at a time that might be the easiest way to know which instance to get the logs from.

CecileRobertMichon commented 5 years ago

Apparently you can also get to the extension status in the portal from the instance status:

[screenshot: per-instance extension status in the Azure portal]

vijaygos commented 5 years ago

Ah! Thanks @CecileRobertMichon for pointing me to that. This is rather strange: I have looked at the status for all 31 VMs in our cluster and they all show "ProvisioningState/succeeded". However, the VMSS Overview page shows a Failed status. [screenshot: VMSS overview showing Failed status]
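
For reference, the same per-instance check can be done from the CLI instead of clicking through every instance in the portal (a rough sketch; the resource group and VMSS names are placeholders):

# list the provisioning state reported by each instance in the scale set
az vmss list-instances -g MY_NODE_RESOURCE_GROUP --name k8s-agentpool1-12345678-vmss \
  --query "[].{instance:instanceId, state:provisioningState}" -o table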

chpspiir commented 4 years ago

We are experiencing the exact same thing. Any updates on what could be wrong?

rsingh612 commented 4 years ago

Also experiencing the same issue after updating the ServicePrincipal credentials: 'Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.'

sylus commented 4 years ago

Also getting this issue today, though there were no changes from the cluster's perspective. We were getting a rate-limit error earlier.

jackfrancis commented 4 years ago

Is there a common pattern in terms of cluster operations? Are these new cluster buildouts? Or a result of scale events on existing clusters?

zachomedia commented 4 years ago

Answering for @sylus - it's an existing cluster. We haven't made changes to the scale set ourselves - the last scale operation was last week. We found the issue when we had teams reporting that pods with disks weren't coming up.

It now seems that it's unable to mount the disks because it can't unmount them from the old nodes.

We seemed to hit a subscription write limit earlier today - though I'm not sure if that's related to this issue or not (if it was retrying too often).
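
For anyone else triaging the same symptom, the attach/detach failures usually surface as Kubernetes events on the affected pods and nodes (a generic sketch, nothing cluster-specific assumed):

# surface recent attach/detach/mount errors across all namespaces
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | grep -iE 'attach|detach|mount'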

MoergJ commented 4 years ago

We experience the same issue. Existing cluster, no changes regarding scale sets, but pods can't mount data disks and keep hanging in 'Init'.

lorenzo-biava commented 4 years ago

Same here; it worked fine yesterday evening, and today no VM is able to attach any disks. The VMSS shows the "Provisioning of VM extension 'vmssCSE' has timed out." failure state (the AKS cluster is in westeurope).

sylus commented 4 years ago

We filed an Azure support ticket and are supposed to get a call back this morning; we will post back any info.

rsingh612 commented 4 years ago

I was able to restart our VMSS nodes gracefully by running the following az CLI command. Previously, restarting the VMSS node using the GUI also resulted in "Provisioning of VM extension 'vmssCSE' has timed out."

Command: az vmss update-instances -g {RESOURCE_GROUP} --name {VMSS_AGENTPOOL} --instance-id {INSTANCE_ID}

After that, the cluster status resolved; all hung disks freed themselves and reattached without an issue.
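
A rough sketch of how the same remediation can be scripted across every instance in the scale set (names are placeholders; note that the posts in this thread use --instance-id, while current az releases spell the flag --instance-ids):

# apply the latest scale set model to each instance in turn
RG=MY_RESOURCE_GROUP
VMSS=MY_VMSS_AGENTPOOL
for id in $(az vmss list-instances -g "$RG" --name "$VMSS" --query "[].instanceId" -o tsv); do
  az vmss update-instances -g "$RG" --name "$VMSS" --instance-ids "$id"
done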

jarobar435 commented 4 years ago

Unfortunately, in my case, the outcome of the above wasn't any different from the GUI-triggered process:

$ az vmss update-instances -g {group} --name {vmss} --instance-id {id}
Deployment failed. Correlation ID: {id}. Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.
sharkymcdongles commented 4 years ago

We have the same issue, except it affects both the vmssCSE and RunCommandLinux extensions. We didn't enable or disable any extensions. I tried to run the update command, but it never completes successfully, failing with this:

cli.azure.cli.core.util : Deployment failed. Correlation ID: abeff543-0a78-4c26-aa0f-dc2d01b71222. Provisioning of VM extension 'RunCommandLinux' has timed out. Extension installation may be taking too long, or extension status could not be obtained.
Deployment failed. Correlation ID: abeff543-0a78-4c26-aa0f-dc2d01b71222. Provisioning of VM extension 'RunCommandLinux' has timed out. Extension installation may be taking too long, or extension status could not be obtained.

I tried to reinstall the extension with force and it also failed. This is a critical bug and is affecting our production. How has this not been addressed? We don't need this extension; nobody needs to run Linux commands via a CLI extension, and it should not be there by default. It's bad design, period, to force this on people.
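
For what it's worth, the extensions attached to a scale set (and their provisioning state) can be listed like this (a sketch; names are placeholders):

# show which extensions the scale set model carries
az vmss extension list -g MY_RESOURCE_GROUP --vmss-name MY_VMSS -o table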

jackfrancis commented 4 years ago

How many folks on this thread can correlate these failures with active Azure API rate limiting events?

@rsingh612, were you able to determine operational differences between your original failure events, and your successful (if manual) az vmss update-instances remediation?

sharkymcdongles commented 4 years ago

@jackfrancis we saw this:

failed with failed to get azure instance id for node "aks-e16sv3-79416110-vmss00000w" (compute.VirtualMachineScaleSetVMsClient#Get: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code="OperationNotAllowed" Message="The server rejected the request because too many requests have been received for this subscription." Details=[{"code":"TooManyRequests","message":"{\"operationGroup\":\"GetVMScaleSetVM30Min\",\"startTime\":\"2019-10-18T08:32:36.6723441+00:00\",\"endTime\":\"2019-10-18T08:47:36.6723441+00:00\",\"allowedRequestCount\":2500,\"measuredRequestCount\":2843}","target":"GetVMScaleSetVM30Min"}] InnerError={"internalErrorCode":"TooManyRequestsReceived"})

I assumed what happened was that a VM extension auto-updated, leading to the VMSS entering a failed state. While in a failed state, disk attach/detach operations fail but keep retrying. My hypothesis was that, because of the failed state, the disk retries triggered the API limit, but maybe that is an incorrect hypothesis.

jackfrancis commented 4 years ago

@feiskyer @khenidak will cloudprovider backoff v2 help recover from these events?

sylus commented 4 years ago

We are now getting this on a completely different cluster that has been up for months, after resolving the issue in another cluster in a separate subscription. Are we 100% sure there was no rollout change? It seems suspicious, with all the people here having the same problem, and over in:

https://github.com/Azure/AKS/issues/1278

Obviously something must have changed; is Microsoft looking into this?

It's almost as if all the extension statuses for the VMSS were lost all of a sudden.

Failed to upgrade virtual machine instance 'k8s-linuxpool1-13820493-vmss_1'. Error: Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.

On the instance itself however it shows this:

vmssCSE (Microsoft.Azure.Extensions.CustomScript, 2.0.7)
Provisioning succeeded
Info
ProvisioningState/succeeded

Can't even enable boot diagnostics, going to try to slowly replace each instance.

sylus commented 4 years ago

We were able to add a node and that process seems to have fixed the scaleset at least for now. ^_^

devigned commented 4 years ago

We are actively working on a resolution for this issue.

sharkymcdongles commented 4 years ago

Adding a node didn't work for us sadly.

@devigned Any idea on when you might expect a resolution to this and whether or not it will involve upgrading or migrating to a new aks / nodepool?

devigned commented 4 years ago

It seems like most of these issues are related to Azure API request limits being exceeded. It would be extremely helpful if everyone experiencing this issue could ping back with the following:

Number of nodes in your cluster?
Number of disks attached to those nodes?
Is cluster auto scaler enabled?
What operation did you run when this issue was triggered?
What version of aks-engine?
What version of K8s?

Thank you all. We will do our best to root cause this and get a fix out as soon as possible.

sharkymcdongles commented 4 years ago

Number of nodes in your cluster? 30

Number of disks attached to those nodes? Varies, but we have 131 PVCs with 5 or so per node.

Is cluster auto scaler enabled? No, it is disabled.

What operation did you run when this issue was triggered? We didn't do anything. The cluster just stopped mounting PVCs and said there was an API limit. We then checked and saw the VMSS was in a failed state due to the RunCommandLinux and vmssCSE VM extensions.

What version of aks-engine? aksEngineVersion: v0.39.2-aks

What version of K8s? 1.14.6

devigned commented 4 years ago

@sharkymcdongles give this a run please:

kubectl logs {{ your master named like: kube-controller-manager-k8s-master-XXXXXXX-0 }} \
-n kube-system | grep azure

Please report back any interesting rate limiting logs you find.

sharkymcdongles commented 4 years ago

@devigned this is AKS, I don't have access to the masters.

this is the affected vmss in azure:

/subscriptions/63ffcd66-455e-4369-87e6-a4a55983f44e/resourceGroups/MC_prod-eu-mip-aks-rg_euprod_westeurope/providers/Microsoft.Compute/virtualMachineScaleSets/aks-e16sv3-79416110-vmss

and aks:

/subscriptions/63ffcd66-455e-4369-87e6-a4a55983f44e/resourcegroups/prod-eu-mip-aks-rg/providers/Microsoft.ContainerService/managedClusters/euprod

CecileRobertMichon commented 4 years ago

@sharkymcdongles for AKS cluster issues, please open a support ticket and/or open an issue at https://github.com/Azure/AKS/issues.

sharkymcdongles commented 4 years ago

There is an issue there already, but this is the only place with engineers replying, lol. Also, AKS still uses aks-engine under the hood (hence the name), so this bug is relevant.

andyzhangx commented 4 years ago

One workaround is to manually update the VMSS instance that has the issue:

az vmss update-instances -g <RESOURCE_GROUP_NAME> --name <VMSS_NAME> --instance-id <ID(number)>

Let me know whether that works. I have seen some customers use this to work around the issue.

sharkymcdongles commented 4 years ago

@andyzhangx that was the first thing I tried. It fails with an internal execution error. I then tried deallocating one of the bad nodes, but now it won't start, also failing with an internal execution error.

rsingh612 commented 4 years ago

@jackfrancis

Operational differences after failure (service principal update):

After executing upgrade:

abarqawi commented 4 years ago

Please detach the disk explicitly; it should work:

az vmss disk detach -g MC_kube_xyz_abc --name aks-d8s-27745123-vmss --instance-id 0 --lun 0

where 0 is the VM instance ID (it can be 0, 1, 2, ...) and --name aks-d8s-27745123-vmss is the VMSS name.
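
If you first need to find which LUN the stuck disk sits on, something like this can help (a sketch reusing the names from the command above):

# list the data disks and their LUNs for instance 0 of the scale set
az vmss list-instances -g MC_kube_xyz_abc --name aks-d8s-27745123-vmss \
  --query "[?instanceId=='0'] | [0].storageProfile.dataDisks[].{lun:lun, name:name}" -o table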

lorenzo-biava commented 4 years ago

@abarqawi nice idea, but in my case it still fails:

Deployment failed. Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.

However, after a while (hours) the disk might get detached/attached successfully.

andreygolev commented 4 years ago

Joining the club. Same issues with both the vmssCSE extension and stuck pods waiting for storage.

ahmedspiir commented 4 years ago

Same here, problem with vmssCSE timing out.

acm-073 commented 4 years ago

Same problem here.

The workaround above helped in our case (run update-instances for each instance, run reimage for instances where update-instances returns an error).

However, after the procedure, the Azure portal still shows the VMSS as "failed", with the error message saying "Provisioning of VM extension 'vmssCSE' has timed out."

sylus commented 4 years ago

And now our third cluster with VMSS is doing this, in a different subscription, after it happened to two other clusters. This is definitely starting to get kind of frustrating, lol.

sylus commented 4 years ago

Can we get some sort of statement from Azure about what is going on? I think this is happening to enough people to warrant some sort of explanation. I am super concerned about this.

We obviously know the Azure folks are doing their best, just a bit concerned :)

jackfrancis commented 4 years ago

Is anyone on this thread seeing these symptoms on a cluster that is the only cluster in its subscription? That would make reproducing the error state easier, thus accelerating root-cause diagnosis.

devigned commented 4 years ago

@sylus, would you be up for jumping on a call and having a shared debugging session?

sylus commented 4 years ago

@devigned we have gotten our clusters back in a working state for now but super happy to have a chat on the problems we were having and provide any information that will assist you.

We are free from 2-4:30 EDT, or can find a time tomorrow if needed :D

Thanks a bunch!

jackfrancis commented 4 years ago

@andreygolev what version of Kubernetes is the cluster (or clusters) running that are suffering rate throttling?

sharkymcdongles commented 4 years ago

Mine is 1.14.6

lorenzo-biava commented 4 years ago

Same version here.

jackfrancis commented 4 years ago

@sharkymcdongles If you're still in a sticky spot, you may try the following to gain cloudprovider improvements w/ respect to Azure API calls:

For each master node:

There are several optimizations to VMSS-based Kubernetes resource reconciliations that are in 1.13.12, 1.14.8, 1.15.5, and 1.16.2. Those versions were only just added in an aks-engine release today:

https://github.com/Azure/aks-engine/releases/tag/v0.42.1

For folks who have a working aks-engine upgrade workflow in place, we recommend getting those versions via v0.42.1 of aks-engine. For folks managing clusters in active throttling situations, we don't recommend executing an aks-engine upgrade, as that operation will itself require a non-throttled Azure API subscription in order to succeed. :(
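
For those taking the upgrade route, the flow looks roughly like this (a sketch; subscription, resource group, api-model path, and service principal values are placeholders for your own deployment):

# upgrade the cluster to a patch release that contains the VMSS call optimizations
aks-engine upgrade \
  --subscription-id MY_SUBSCRIPTION_ID \
  --resource-group MY_RESOURCE_GROUP \
  --location westeurope \
  --api-model _output/mycluster/apimodel.json \
  --upgrade-version 1.14.8 \
  --client-id MY_SP_CLIENT_ID \
  --client-secret MY_SP_CLIENT_SECRET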

An additional step-wise remediation is to apply a newer version of "exponential backoff" to the cloudprovider config by adding this line to /etc/kubernetes/azure.json (again, on all master vms):

$ sudo cat /etc/kubernetes/azure.json | grep cloudProviderBackoffMode
    "cloudProviderBackoffMode": "v2",

(The above output is from a test cluster running that config.)

The above configuration reduces the number of calls that pile up and are continually retried against the Azure APIs during HTTP 429 situations, helping most scenarios back off more effectively so that Azure API throttling symptoms go away sooner.
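
One way to add the setting in place on a master (a sketch; it assumes jq is installed, takes a backup first, and notes that the control plane components only pick up the change after they are restarted):

# back up the cloud provider config, then add "cloudProviderBackoffMode": "v2"
sudo cp /etc/kubernetes/azure.json /etc/kubernetes/azure.json.bak
sudo jq '. + {cloudProviderBackoffMode: "v2"}' /etc/kubernetes/azure.json.bak | sudo tee /etc/kubernetes/azure.json > /dev/null
grep cloudProviderBackoffMode /etc/kubernetes/azure.json   # verify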

devigned commented 4 years ago

@sylus glad to hear your cluster is back in a working state! Thank you for the kind words and offer for additional info. We think we have narrowed down the issue and we are hoping https://github.com/Azure/aks-engine/issues/1860#issuecomment-544694755 will reduce the API throttling we think is behind these issues.

Furthermore, we have some serious API reductions upstream in https://github.com/kubernetes/cloud-provider-azure. As soon as these improvements stabilize, we'll integrate them into a future aks-engine release.

andyzhangx commented 4 years ago

If you still get the vmssCSE timeout issue even after the VMSS update, you can reimage the VMSS instance to mitigate. Please make sure no pods are running on that instance before the reimage (run kubectl cordon <agent-node-name> first).

Reimage virtual machine instance(s)
This action will reimage the selected instances, restoring them to their initial state. The instances will be restarted, and any local data will be lost. Do you want to reimage the selected instances?
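
A sketch of the cordon-and-reimage flow described above (node and resource names are placeholders; draining first is optional but evicts the workloads gracefully, and newer az releases spell the flag --instance-ids):

# stop scheduling onto the instance, optionally evict its pods, then reimage it
kubectl cordon aks-nodepool1-12345678-vmss000003
kubectl drain aks-nodepool1-12345678-vmss000003 --ignore-daemonsets --delete-local-data
az vmss reimage -g MY_RESOURCE_GROUP --name MY_VMSS --instance-id 3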

cc @yangl900