Azure / aks-engine

AKS Engine: legacy tool for Kubernetes on Azure (see status)
https://github.com/Azure/aks-engine
MIT License

Provisioning of VM extension 'vmssCSE' has timed out #1860

Closed vijaygos closed 4 years ago

vijaygos commented 5 years ago

What happened: VMSS status is set to 'failed' with error message - "Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained."

As a result, the SLB does not allow a Service of type LoadBalancer to bind to a public IP resource. The service status is always Pending:

NAME           TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)                      AGE
my-service-1   ClusterIP      10.0.52.61    <none>        8125/UDP                     57m
my-service-2   LoadBalancer   10.0.165.41   <pending>     80:31330/TCP,443:30354/TCP   46m

What you expected to happen: No CSE errors; the service should bind to the given public IP resource without errors.

How to reproduce it (as minimally and precisely as possible): No reliable steps; it happens at random when the cluster attempts to scale and add a new VM.

Anything else we need to know?: AKS Engine version is 0.28.1.

Environment:

While I am tempted to say this looks like a duplicate of #802, I would appreciate another look.

jackfrancis commented 4 years ago

@andyzhangx thanks for that diagnosis; a fix for that has been included in v0.42.2, just published:

https://github.com/Azure/aks-engine/releases/tag/v0.42.2

lundsec commented 4 years ago

And do we run an aks-engine upgrade to get our VMSS onto the new version?

jackfrancis commented 4 years ago

Hi @therock, aks-engine upgrade would work. A more lightweight approach would be to use v0.42.2 of aks-engine and run aks-engine scale against the desired VMSS-backed pool, giving it an n+1 count compared to the existing scale set size. That would update the model for the scale set, and then you could run "update model" against a single instance. I think that's safer than running upgrade, which will re-pave everything.
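
For reference, that scale-by-one step might look roughly like the sketch below. Every value here is a placeholder, and the exact flag set can differ between aks-engine releases, so check aks-engine scale --help for your version:

# Sketch of the n+1 scale workaround described above (placeholder values only).
aks-engine scale \
  --subscription-id $SUBSCRIPTION_ID \
  --resource-group my-cluster-rg \
  --location eastus \
  --api-model _output/mycluster/apimodel.json \
  --node-pool linuxpool1 \
  --new-node-count 4 \
  --apiserver mycluster.eastus.cloudapp.azure.com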

ahmedspiir commented 4 years ago

Hi @jackfrancis, when will I have it on Azure? I can see from the VMSS tags that I'm running "aksEngineVersion: v0.41.4-aks". Also, is there a way to get the fix into my cluster faster?

lundsec commented 4 years ago

After running aks-engine scale with v0.42.2, it fails with the same error:

Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.

Error: Code="DeploymentFailed" Message="At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details." Details=[{"code":"Conflict","message":"{
  "status": "Failed",
  "error": {
    "code": "ResourceDeploymentFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "VMExtensionProvisioningTimeout",
        "message": "Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained."
      }
    ]
  }
}"}]

acm-073 commented 4 years ago

Hi @jackfrancis, when will I have it on Azure? I can see from the VMSS tags that I'm running "aksEngineVersion: v0.41.4-aks". Also, is there a way to get the fix into my cluster faster?

Same question here: we're running into the very problems described here on a managed AKS cluster, so the question is when the fixes described here will be available in managed AKS.

lundsec commented 4 years ago

After the failed aks-engine scale with v0.42.2, I tried aks-engine upgrade from version 1.16.0 to 1.16.2, and it failed at the last step (Finished ARM Deployment) with the same error:

Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.

The upgraded master-0 node works fine despite that error, so I ran it again four more times for the rest of the masters. Now all masters are upgraded to version 1.16.2. It was then supposed to do the agent scale sets but failed 20 seconds later:

INFO[0012] Starting upgrade of master nodes...
INFO[0012] masterNodesInCluster: 5
INFO[0012] Master VM: k8s-master-33419389-0 is upgraded to expected orchestrator version
INFO[0012] Master VM: k8s-master-33419389-1 is upgraded to expected orchestrator version
INFO[0012] Master VM: k8s-master-33419389-2 is upgraded to expected orchestrator version
INFO[0012] Master VM: k8s-master-33419389-3 is upgraded to expected orchestrator version
INFO[0012] Master VM: k8s-master-33419389-4 is upgraded to expected orchestrator version
INFO[0012] Expected master count: 5, Creating 0 more master VMs
INFO[0012] Deploying the agent scale sets ARM template...
INFO[0012] Starting ARM Deployment agentscaleset-19-10-23T12.06.49-1177339277 in resource group xxxxxx. This will take some time...
INFO[0040] Finished ARM Deployment (agentscaleset-19-10-23T12.06.49-1177339277). Error: Code="DeploymentFailed" Message="At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details." Details=[{"code":"Conflict","message":"{
  \"status\": \"Failed\",
  \"error\": {
    \"code\": \"ResourceDeploymentFailure\",
    \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",
    \"details\": [
      {
        \"code\": \"VMExtensionProvisioningTimeout\",
        \"message\": \"Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.\"
      }
    ]
  }
}"}]
ERRO[0040] error applying upgrade template in upgradeAgentScaleSets: Code="DeploymentFailed" Message="At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details." Details=[{"code":"Conflict","message":"{
  \"status\": \"Failed\",
  \"error\": {
    \"code\": \"ResourceDeploymentFailure\",
    \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",
    \"details\": [
      {
        \"code\": \"VMExtensionProvisioningTimeout\",
        \"message\": \"Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.\"
      }
    ]
  }
}"}]
INFO[0040] Resuming cluster autoscaler, replica count: 1
Error: upgrading cluster: Code="DeploymentFailed" Message="At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details." Details=[{"code":"Conflict","message":"{
  \"status\": \"Failed\",
  \"error\": {
    \"code\": \"ResourceDeploymentFailure\",
    \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",
    \"details\": [
      {
        \"code\": \"VMExtensionProvisioningTimeout\",
        \"message\": \"Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.\"
      }
    ]
  }
}"}]

zachomedia commented 4 years ago

@jackfrancis We upgraded to 1.15.5 with aks-engine 0.42.2 and enabled the v2 backoff. But it seems that when it hits a rate-limit scenario it just hammers the Azure API, and the only way to recover is to turn off the controller-manager for a while to clear it.

For example:

The server rejected the request because too many requests have been received for this subscription. (Code: OperationNotAllowed) {"operationgroup":"HighCostGetVMScaleSet30Min","starttime":"2019-10-23T14:18:11.960853+00:00","endtime":"2019-10-23T14:33:11.960853+00:00","allowedrequestcount":900,"measuredrequestcount":3157} (Code: TooManyRequests, Target: HighCostGetVMScaleSet30Min)

(We also hit the HighCostGetVMScaleSet3Min one too)

AustinSmart commented 4 years ago

Will the fixes being addressed over at the AKS issue be applicable here?

sylus commented 4 years ago

Yeah, as @zachomedia said, it appears worse: we barely get 15-20 minutes of proper Kubernetes operation before being rate limited again. @devigned @jackfrancis, can we arbitrarily increase our limit?

Our clients are getting pretty insistent (lol) about the state of things this week, and it's putting a lot of pressure on us. We're really worried this will force us to move workloads somewhere else, maybe not on VMSS, but we love the rolling upgrades so we don't particularly want to. We've been stalled for the past 2-3 days. We can maybe escalate our support ticket to priority.

Note: we do seem to be able to find the disk potentially causing the problem, but there are a lot of moving parts, so we're trying to isolate it further.

devigned commented 4 years ago

/cc @aramase for insight into the Cloud Provider for Azure which is what's issuing the calls to ARM.

jackfrancis commented 4 years ago

For folks who are having issues w/ VMSS CSE timeouts (not necessarily related to throttling): a CSE bug has been identified and is being triaged. It correlates with folks experiencing this issue last week. (This CSE bug has nothing to do w/ AKS Engine's CSE script(s).)

If you have one or more VMSS instances in this state, please try manually re-imaging the instance. We've seen that workaround help restore nodes for folks.
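
For example, re-imaging a stuck instance from the CLI might look roughly like this (all names are placeholders, and the instance-id flag spelling can vary between az CLI versions):

# Reimage a single VMSS instance (placeholder names; verify the flag names
# against your az CLI version before running).
az vmss reimage --resource-group my-cluster-rg --name k8s-pool1-12345678-vmss --instance-ids 0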

And please report back if that unblocks you.

And apologies. :(

zachomedia commented 4 years ago

OK, so I managed to get our cluster back into a good state for now. It seems that most operations that cause a disk unmount can trigger the problem again (@sylus can fill in more there).

Basically to find out which disk is stuck, I looked into the controller-manager logs and saw a bunch of:

I1023 20:02:29.782144       1 attacher.go:89] Attach volume "/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME" to instance "k8s-linuxpool1-12345678-vmss00000c" failed with disk(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME) already attached to node(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_10), could not be attached to node(k8s-linuxpool1-12345678-vmss00000c)
I1023 20:02:29.782187       1 actual_state_of_world.go:322] Volume "kubernetes.io/azure-disk//subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME" is already added to attachedVolume list to node "k8s-linuxpool1-12345678-vmss_10", update device path ""
E1023 20:02:29.782390       1 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/azure-disk//subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME\"" failed. No retries permitted until 2019-10-23 20:02:30.282260017 +0000 UTC m=+391.358857249 (durationBeforeRetry 500ms). Error: "AttachVolume.Attach failed for volume \"pvc-$PVC_NAME\" (UniqueName: \"kubernetes.io/azure-disk//subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME\") from node \"k8s-linuxpool1-12345678-vmss00000c\" : disk(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME) already attached to node(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_10), could not be attached to node(k8s-linuxpool1-12345678-vmss00000c)"
I1023 20:02:29.782447       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"adp", Name:"appname-7587887f5d-8pndl", UID:"2b777c28-859e-45cd-984f-af2734a436a5", APIVersion:"v1", ResourceVersion:"10305952", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' AttachVolume.Attach failed for volume "pvc-$PVC_NAME" : disk(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME) already attached to node(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_10), could not be
attached to node(k8s-linuxpool1-12345678-vmss00000c)

To fix it, I ran az vmss list-instances --resource-group k8s-cancentral-01-dev-rg --name k8s-linuxpool1-12345678-vmss --query '[].[name, storageProfile.dataDisks[]]' | less, found the disk and got the lun number. I then ran az vmss disk detach --resource-group k8s-cancentral-01-dev-rg --vmss-name k8s-linuxpool1-12345678-vmss --instance-id 12 --lun 3 to detach it.

Once it was detached, the cluster slowly started to recover.
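
A scripted version of that LUN lookup and detach might look roughly like this (assuming bash, jq, and the az CLI; the resource and disk names are the placeholders from the logs above):

# Find every VMSS instance/LUN holding a data disk whose name contains the stuck
# PVC id, then detach each one. All names below are placeholders -- adjust first.
RG=k8s-cancentral-01-dev-rg
VMSS=k8s-linuxpool1-12345678-vmss
PVC=pvc-b2220d20-9e95-4973-923d-95cc6e49ff4c

az vmss list-instances -g "$RG" -n "$VMSS" -o json \
  | jq -r --arg pvc "$PVC" '
      .[]
      | .instanceId as $id
      | (.storageProfile.dataDisks // [])[]
      | select(.name | contains($pvc))
      | "\($id) \(.lun)"' \
  | while read -r INSTANCE LUN; do
      az vmss disk detach -g "$RG" --vmss-name "$VMSS" --instance-id "$INSTANCE" --lun "$LUN"
    done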

sylus commented 4 years ago

Kubernetes Version: v1.15.5 (1 master, 4 nodes)
AKS Engine: v0.42.2
Kernel Version: 5.0.0-1023-azure
OS Image: Ubuntu 18.04.3 LTS
Date Issue Started: Last week (Oct 15, 2019), on clusters that had been running without issue for 3 months

Note: we don't seem to have the CSE issue anymore, but I'm still keeping this here for the moment since it all seems related. We also added cloudProviderBackoffMode: v2, as @jackfrancis indicated, and removed cloudProviderBackoffExponent and cloudProviderBackoffJitter.

Re: the comment above, the disk (un)mount does seem to be the problem. I was able to reproduce a base case by just deleting a few pods and/or running a helm upgrade of a deployment, which started triggering the following disk-unmount errors almost right away.

I1023 23:49:18.644247       1 pv_controller.go:1270] isVolumeReleased[pvc-b2220d20-9e95-4973-923d-95cc6e49ff4c]: volume is released
I1023 23:49:18.644258       1 pv_controller.go:1270] isVolumeReleased[pvc-653fdf5a-7408-460e-89d3-fdd0a6dd5fdf]: volume is released
E1023 23:49:19.821907       1 goroutinemap.go:150] Operation for "delete-pvc-b2220d20-9e95-4973-923d-95cc6e49ff4c[52a16d3e-b5dd-4cc1-a64e-b03f6d61948b]" failed. No retries permitted until 2019-10-23 23:51:21.821868366 +0000 UTC m=+12222.629161185 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-b2220d20-9e95-4973-923d-95cc6e49ff4c is attached to VM /subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-28391316-vmss/virtualMachines/k8s-linuxpool1-28391316-vmss_12.\""
E1023 23:49:19.826714       1 goroutinemap.go:150] Operation for "delete-pvc-653fdf5a-7408-460e-89d3-fdd0a6dd5fdf[4c641582-1171-4c98-8189-29185623fc1c]" failed. No retries permitted until 2019-10-23 23:51:21.826677075 +0000 UTC m=+12222.633969894 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-653fdf5a-7408-460e-89d3-fdd0a6dd5fdf is attached to VM /subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-28391316-vmss/virtualMachines/k8s-linuxpool1-28391316-vmss_12.\""

The suspicion is that, with multiple teams (re)deploying their apps, enough of these disk failures accumulate that we eventually hit the rate limits set by Azure. Once it gets into this state, other operations against the VMSS don't succeed. Then, as mentioned above, we need to stop the controller-manager pod for a while to clear the rate limit, as illustrated below.

The server rejected the request because too many requests have been received for this subscription. (Code: OperationNotAllowed) {"operationgroup":"HighCostGetVMScaleSet30Min","starttime":"2019-10-23T14:18:11.960853+00:00","endtime":"2019-10-23T14:33:11.960853+00:00","allowedrequestcount":900,"measuredrequestcount":3157} (Code: TooManyRequests, Target: HighCostGetVMScaleSet30Min)
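
A rough sketch of that controller-manager pause, assuming the control plane runs as static pods under /etc/kubernetes/manifests as on a typical aks-engine master (the path and file name are assumptions -- verify them on your own nodes first):

# On each master, move the static pod manifest out of the kubelet manifest
# directory so kube-controller-manager stops, wait for the throttling window
# (e.g. HighCostGetVMScaleSet30Min) to expire, then move it back.
sudo mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
# ... wait for the rate-limit window to reset (roughly 30 minutes) ...
sudo mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/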

The controller-manager logs list the PVC and the instance ID of the disk that can't be detached, and we can use the PVC name to find the LUN with the command below.

az vmss list-instances --resource-group k8s-cancentral-01-dev-rg --name k8s-linuxpool1-12345678-vmss --query '[].[name, storageProfile.dataDisks[]]' 

We then have to run the following for ALL disks listed in the controller-manager logs that have this problem:

az vmss disk detach --resource-group k8s-cancentral-01-dev-rg --vmss-name k8s-linuxpool1-12345678-vmss --instance-id $ID --lun $LUN

The cluster is then back in a working state until the next deployment, which will trigger the PVC issue again; rinse and repeat ^_^

Related Issues

a) All of this is explained in further detail over at the AKS issue by Microsoft: https://github.com/Azure/AKS/issues/1278#issuecomment-545234688

b) Additionally, this looks directly related as well, although our cluster is an order of magnitude smaller: https://github.com/kubernetes/cloud-provider-azure/issues/247

jackfrancis commented 4 years ago

@sylus @zachomedia in your failure scenarios, are you ever encountering this error:

Cannot attach data disk '<disk_id>' to VM '<vmss_id>' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."

And if so, are you observing that the disk it's complaining about seems, in fact, to be totally unattached?

We encountered this with another customer, and were able to follow your guidance to manually detach the offending disk (even though it wasn't attached to anything! — we detached it from the vmss instance id that it was trying to attach itself to 🤷‍♂ ).

In any event, FYI for folks continuing to struggle with this.

zachomedia commented 4 years ago

@jackfrancis Yeah, we've seen that error in our logs too. I don't think we've ever checked whether it was actually attached or not; usually we just detach it through the CLI.

We did also have a weird state today where one of our instances apparently had a disk attached that no longer existed, so all other disk operations failed. Once we removed that attachment, it recovered.

jackfrancis commented 4 years ago

@zachomedia How did you get the lun number in the case where the disk is not actually attached to any VMSS instances?

In our troubleshooting the following command didn't yield the lun in such a scenario:

az vmss list-instances --resource-group <rg> --name <vmss_name> --query '[].[name, storageProfile.dataDisks[]]'

zachomedia commented 4 years ago

@jackfrancis Oh, I see, for all of our cases the disk was in the list. So I guess that means it was attached.

jackfrancis commented 4 years ago

@sylus @zachomedia do you think this is an appropriate repro?:

1) Install this statefulset on a cluster w/ 50 VMSS nodes:

$ cat statefulset.yaml 
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  podManagementPolicy: "Parallel"
  replicas: 50
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "managed-standard"
      resources:
        requests:
          storage: 1Gi

(note that the node-to-replica ratio is 1:1)

2) When pods come online, start deleting them (see the sketch below)

3) StatefulSet reconciliation will spin up more pods to fulfill the replica count
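
A minimal churn loop for step 2 might look like this (assumes bash and kubectl access to the test cluster; the web-N pod names follow the StatefulSet spec above):

# Repeatedly delete a random web-N pod so the StatefulSet controller keeps
# re-creating pods and re-attaching their PVC-backed disks.
while true; do
  kubectl delete pod "web-$((RANDOM % 50))" --wait=false
  sleep 30
done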

I wonder if the above will induce the weird zombie detach state we're seeing.

zachomedia commented 4 years ago

@jackfrancis

I would say that you should have a pretty reasonable chance at reproducing the issue with that setup. Our cluster is much smaller (about 5 nodes) and usually just a couple of pods with disks being deleted can trigger it.

jackfrancis commented 4 years ago

@zachomedia Node count is static? I.e., disk re-attachment operations aren't happening as a result of underlying VMSS instances disappearing, re-appearing, etc?

zachomedia commented 4 years ago

@jackfrancis That's correct, node count is static.

jackfrancis commented 4 years ago

(So far unable to repro, but will keep trying.)

It's also possible that disk detach/attach operations during throttle events are the edge case causing this behavior (my test cluster is not being actively throttled atm).

zachomedia commented 4 years ago

@jackfrancis So, something you can try: it seems most of our problems now stem from PVCs being deleted (one of our teams currently deletes their deployments and re-creates them). We seem to get two things:

  1. A disk that apparently no longer exists is still attached to a VMSS instance, and no disk operations succeed until we detach it (the error appears when we attempt to manually detach a different disk that was failing - sorry, I forgot to save the message)
  2. Some deletes never seem to unmount the disk (maybe it hit some timeouts and gave up trying to unmount). We see a bunch of:
I1025 15:21:21.911563       1 pv_controller.go:1270] isVolumeReleased[pvc-PVC1]: volume is released
I1025 15:21:21.911914       1 pv_controller.go:1270] isVolumeReleased[pvc-PVC2]: volume is released
I1025 15:21:21.913827       1 pv_controller.go:1270] isVolumeReleased[pvc-PVC3]: volume is released
E1025 15:21:23.028284       1 goroutinemap.go:150] Operation for "delete-pvc-PVC2[5dcfd7e1-e61b-4206-9cac-6e7758e3366c]" failed. No retries permitted until 2019-10-25 15:23:25.028194036 +0000 UTC m=+154545.835486755 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-PVC2 is attached to VM /subscriptions/<subscription>/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_12.\""
E1025 15:21:23.029986       1 goroutinemap.go:150] Operation for "delete-pvc-PVC3[3507cbda-868e-4198-b771-620234d258b5]" failed. No retries permitted until 2019-10-25 15:23:25.029908242 +0000 UTC m=+154545.837201061 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-PVC3 is attached to VM /subscriptions/<subscription>/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_12.\""
E1025 15:21:23.069595       1 goroutinemap.go:150] Operation for "delete-pvc-PVC1[4d9dbfa3-1dba-4478-9eae-f475d033257a]" failed. No retries permitted until 2019-10-25 15:23:25.069557061 +0000 UTC m=+154545.876849880 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-PVC1 is attached to VM /subscriptions/<subscription>/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_12.\""

andyzhangx commented 4 years ago

@zachomedia You need to drain k8s-linuxpool1-12345678-vmss_12 and then reimage k8s-linuxpool1-12345678-vmss_12 to clean that vmss instance state as a workaround.
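
For reference, that drain-and-reimage sequence might look roughly like the following (the node name assumes the usual base-36 suffix for instance 12, i.e. vmss00000c, and flag spellings can vary between kubectl and az CLI versions):

# Drain the node backed by instance 12, reimage that VMSS instance, then let it rejoin.
# Names and flag spellings are assumptions -- verify against your cluster and CLI first.
kubectl drain k8s-linuxpool1-12345678-vmss00000c --ignore-daemonsets --delete-local-data
az vmss reimage --resource-group k8s-cancentral-01-dev-rg --name k8s-linuxpool1-12345678-vmss --instance-ids 12
kubectl uncordon k8s-linuxpool1-12345678-vmss00000c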

sylus commented 4 years ago

Problems again today, with 5 of our 6 VMs in a failed state, and we can't even reimage because we get the error below. I also launched a few AKS clusters over the weekend, and as soon as they were turned off overnight they all ended up in a failed state with disk issues. Really hoping a fix is forthcoming; this is plainly reproducible.

Failed to reimage virtual machine instances k8s-linuxpool1-12345678-vmss_12, k8s-linuxpool1-12345678-vmss_10, k8s-linuxpool1-12345678-vmss_9, k8s-linuxpool1-12345678-vmss_11. Error: The processing of VM 'k8s-linuxpool1-12345678-vmss_10' is halted because of one or more disk processing errors encountered by VM 'k8s-linuxpool1-12345678-vmss_12' in the same Availability Set. Please resolve the error with VM 'k8s-linuxpool1-12345678-vmss_12' before retrying the operation.

jackfrancis commented 4 years ago

@sylus is the StatefulSet spec here not a viable repro input for inducing this symptom on a test cluster?

https://github.com/Azure/aks-engine/issues/2221

As the issue describes I was able to witness some badness (described in the issue w/ the working remediation steps I came up with at the time), but I haven't been able to reliably repeat all the badness so many folks are seeing now. Would love to get that repro process so that we can more effectively help drive fixes.

Thanks for hanging in there. :/

craiglpeters commented 4 years ago

A VMSS bug in the update on Oct 17 was identified and remediated globally over Oct 28 and 29. Disks should no longer be stuck in the 'detaching' state in VMSS, and so any Kubernetes operations should now be able to proceed without running into this issue. If you observe any new instances of this same problem please reopen this bug and I'll work to determine the cause.

mbarry-msdn commented 2 years ago

We're getting this error with version 1.22.6: "Message: Provisioning of VM extension vmssCSE has timed out. Extension provisioning has taken too long to complete. The extension last reported "Plugin enabled". More information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot. Time: Monday,"

Lu1234 commented 1 year ago

We're getting this error with version 1.22.6: "Message: Provisioning of VM extension vmssCSE has timed out. Extension provisioning has taken too long to complete. The extension last reported "Plugin enabled". More information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot. Time: Monday,"

I'm facing the same error even though nothing has been changed on the cluster. I really need some advice on this.

sharkymcdongles commented 1 year ago

My advice is to switch to a better cloud provider. Our entire company had to switch because of this issue 3 years ago. Surprised they still have these problems.


CecileRobertMichon commented 1 year ago

Hi all, I suggest opening a new issue in https://github.com/Azure/AKS/issues with details of the problem/error you are facing.

I want to make sure you're getting the help you need. This is a closed issue from 3 years ago in a deprecated project (https://github.com/Azure/aks-engine#project-status), so commenting here likely won't get the right people to look into it.