Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Stop AKS cluster does not properly stop the cluster #1871

Closed. vermegi closed this issue 3 years ago

vermegi commented 4 years ago

I used the AKS start/stop preview feature following the guidance at https://docs.microsoft.com/en-us/azure/aks/start-stop-cluster, but my cluster was not stopped correctly.

When I query the state of the cluster through az aks show, the powerState property reports a value of 'null'.
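
For reference, this is roughly how I checked it ($RG_NAME and $CLUSTER_NAME are placeholders for my resource group and cluster name):

# Query the reported power state of the cluster
az aks show --resource-group $RG_NAME --name $CLUSTER_NAME --query powerState
# I would expect something like {"code": "Stopped"}, but it comes back as null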

I cannot stop the cluster a second time. This gives a BadRequest error:

ValidationError: Operation failed with status: 'Bad Request'. Details: managed cluster is in (Succeeded,Stopped) state, stopping cannot be performed

I cannot start the cluster again:

ValidationError: Operation failed with status: 'Bad Request'. Details: Client Error: Availability Sets Not Supported

The worker node virtual machine associated with my cluster is also still in a running state.

What you expected to happen:

The cluster should go into a proper stopped state and my worker node(s) should be stopped as well. If it turns out the version I am running is too old to support stopping the cluster, I would expect a clear error message telling me to first upgrade the cluster to the latest version before issuing a stop command.

How to reproduce it (as minimally and precisely as possible):

Create a 1-node cluster with version 1.16.9. Run through the steps at https://docs.microsoft.com/en-us/azure/aks/start-stop-cluster, as sketched below.
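
A minimal sketch of the relevant commands, assuming $RG_NAME and $CLUSTER_NAME are set and the aks-preview extension and feature registration from the linked doc are already in place:

# Stop the cluster
az aks stop --resource-group $RG_NAME --name $CLUSTER_NAME
# Check the reported power state afterwards
az aks show --resource-group $RG_NAME --name $CLUSTER_NAME --query powerState
# Trying to start it again is what produces the errors above
az aks start --resource-group $RG_NAME --name $CLUSTER_NAME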

Environment:

ghost commented 4 years ago

Hi vermegi, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be described in the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

NickKeller commented 4 years ago

Hi there! Looking at the error message when you try to start the cluster, it looks like your cluster is using Availability Set nodepools. Would you be able to confirm that for me?
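
One way to confirm it (with $RG_NAME and $CLUSTER_NAME as placeholders) is to look at the agent pool type the CLI reports:

# Prints VirtualMachineScaleSets or AvailabilitySet for each node pool
az aks show --resource-group $RG_NAME --name $CLUSTER_NAME --query "agentPoolProfiles[].type" -o tsv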

leovms commented 4 years ago

I'm experiencing the same problem when I try to start the cluster and I can confirm that in my case I am using availability sets.

jeanfrancoislarente commented 4 years ago

I'm using VirtualMachineScaleSet and running into this issue when trying to start the cluster:

managed cluster is in (Succeeded,Stopped) state, starting cannot be performed

k8s v. 1.19.0

palma21 commented 4 years ago

This feature is only supported for VMSS-based clusters. This should be the case for all new features in AKS.

@jeanfrancoislarente 1.19 just launched in preview so there might be something there, are you able to open a support ticket and share the ticket number here? Otherwise send me a DM with your cluster details.

ghost commented 4 years ago

Hi there :wave: AKS bot here. This issue has been tagged as needing a support request so that the AKS support and engineering teams have a look into this particular cluster/issue.

Follow the steps here to create a support ticket for Azure Kubernetes Service and the cluster discussed in this issue.

Please do mention this issue in the case description so our teams can coordinate to help you.

Thank you!

mfabiani-av commented 4 years ago

I'm running a private VMSS-based cluster with K8s 1.16, same issue: az aks stop results in a "null" power state and the cluster cannot be restarted afterwards. This is the command I use to create the cluster:

az aks create --resource-group $RG_NAME --name $CLUSTER_NAME \
  --kubernetes-version $VERSION \
  --location $LOCATION \
  --subscription $SUBSCRIPTION \
  --enable-private-cluster \
  --generate-ssh-keys \
  --node-vm-size $NODE_SIZE \
  --load-balancer-sku standard \
  --node-count $NODE_COUNT --node-osdisk-size $NODE_DISK_SIZE \
  --network-plugin $CNI_PLUGIN \
  --vnet-subnet-id $AKS_SNET_ID \
  --docker-bridge-address 172.17.0.1/16 \
  --dns-service-ip 10.2.0.10 \
  --service-cidr 10.2.0.0/24

mcurko commented 4 years ago

Kubernetes 1.17.11 with VMSS - same problem!

palma21 commented 4 years ago

@mfabiani-av private clusters are not supported during the public preview. We're working to have them supported even before GA; it's next on our plate. https://docs.microsoft.com/en-us/azure/aks/start-stop-cluster#limitations

@mcurko is your case a private cluster as well by any chance?

mcurko commented 4 years ago

What qualifies a cluster to be private? I didn't know there was a distinction between "private" and (I guess) "public" clusters.

vermegi commented 4 years ago

A private cluster is one whose API server is not publicly available; it basically uses Private Link to expose the API server inside your AKS subnet. Described at https://docs.microsoft.com/en-us/azure/aks/private-clusters
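
If you want to check an existing cluster, one way (with $RG_NAME and $CLUSTER_NAME as placeholders) is:

# Returns true for a private cluster, false or null otherwise
az aks show --resource-group $RG_NAME --name $CLUSTER_NAME --query "apiServerAccessProfile.enablePrivateCluster"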

ghost commented 4 years ago

Case being worked with Microsoft Support; adding the stale label for automatic closure if no other reports are added.

fabiolune commented 4 years ago

Maybe I didn't properly understand how the cluster stop works, but when I stop a cluster (powerState correctly shows 'Stopped'), my workloads are still running (and so are all the VMs of the scale set).

jornh commented 4 years ago

+1 on the experience @fabiolune reports (last week I think az aks show gave me "powerState": { "code": null }, but now it's correctly reporting "Stopped") - so progress there.

my workloads are still running (and so are all the VMs of the scale set).

This triggered me to look a bit into the MC_<my_rg>_<my_cluster>_westeurope resource group in the Azure portal. I saw the VMSS was indeed still running. After manually stopping and starting the VMSS in the portal, it came back up, including pods being reported as running by kubectl get po.
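
For a read-only check from the CLI (without changing anything on the VMSS), something along these lines shows the same thing; $VMSS_NAME is a placeholder for the scale set name found in the MC_ group:

# List the scale sets in the managed (MC_) resource group
az vmss list --resource-group MC_<my_rg>_<my_cluster>_westeurope --query "[].name" -o tsv
# Show the aggregated instance view (including power state) of a scale set
az vmss get-instance-view --resource-group MC_<my_rg>_<my_cluster>_westeurope --name $VMSS_NAME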

k8s v1.18.8 (upgraded from v1.17.11). @palma21 I can provide additional cluster details if you haven't already got enough from the OP's case to repro and resolve.

palma21 commented 4 years ago

@fabiolune do you have a support case open by any chance? We'd need to take a look at your specific case. (Feel free to DM me your cluster FQDN on Twitter as well @ jorgefpalma)

@jornh the OP's error is expected, as explained above, since this feature does not support AVSets. We've since explicitly blocked the operation from running in those cases. In your case it seems you're running VM scale sets; please never use the VMSS API directly or AKS might lose the correct state and become inconsistent.

I'm not sure I follow: you mention that your VMs were still running after the cluster stop operation succeeded and the cluster showed as stopped? But then you stopped and started the VMSS manually in the portal? Can you clarify further? (Similarly, feel free to send me the cluster details privately.)

jornh commented 4 years ago

Got it regarding both the OP's AVSets not being supported and, going forward, never using the VMSS API directly. Not toying around with services underneath the managed cluster makes sense (it was only done here on my dev environment to educate myself and debug the case a bit further).

Yes correct, like fabiolune I could confirm:

  1. Workload on VMSS was still running (it responded to incoming requests)
  2. The VMSS was still reported as running in the portal after az aks stop, even though az aks show reported the cluster as stopped.
  3. My (unsupported) VMSS stop experiment through the portal brought the workload to a state where it didn't respond, as you'd expect.

I’ll Twitter DM you the cluster FQDN in a sec. I think the only config maybe worth mentioning is that this is a 1-3 VM autoscale cluster.

fabiolune commented 4 years ago

@palma21 sent you a Twitter DM. In my previous post I forgot to mention an additional detail: after stopping the cluster with the az CLI and getting a correct "Stopped" status, the VMs and the workloads are still running, as I said, but I cannot reach the cluster with kubectl.

palma21 commented 4 years ago

Hi @jornh / @fabiolune, we checked both your cases and the cause is that the cluster autoscaler (CA) was enabled in both. CA prevented your nodes from ever being fully scaled down.

This is actually a scenario we're working on right now, and it will be fully supported at GA, but for now please disable CA before stopping the cluster.
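
Roughly, with $RG_NAME and $CLUSTER_NAME as placeholders (re-enable the autoscaler afterwards with --enable-cluster-autoscaler plus the min/max counts):

# Disable the cluster autoscaler first
az aks update --resource-group $RG_NAME --name $CLUSTER_NAME --disable-cluster-autoscaler
# Then stop the cluster
az aks stop --resource-group $RG_NAME --name $CLUSTER_NAME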

fabiolune commented 4 years ago

Thanks @palma21 for checking. In any case, something has changed, because my daily job that tries to stop the cluster now fails with:

{
  'code': 'DeleteVMSSAgentPoolFailed', 
  'message': 'We are unable to serve this request due to an internal error, Correlation ID: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, Operation ID: xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, Timestamp: 2020-10-12T21:00:17Z.'
}

but az aks show still returns powerState.code Stopped (as you said, presumably because CA is enabled)

palma21 commented 4 years ago

Yes, that's the same issue; essentially your control plane stops, but CA fights back against removing the nodes :)

jornh commented 4 years ago

Thank you @palma21!

I see you even listed this under limitations during preview in the docs: https://github.com/MicrosoftDocs/azure-docs/commit/32b17629b2e707423b1a5698e82cd48ad9cb63c3#diff-d00d6ff5887a8cd73fd804c70802d50b7f327f3283ac0aae493b9e9de6f271b5

Happy to hear you’re working on resolving the limitation before GA.

edit: and yes, I can confirm that after I disabled CA, stop now turns off the VMSS as expected.

Sergei-Vorobyov commented 3 years ago

I repeatedly tried az aks stop/start on clusters while obeying all the conditions in https://docs.microsoft.com/en-us/azure/aks/start-stop-cluster, i.e., using VMSS, disabling autoscaling before stopping, etc. It fails pretty frequently; the longest streak I got was 19 stop/start cycles, but usually it's 2-3. It seems to always break on az aks start with:

ValidationError: Deployment failed. Correlation ID: ... AKS encountered an internal error while attempting the requested None operation. AKS will continuously retry the requested operation until successful or a retry timeout is hit

leaving a cluster in "provisioningState": "Failed"

Subsequent attempts to start result in:

ValidationError: Operation failed with status: 'Bad Request'. Details: managed cluster is in (Failed,Running) state, starting cannot be performed

The failures may well happen, but most importantly, how can we return our clusters to a non-Failed state? For the moment there seems to be no option other than deleting and recreating the cluster. Is there any better way? Thanks!

palma21 commented 3 years ago

Any cluster in this situation can always be recovered by us via a support ticket. Feel free to drop your ticket number so we can follow up internally, look at recovering them, and understand why you're seeing such a large number of failures in your case.

Sergei-Vorobyov commented 3 years ago

Thanks! My Support request ID is 2010290040002472

Sergei-Vorobyov commented 3 years ago

Greetings! People from support suggested that a cluster in a failed state can be fixed by upgrading it to the Kubernetes version it is already running:

az aks upgrade --name aks-sv007 --resource-group rg-sv007 --subscription Hi3G-Infra-Dev --kubernetes-version 1.19.0 --yes

Cluster currently in failed state. Proceeding with upgrade to existing version 1.19.0 to attempt resolution of failed cluster state

This succeeded, at least twice. Could you please advise whether the preview az aks stop / start feature is on the roadmap for the foreseeable future? At present it breaks every few cycles (1-3), but we can temporarily and hopefully fix it with this upgrade trick every time it breaks. Thanks!

ghost commented 3 years ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

ghost commented 3 years ago

This issue will now be closed because it hasn't had any activity for 15 days after being marked stale. vermegi, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question/issue or suggestion.