vermegi closed this issue 3 years ago
Hi vermegi, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.
I might be just a bot, but I'm told my suggestions are normally quite good, as such:
1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!
Hi there! Looking at the error message when you try to start the cluster, it looks like your cluster is using Availability Set nodepools. Would you be able to confirm that for me?
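One quick way to confirm (a sketch with hypothetical resource group and cluster names) is to query the agent pool type:
# Hypothetical names; prints "VirtualMachineScaleSets" or "AvailabilitySet" for each node pool
az aks show --resource-group myResourceGroup --name myAKSCluster \
  --query "agentPoolProfiles[].type" -o tsv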
I'm experiencing the same problem when I try to start the cluster and I can confirm that in my case I am using availability sets.
I'm using VirtualMachineScaleSets and running into this issue when trying to start the cluster:
managed cluster is in (Succeeded,Stopped) state, starting cannot be performed
k8s v. 1.19.0
This feature is only supported for VMSS-based clusters. This should be the case for all new features in AKS.
@jeanfrancoislarente 1.19 just launched in preview, so there might be something there. Are you able to open a support ticket and share the ticket number here? Otherwise send me a DM with your cluster details.
Hi there :wave: AKS bot here. This issue has been tagged as needing a support request so that the AKS support and engineering teams have a look into this particular cluster/issue.
Follow the steps here to create a support ticket for Azure Kubernetes Service and the cluster discussed in this issue.
Please do mention this issue in the case description so our teams can coordinate to help you.
Thank you!
I'm running a private VMSS-based cluster with K8s 1.16, same issue: aks stop results in a "null" status and then it cannot be restarted. This is the command I use to create the cluster:
az aks create --resource-group $RG_NAME --name $CLUSTER_NAME \
--kubernetes-version $VERSION \
--location $LOCATION \
--subscription $SUBSCRIPTION \
--enable-private-cluster \
--generate-ssh-keys \
--node-vm-size $NODE_SIZE \
--load-balancer-sku standard \
--node-count $NODE_COUNT --node-osdisk-size $NODE_DISK_SIZE \
--network-plugin $CNI_PLUGIN \
--vnet-subnet-id $AKS_SNET_ID \
--docker-bridge-address 172.17.0.1/16 \
--dns-service-ip 10.2.0.10 \
--service-cidr 10.2.0.0/24
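For completeness, roughly the stop-and-check sequence I run afterwards (the exact flags are an assumption; same placeholder variables as the create command above, and the powerState query is where the "null" shows up):
az aks stop --resource-group $RG_NAME --name $CLUSTER_NAME --subscription $SUBSCRIPTION
# powerState should read Stopped, but it comes back null
az aks show --resource-group $RG_NAME --name $CLUSTER_NAME --subscription $SUBSCRIPTION \
  --query powerState -o json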
Kubernetes 1.17.11 with VMSS - same problem!
@mfabiani-av private clusters are not supported during public preview. We're working to have support even before GA; it's next on our plate. https://docs.microsoft.com/en-us/azure/aks/start-stop-cluster#limitations
@mcurko is your case a private cluster as well by any chance?
What qualifies a cluster to be private? I didn't know there was a distinction between "private" and (I guess) "public" clusters.
A private cluster is one whose API server is not publicly available; it basically uses Private Link to make it available inside your AKS subnet. Described at https://docs.microsoft.com/en-us/azure/aks/private-clusters
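If you want to check whether an existing cluster is private, one option (a sketch, hypothetical names) is to look at its API server access profile:
# Prints true for a private cluster; empty or false otherwise (hypothetical names)
az aks show --resource-group myResourceGroup --name myAKSCluster \
  --query "apiServerAccessProfile.enablePrivateCluster" -o tsv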
Case being worked with Microsoft Support, adding the stale label for automatic closure if no other reports are added.
Maybe I didn't properly understand how the cluster stop works, but when I stop a cluster (powerState correctly shows 'Stopped'), my workloads are still running (and so are all the VMs of the scale set).
+1 on the experience @fabiolune reports (last week I think az aks show gave me "powerState": { "code": null }, but now it's correctly reporting "Stopped"), so progress there.
"my workloads are still running (and so all the VMs of the scale set)" - this triggered me to look a bit into the MC_<my_rg>_<my_cluster>_westeurope resource group in the Azure portal. I saw the VMSS was indeed still running. When trying to manually stop and start the VMSS in the portal, it came up again, including pods being reported as running through kubectl get po.
k8s v1.18.8 (upgraded from v1.17.11). @palma21 I can provide additional cluster details if you don't already have enough from the OP's case to repro and resolve.
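For anyone who wants to peek at the node resource group without touching anything, a read-only sketch (hypothetical names) looks like this:
# Find the managed (MC_*) resource group behind the cluster (hypothetical names)
NODE_RG=$(az aks show --resource-group myResourceGroup --name myAKSCluster \
  --query nodeResourceGroup -o tsv)
# List the scale sets AKS created there and inspect their instance state (read-only)
az vmss list --resource-group "$NODE_RG" -o table
az vmss get-instance-view --resource-group "$NODE_RG" --name <scale-set-name> --instance-id "*"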
@fabiolune do you have a support case open by any chance? We'd need to take a look at your specific case. (Feel free to DM me your cluster FQDN on Twitter as well: @ jorgefpalma)
@jornh the OP error is expected, as explained above, since this feature does not support AVSets. We've since explicitly blocked the operation from running in those cases. In your case it seems you're running VM scale sets; please never use the VMSS API directly or AKS might lose the correct state and become inconsistent.
I'm not sure I follow: you mention that your VMs were still running after the cluster stop operation succeeded and the cluster showed as stopped? But then you stopped and started the VMSS manually in the portal? Can you clarify further? (Similarly, feel free to send me the cluster details privately.)
Got it regarding both the OP's AVSet not being supported and, going forward, never using the VMSS API directly. Not toying around with services underneath the managed cluster makes sense (it was only done here on my dev environment to educate myself and debug the case a bit further).
Yes, correct. Like fabiolune, I could confirm that after az aks stop, az aks show reported the cluster as stopped. I'll Twitter DM you the cluster FQDN in a sec. I think the only config maybe worth mentioning is that this is a 1-3 VM autoscale cluster.
@palma21 sent you a twitter DM
In my previous post I forgot to mention an additional detail: after stopping the cluster with the az CLI and getting a correct "Stopped" status, as I said the VMs and the workloads are still running, but I cannot reach the cluster with kubectl.
Hi @jornh / @fabiolune, we checked both your cases and it's because cluster autoscaling (CA) was enabled in both. CA prevented your nodes from ever being fully scaled down.
This is actually a scenario we're working on right now and it will be fully supported at GA, but for now please disable CA before stopping the cluster.
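For a cluster where CA was enabled at the cluster level, the sequence would look roughly like this (a sketch with hypothetical names; node pools with their own autoscaler settings would use az aks nodepool update instead):
# Disable the cluster autoscaler first, then stop (hypothetical names)
az aks update --resource-group myResourceGroup --name myAKSCluster --disable-cluster-autoscaler
az aks stop --resource-group myResourceGroup --name myAKSCluster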
Thanks @palma21 for checking. In any case, something has changed, because my daily job that tries to stop the cluster now fails with
{
'code': 'DeleteVMSSAgentPoolFailed',
'message': 'We are unable to serve this request due to an internal error, Correlation ID: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, Operation ID: xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, Timestamp: 2020-10-12T21:00:17Z.'
}
but az aks show still returns powerState.code as Stopped (as you said, presumably because CA is enabled).
Yes, that's the same thing: essentially your control plane stops, but CA is fighting back on removing the nodes :)
Thank you @palma21!
I see you even listed this under limitations during preview in the docs: https://github.com/MicrosoftDocs/azure-docs/commit/32b17629b2e707423b1a5698e82cd48ad9cb63c3#diff-d00d6ff5887a8cd73fd804c70802d50b7f327f3283ac0aae493b9e9de6f271b5
Happy to hear you’re working on resolving the limitation before GA.
Edit: and yes, I can confirm that after I disabled CA, stop now turns off the VMSS as expected.
I repeatedly tried az aks stop/start on clusters obeying all the conditions in https://docs.microsoft.com/en-us/azure/aks/start-stop-cluster, i.e., VMSS, disabling the autoscaler before stopping, etc. It fails pretty frequently; the longest streak I got was 19 stop/start cycles, but usually it's 2-3. It seems to always break on az aks start with:
ValidationError: Deployment failed. Correlation ID: ... AKS encountered an internal error while attempting the requested None operation. AKS will continuously retry the requested operation until successful or a retry timeout is hit
leaving a cluster in "provisioningState": "Failed"
Subsequent attempts to start result in:
ValidationError: Operation failed with status: 'Bad Request'. Details: managed cluster is in (Failed,Running) state, starting cannot be performed
The failures may happen all right, but most importantly, how can we return our clusters to a non-Failed state? For the moment, there seems to be no option other than deleting and recreating the cluster. Is there a better way? Thanks!
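For what it's worth, the two values in that error, e.g. (Failed,Running), appear to map to the cluster's provisioningState and powerState, which can be checked like this (a sketch, hypothetical names):
az aks show --resource-group myResourceGroup --name myAKSCluster \
  --query "{provisioningState: provisioningState, powerState: powerState.code}" -o table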
Any cluster in this situation can always be recovered by us via a support ticket. Feel free to drop your ticket number so we may follow up internally, look at recovering them, and understand why you're seeing such a large number of failures in your cases.
Thanks! My Support request ID is 2010290040002472
Greetings!
People from support suggested that a cluster in a failed state can be fixed by upgrading to the same k8s version, as in
az aks upgrade --name aks-sv007 --resource-group rg-sv007 --subscription Hi3G-Infra-Dev --kubernetes-version 1.19.0 --yes
Cluster currently in failed state. Proceeding with upgrade to existing version 1.19.0 to attempt resolution of failed cluster state
and it succeeded, at least twice.
Could you please advise whether the preview az aks stop/start feature is on the roadmap for the foreseeable future? At present it breaks every few cycles (1-3), but hopefully we can temporarily fix it with this upgrade trick every time it breaks.
Thanks!
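A small sketch of that recovery trick as a script (hypothetical names; it simply re-applies the cluster's current version whenever the provisioning state is Failed):
RG=myResourceGroup
CLUSTER=myAKSCluster
STATE=$(az aks show -g "$RG" -n "$CLUSTER" --query provisioningState -o tsv)
VERSION=$(az aks show -g "$RG" -n "$CLUSTER" --query kubernetesVersion -o tsv)
if [ "$STATE" = "Failed" ]; then
  # Upgrading to the same version nudges AKS into reconciling the failed state
  az aks upgrade -g "$RG" -n "$CLUSTER" --kubernetes-version "$VERSION" --yes
fi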
This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.
This issue will now be closed because it hasn't had any activity for 15 days after being marked stale. vermegi, feel free to comment again in the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.
I used the AKS start/stop preview feature following the guidance at https://docs.microsoft.com/en-us/azure/aks/start-stop-cluster; however, my cluster was not stopped correctly.
When I query the state of the cluster through az aks show, the powerState property reports a value of 'null'.
I cannot stop the cluster a second time. This gives a BadRequest error:
ValidationError: Operation failed with status: 'Bad Request'. Details: managed cluster is in (Succeeded,Stopped) state, stopping cannot be performed
I cannot start the cluster again:
ValidationError: Operation failed with status: 'Bad Request'. Details: Client Error: Availability Sets Not Supported
The worker node virtual machine associated with my cluster is also still in a running state.
What you expected to happen:
The cluster to go into a proper stopped state and my worker node(s) to be stopped as well. If it turns out the version I am running is too old for stopping the cluster, I would expect a proper error message telling me to first upgrade the cluster to the latest version before issuing a stop command.
How to reproduce it (as minimally and precisely as possible):
Create a 1-node cluster with version 1.16.9 and run through the steps at https://docs.microsoft.com/en-us/azure/aks/start-stop-cluster.
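A minimal sketch of those commands (hypothetical names; the exact create command may differ, but an Availability Set cluster can be requested explicitly with --vm-set-type AvailabilitySet):
az aks create --resource-group myResourceGroup --name myAKSCluster \
  --kubernetes-version 1.16.9 --node-count 1 \
  --vm-set-type AvailabilitySet --generate-ssh-keys
az aks stop --resource-group myResourceGroup --name myAKSCluster
# powerState comes back null instead of Stopped
az aks show --resource-group myResourceGroup --name myAKSCluster --query powerState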
Environment:
Kubernetes version (use kubectl version): 1.16.9 (value in the portal)