davidjsanders closed this issue 5 years ago
Our US Central cluster has also been down with this since approximately 19 hours ago. Strangely enough, creating a new cluster in the same region does work.
One of our clusters also went down yesterday evening in West US. Recreating it works, but there must be a better workaround for this.
Just started seeing this issue in US East a few hours ago. Never experienced it before. Error occurs whether using Azure Cloud Shell or my local system.
Hi. I'm also experiencing this issue from US East using my local system.
I'm also experiencing the issue on two of our clusters in the US East region.
We're also experiencing the issue on two of our clusters in the US East region.
We've also been experiencing this issue for the last 7 hours. One cluster in US East.
Same here... Will see if I can find what the issue is - it just started happening. Last time I was working with my AKS cluster was two/three days ago. Everything was working fine.
Me too, in West Europe.
Same issue here, the whole cluster is down... This kind of unreliability should not be this frequent. Experiencing the issue in West Europe. This is why I am considering moving away from the whole Azure platform...
Same problem in West Europe. Can we get some information about when this is going to be fixed?
Same issue here. Our cluster is deployed in East US.
Wow! These reports clearly show that AKS is not even ready for preview! Glad I switched to ACS quickly!
Recreating the cluster in East US seems to work.
The cluster was recreated with:
az aks create --resource-group myResourceGroup --name myAKSCluster --node-count 1 --generate-ssh-keys
Yes, the only solution for us as well was to recreate the cluster.
I just received an email from Microsoft:
“After working with my backend engineer, it appears that one of the customer's nginx ingress controllers was stale. We restarted the nginx ingress controller pods on the cluster and the API server is now responding 100%.“
Do you still have the issue? We also recreated the cluster, so we can't check.
The issue was resolved for me.
The issue was resolved for me too. I'm able to access cluster resources again.
Via Azure support, I was told it was a backplane issue that engineering just resolved (at least for our account). We're back up and running without having to recreate the cluster. :+1:
So this looks like a game of "will the cluster be available?". It was available yesterday, but not today. I've cleared out the .kube directory and re-fetched the credentials, but... you get the idea. Any real steps I can take?
I just got the same error again today.
I am getting the TLS handshake error at 2:30 PM EST in East US:
kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout
Also, for me, kubectl/API calls from my laptop do not work; they work from Azure Cloud Shell only.
I discovered the cause of my issue. In the portal my AKS cluster is still listed as "Creating...". It's been like that for several days now.
I tried a different region, with the default VM size, and that worked. It still took a long time to go from "Creating..." to normal, but it did get there eventually. Then all the subsequent commands worked.
The solution for me was to temporarily scale the cluster nodes up by 1 and then, once the new node launched, connect. I was then successful and could scale the cluster back down to the original size.
Full background can be found over here: https://stackoverflow.com/questions/50726534/unable-to-connect-net-http-tls-handshake-timeout-why-cant-kubectl-connect
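For anyone wanting to script the scale-up/scale-down workaround, here is a minimal sketch using the Azure CLI. The resource group and cluster names are placeholders, and a single agent pool is assumed:

```shell
# Placeholder names; substitute your own resource group and cluster.
RG=myResourceGroup
CLUSTER=myAKSCluster

# Read the current node count (assumes a single agent pool).
CURRENT=$(az aks show --resource-group "$RG" --name "$CLUSTER" \
  --query "agentPoolProfiles[0].count" --output tsv)

# Temporarily scale up by one node.
az aks scale --resource-group "$RG" --name "$CLUSTER" --node-count $((CURRENT + 1))

# Once the new node is up, verify connectivity...
kubectl get nodes

# ...then scale back down to the original size.
az aks scale --resource-group "$RG" --name "$CLUSTER" --node-count "$CURRENT"
```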
Same problem here. Sometimes I just cannot use kubectl.
@emanuelecasadio AKS is now GA. Make sure you have either upgraded or have the necessary patches installed.
I am still facing this issue while running the "kubectl get nodes" command. I have tried the following but with no luck :(
@SnehaJosephSTS - we had to re-create our cluster after AKS went GA. Haven't had the issue since then. Upgrade for us did not work, nor did scaling.
I am getting the error this morning while trying to get nodes on a new cluster in East US.
I am getting the same issue in East US. I enabled RBAC with the AKS create command:
az aks create --resource-group my-AKS-resource-group --name my-AKS-Cluster --node-count 3 --generate-ssh-keys --enable-rbac
kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout
There are many reasons behind a TLS handshake timeout error. For clusters created before AKS GA, we highly recommend that customers create a new cluster and redeploy their system there.
We also recommend that customers upgrade their clusters to stay on the latest supported Kubernetes version, or at most one version behind it.
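The upgrade path can be checked and applied with the Azure CLI; a sketch with placeholder names (the target version shown is only an example, so list the real options first):

```shell
# List the Kubernetes versions this cluster can upgrade to.
az aks get-upgrades --resource-group myResourceGroup --name myAKSCluster --output table

# Upgrade to a version from that list (1.11.3 here is illustrative).
az aks upgrade --resource-group myResourceGroup --name myAKSCluster \
  --kubernetes-version 1.11.3
```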
Also, make sure your cluster is not overloaded, meaning you haven't maxed out the usable CPU and memory on the agent nodes. We've seen many cases where someone scales a cluster down from X nodes to 1 (X being 5 or above): they might be running a lot of pods on the cluster, and now all of them will be evicted and redeployed to the only node left, which can interrupt the connection to the control plane. And if the node VM is very small, it can leave pods with no place to schedule, including some mission-critical pods (addons in kube-system).
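A quick way to check for the overload condition described above before scaling down (this assumes the metrics-server addon is available for kubectl top):

```shell
# Current CPU/memory usage per node (needs the metrics-server addon).
kubectl top nodes

# Requested vs. allocatable resources per node.
kubectl describe nodes | grep -A 5 "Allocated resources"

# How many running pods would need to be rescheduled on a scale-down.
kubectl get pods --all-namespaces --field-selector=status.phase=Running -o name | wc -l
```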
If, after all this diagnosis, you still suffer from this issue, please don't hesitate to send an email to aks-help@service.microsoft.com.
And if the node VM is very small, it can leave pods with no place to schedule, including some mission-critical pods
Isn't that a very big issue?
I've had many clusters break irreparably in this way. This bug doesn't just happen when scaling to 1: I've seen it happen when scaling nodes both up and down while there are too many pods. In my experience, AKS scaling while there are unscheduled pods tends to cause the cluster to break catastrophically, more often than not. The workaround is to delete the whole cluster and redeploy on a new one.
Thankfully I'm not dealing with a production workload, but imagine if I was. I'd be livid. I don't think I would ever choose to deploy a real production workload on AKS, because of this bug.
Is it possible to somehow get the scheduler to prioritise the system pods over the workload pods?
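On the scheduler question above: since Kubernetes 1.11, the kube-system addons run under the built-in system-cluster-critical / system-node-critical priority classes, so one option is to give application workloads an explicitly lower priority. A minimal sketch, with a hypothetical class name and value (nothing AKS-specific):

```yaml
apiVersion: scheduling.k8s.io/v1beta1   # scheduling.k8s.io/v1 from Kubernetes 1.14 onward
kind: PriorityClass
metadata:
  name: workload-default     # hypothetical name
value: 1000                  # far below the ~2 billion used by the system-* classes
globalDefault: true          # pods with no priorityClassName get this value
description: "Low default priority so system addons preempt app workloads"
```

With globalDefault: true, new pods without an explicit priorityClassName pick up this value, and the scheduler can preempt them before the kube-system addons.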
After lots of back and forth with Azure support, we arrived at this workaround. I have yet to try it, as they fixed the issue on their end; however, it might help someone else facing this.
Anyway, here's their message:
This usually means that tunnelfront cannot connect to tunnelend.
1. SSH to the agent node that is running the tunnelfront pod.
2. Get the tunnelfront logs: "docker ps" -> "docker logs <tunnelfront_container_id>"
3. "nslookup <ssh-server_fqdn>" (the FQDN can be obtained from the command above). If it resolves to an IP, DNS works; go to the next step.
4. "ssh -vv azureuser@<ssh-server_fqdn> -p 9000". If the port is working, go to the next step.
5. "docker exec -it <tunnelfront_container_id> /bin/bash", then type "ping google.com". If there is no response, the tunnelfront pod doesn't have external network access; do the following step.
6. Restart kube-proxy using "kubectl delete po <kube-proxy_pod> -n kube-system", choosing the kube-proxy that is running on the same node as tunnelfront (find it with "kubectl get po -n kube-system -o wide").
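The six steps above can be sketched as one session. Everything here is illustrative; the <...> placeholders come from the steps themselves and must be filled in from your own cluster:

```shell
# Locate the tunnelfront pod and the node it runs on.
kubectl get po -n kube-system -o wide | grep tunnelfront

# On that agent node, find the tunnelfront container and read its logs.
TUNNEL_ID=$(docker ps --filter "name=tunnelfront" --format '{{.ID}}' | head -n 1)
docker logs "$TUNNEL_ID"

# DNS and port checks for the tunnel endpoint (FQDN taken from the logs above).
# nslookup <ssh-server_fqdn>
# ssh -vv azureuser@<ssh-server_fqdn> -p 9000

# If the pod has no external network, restart the kube-proxy pod that shares
# the node with tunnelfront.
kubectl get po -n kube-system -o wide | grep kube-proxy
# kubectl delete po <kube-proxy_pod> -n kube-system
```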
P.S. Dear Azure team,
We should NOT close this issue as the bug still occurs from time to time. This is not considered an acceptable workaround. It's a mitigation for those whose clusters are stuck and cannot access logs, exec, or helm deployments. We still need a permanent fix designed for failure of either tunnelfront or tunnelend.
Would be nice if you could also explain what tunnelfront and tunnelend are and how they work. Why are we, consumers of AKS, responsible for maintaining Azure's buggy workloads?
Created a new cluster after GA and now, all of a sudden, I'm getting a bunch of TLS handshake timeouts from AKS. This does not give the feeling that AKS is anywhere near GA quality.
Yeah we run into this frequently, AKS master node availability is terrible. Constantly going down, timing out requests (nginx-ingress, even some of our applications that talk to k8s)... We don't run into any of these issues with GKE or kops environments. Not sure if this is anywhere near GA.
EDIT: As I wrote this, our cluster has been unavailable for the last 20+ minutes, saying "TLS handshake timeout". :unamused:
I set up a cluster with one node and wanted to investigate differences from the GCP deployment, essentially doing a dry run (our production deployment is on Google Cloud's Kubernetes, but we're doing an Azure deployment for a client).
However, it seems like all kubectl operations, as well as az browse, fail with:
net/http: TLS handshake timeout
az version 2.0.46, kubectl 1.9.7
I just had this happen to me. It appeared out of the blue, then went away a few hours later, after I restarted my nodes a few times and killed most of my deployments. I'm not sure if that's what fixed it, or if whatever was truly causing the issue just went away.
Some notes from my investigation:
Curl to the host at port 443 used by my kubeconfig would get through until it failed the TLS handshake. However, the instant I put a proper API call in there, it would usually either fail to connect at all or time out mid-handshake. Every once in a while it would get through and fail the handshake. This mirrored the behaviour I saw with kubectl, which makes me suspect that there is an issue with whatever backend service the API calls are routed to.
I'm using istio and helm/tiller
2-node cluster.
kubectl version 1.12.0
kubernetes version 1.10.7
canadacentral
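To reproduce the handshake behaviour described above without kubectl, one can curl the API server directly. A sketch, reading the server address from the active kubeconfig:

```shell
# Pull the API server URL out of the current kubeconfig context.
API_SERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')

# Bare TLS request against the API; -k skips cert verification, --max-time
# bounds a stalled handshake. Watch the -v output for where it hangs.
curl -vk --max-time 10 "$API_SERVER/api"
```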
Brand new cluster today... been online for just a few hours and TLS handshake timeouts. 👎
Any update on this issue? We're still experiencing it
We've been hitting this for a year, and the explanation earlier was that we were using a preview version of the AKS cluster. Now we've moved to a new cluster (supposedly created after GA) and are still seeing it. I think it's worth bumping the priority, as the issue has been around for a long while and is affecting a lot of folks.
I've found the solution!!!
and that is? @adamsem
Migrate to AWS :)
Hi Everyone;
AKS has rolled out a lot of enhancements and improvements to mitigate this, including auto-detection of hung/blocked API servers, kubelets, and proxies. One of the final components is to scale up the master components to meet the overall load against the master APIs.
This GitHub issue contains a lot of cluster-specific reports. As we cannot safely request the data for your accounts to do deeper introspection here on GitHub, I'd ask that you please file Azure technical support issues for diagnosis (these support issues get routed to our back-end on-call team as needed for resolution).
Additionally, the errors displayed can also correlate with underlying service updates in some cases (especially if you are seeing them randomly, for a limited amount of time). This will be helped by the auto-scaling (increased master count) being worked on.
For issues that come up after I close this, please file new GitHub issues that include instructions for reproduction on any AKS cluster (i.e. general, not tied to your app or cluster). This will help support and engineering debug.
kubectl get pods --insecure-skip-tls-verify=true gives the error below:
Unable to connect to the server: net/http: TLS handshake timeout
Build step 'Execute shell' marked build as failure
This command works on the Jenkins server but fails when run via a Jenkins job.
Hi, when I create an AKS cluster, I'm receiving a timeout on the TLS handshake. The cluster creates okay with the following commands:
The response from the create command is a JSON object:
{
  "id": "/subscriptions/OBFUSCATED/resourcegroups/dsK8S/providers/Microsoft.ContainerService/managedClusters/dsK8SCluster",
  "location": "westus2",
  "name": "dsK8SCluster",
  "properties": {
    "accessProfiles": {
      "clusterAdmin": { "kubeConfig": "OBFUSCATED" },
      "clusterUser": { "kubeConfig": "OBFUSCATED" }
    },
    "agentPoolProfiles": [
      {
        "count": 2,
        "dnsPrefix": null,
        "fqdn": null,
        "name": "agentpool1",
        "osDiskSizeGb": null,
        "osType": "Linux",
        "ports": null,
        "storageProfile": "ManagedDisks",
        "vmSize": "Standard_A2",
        "vnetSubnetId": null
      }
    ],
    "dnsPrefix": "dasanderk8",
    "fqdn": "dasanderk8-d55f0987.hcp.westus2.azmk8s.io",
    "kubernetesVersion": "1.8.1",
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [ { "keyData": "OBFUSCATED" } ]
      }
    },
    "provisioningState": "Succeeded",
    "servicePrincipalProfile": {
      "clientId": "OBFUSCATED",
      "keyVaultSecretRef": null,
      "secret": null
    }
  },
  "resourceGroup": "dsK8S",
  "tags": null,
  "type": "Microsoft.ContainerService/ManagedClusters"
}
I've now torn down this cluster, but this has happened three times today.
Any help?
David