Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Unable to connect to the server: net/http: TLS handshake timeout #14

Closed: davidjsanders closed this issue 5 years ago

davidjsanders commented 6 years ago

Hi, when I create an AKS cluster, I'm receiving a timeout on the TLS handshake. The cluster creates okay with the following commands:

az group create --name dsK8S --location westus2

az aks create \
  --resource-group dsK8S \
  --name dsK8SCluster \
  --generate-ssh-keys \
  --dns-name-prefix dasanderk8 \
  --kubernetes-version 1.8.1 \
  --agent-count 2 \
  --agent-vm-size Standard_A2

az aks get-credentials --resource-group dsK8S --name dsK8SCluster

The response from the create command is a JSON object:

{
  "id": "/subscriptions/OBFUSCATED/resourcegroups/dsK8S/providers/Microsoft.ContainerService/managedClusters/dsK8SCluster",
  "location": "westus2",
  "name": "dsK8SCluster",
  "properties": {
    "accessProfiles": {
      "clusterAdmin": { "kubeConfig": "OBFUSCATED" },
      "clusterUser": { "kubeConfig": "OBFUSCATED" }
    },
    "agentPoolProfiles": [
      {
        "count": 2,
        "dnsPrefix": null,
        "fqdn": null,
        "name": "agentpool1",
        "osDiskSizeGb": null,
        "osType": "Linux",
        "ports": null,
        "storageProfile": "ManagedDisks",
        "vmSize": "Standard_A2",
        "vnetSubnetId": null
      }
    ],
    "dnsPrefix": "dasanderk8",
    "fqdn": "dasanderk8-d55f0987.hcp.westus2.azmk8s.io",
    "kubernetesVersion": "1.8.1",
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": { "publicKeys": [ { "keyData": "OBFUSCATED" } ] }
    },
    "provisioningState": "Succeeded",
    "servicePrincipalProfile": {
      "clientId": "OBFUSCATED",
      "keyVaultSecretRef": null,
      "secret": null
    }
  },
  "resourceGroup": "dsK8S",
  "tags": null,
  "type": "Microsoft.ContainerService/ManagedClusters"
}

I've now torn down this cluster but this has happened three times today.

Any help?

David

alonisser commented 6 years ago

Our US Central cluster is also down with this today (since approximately 19 hours ago). Strangely enough, creating a new cluster in the same region does work.

giorgited commented 6 years ago

One of our clusters also went down yesterday evening in West US. Recreating works, but there must be a better workaround for this.

ericbarch commented 6 years ago

Just started seeing this issue in US East a few hours ago. Never experienced it before. Error occurs whether using Azure Cloud Shell or my local system.

traffk-viet commented 6 years ago

Hi. I'm also experiencing this issue from US East using my local system.

benclapp commented 6 years ago

I'm also experiencing the issue on two of our clusters in the us east region.

g-vista-group commented 6 years ago

We're also experiencing the issue on two of our clusters in the us east region.

CraigCarpenter commented 6 years ago

We've also been experiencing this issue for the last 7 hours. One cluster in US East.

rikkigouda commented 6 years ago

Same here... I'll see if I can find what the issue is; it just started happening. The last time I was working with my AKS cluster was two or three days ago, and everything was working fine.

Martin-Aulich commented 6 years ago

Me too, in West Europe.

valdemarrolfsen commented 6 years ago

Same issue here, the whole cluster is down... This kind of unreliability should not be this frequent. I'm experiencing the issue in West Europe. This is why I am considering moving away from the Azure platform altogether...

florinciubotariu commented 6 years ago

Same problem for West Europe. Can we get some information about when this is going to be fixed?

rakeshv1 commented 6 years ago

Same issue here. Our cluster is deployed in East US.

ziXet commented 6 years ago

Wow! These reports clearly show that AKS is not even ready for preview! Glad I switched to ACS quickly!

florinciubotariu commented 6 years ago

Recreating the cluster in East US seems to work. The cluster was recreated with:

az aks create --resource-group myResourceGroup --name myAKSCluster --node-count 1 --generate-ssh-keys

my3sons commented 6 years ago

yes, the only solution for us as well was to recreate the cluster

gourlaa commented 6 years ago

I just received an email from Microsoft:

“After working with my backend engineer, it appears that one of the customer's nginx ingress controllers was stale. We restarted the nginx ingress controller pods on the cluster and the API server is now responding 100%.”

Do you still have the issue? We also recreated the cluster, so we can't check.
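
For anyone who hits the same state and wants to try that mitigation before recreating the cluster, a minimal sketch follows. It assumes kubectl can still reach the API server at least intermittently, and that the ingress controller runs in an ingress-nginx namespace with an app=nginx-ingress label; both the namespace and the label are assumptions that vary by installation.

# List the ingress controller pods (namespace and label are assumptions; adjust to your setup).
kubectl get pods -n ingress-nginx -l app=nginx-ingress -o wide

# Deleting the pods lets their Deployment or DaemonSet recreate them, which amounts to a restart.
kubectl delete pods -n ingress-nginx -l app=nginx-ingress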

CraigCarpenter commented 6 years ago

The issue was resolved for me.

gustavotroisgarcia commented 6 years ago

The issue was resolved for me too. I'm able to access cluster resources again.

ericbarch commented 6 years ago

Via Azure support, I was told it was a backplane issue that engineering just resolved (at least for our account). We're back up and running without having to recreate the cluster. :+1:

4c74356b41 commented 6 years ago

So this looks like a game of "will the cluster be available?" It was yesterday, but not today. I've cleared out the .kube directory and re-grabbed the credentials, but... you get the idea. Are there any real steps I can take?

mdavis-xyz commented 6 years ago

I just got the same error again today.

raycrawford commented 6 years ago

I am getting the TLS handshake error at 2:30 PM EST in East US:

kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout

4c74356b41 commented 6 years ago

Also, for me kubectl/API calls from my laptop do not work; they work from Azure Cloud Shell only.

mdavis-xyz commented 6 years ago

I discovered the cause of my issue. In the portal my AKS cluster is still listed as "Creating...". It's been like that for several days now.

I tried a different region, with the default VM size, and that worked. It still took a long time to go from "Creating..." to normal, but it did get there eventually. Then all the subsequent commands worked.
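
For anyone in a similar state, a small sketch of checking the provisioning state from the CLI, reusing the resource group and cluster names from the original report as placeholders:

# Anything other than "Succeeded" (for example "Creating" or "Failed")
# suggests the control plane is not ready to serve requests yet.
az aks show \
  --resource-group dsK8S \
  --name dsK8SCluster \
  --query provisioningState \
  --output tsv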

necevil commented 6 years ago

The solution for me was to scale the cluster nodes up by 1 (temporarily) and then, once the new node launched, connect. I was then successful and could scale the cluster back down to the original size.

Full background can be found over here: https://stackoverflow.com/questions/50726534/unable-to-connect-net-http-tls-handshake-timeout-why-cant-kubectl-connect
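
A rough sketch of that workaround with the az CLI, reusing the placeholder names from earlier in the thread; the node counts are examples, so substitute one above and then exactly your current size:

# Check the current node count of the (assumed single) agent pool.
az aks show --resource-group myResourceGroup --name myAKSCluster --query "agentPoolProfiles[0].count"

# Temporarily add one node (example: from 2 to 3).
az aks scale --resource-group myResourceGroup --name myAKSCluster --node-count 3

# Once the new node is up, verify that the API server responds again.
kubectl get nodes

# Scale back down to the original size.
az aks scale --resource-group myResourceGroup --name myAKSCluster --node-count 2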

emanuelecasadio commented 6 years ago

Same problem here. Sometimes I just cannot use kubectl.

novitoll commented 6 years ago

@emanuelecasadio AKS is now in GA. Make sure you have either upgraded or have the necessary patches installed.

SnehaJosephSTS commented 6 years ago

I am still facing this issue while running the "kubectl get nodes" command. I have tried the following (roughly equivalent CLI commands are sketched below the list), but with no luck :(

  1. Upgrading Kubernetes Version
  2. Increasing the node count via the portal and then running "kubectl get nodes".
  3. Re-logging via "az login"
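
For reference, a sketch of roughly equivalent CLI commands for those three steps; the resource group, cluster name, target version, and node count are placeholders, not values taken from this thread:

# 1. Check available upgrades, then upgrade the cluster.
az aks get-upgrades --resource-group myResourceGroup --name myAKSCluster --output table
az aks upgrade --resource-group myResourceGroup --name myAKSCluster --kubernetes-version 1.11.3

# 2. Increase the node count (same effect as scaling in the portal).
az aks scale --resource-group myResourceGroup --name myAKSCluster --node-count 4

# 3. Log in again and refresh the cluster credentials.
az login
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster --overwrite-existing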

c-mccutcheon commented 6 years ago

@SnehaJosephSTS - we had to re-create our cluster after AKS went GA. Haven't had the issue since then. Upgrade for us did not work, nor did scaling.

danielrmartin commented 6 years ago

I am getting the error this morning while trying to get nodes on a new cluster in East US.

jawahar16 commented 6 years ago

I am getting the same issue in East US. I enabled RBAC with the AKS create command:

az aks create --resource-group my-AKS-resource-group --name my-AKS-Cluster --node-count 3 --generate-ssh-keys --enable-rbac

kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout

qike-ms commented 6 years ago

There are many reasons behind the TLS handshake timeout error. For clusters created before AKS GA, we highly recommend that customers create a new cluster and redeploy their system there.

We also recommend that customers upgrade their clusters to stay on the latest supported Kubernetes version, or one version behind it.

Also make sure your cluster is not overloaded, meaning you haven't maxed out the usable CPU and memory on the agent nodes. We've seen many cases where someone scales a cluster down from X nodes to 1, X being 5 or above, and the connection to the control plane is interrupted: they might be running a lot of pods on the cluster, and now all of them are evicted and redeployed to the only node left. And if the node VM is very small, it can leave pods with no place to schedule, including some mission-critical pods (add-ons in kube-system).
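
A quick, hedged way to check for that kind of resource pressure ("kubectl top" assumes a metrics add-on such as metrics-server is installed):

# Pods, including kube-system add-ons, that currently have nowhere to schedule.
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Per-node CPU and memory usage (requires the metrics add-on).
kubectl top nodes

# Requested vs. allocatable resources on each node.
kubectl describe nodes | grep -A 5 "Allocated resources"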

If, after all this diagnosis, you still suffer from the issue, please don't hesitate to send an email to aks-help@service.microsoft.com.

mdavis-xyz commented 6 years ago

“And if the node VM is very small, it can leave pods with no place to schedule, including some mission-critical pods”

Isn't that a very big issue?

I've had many clusters break irreparably in this way. This bug doesn't just happen when scaling to 1. I've seen it happen when scaling nodes both up and down while there are too many pods. In my experience, AKS scaling when there are unscheduled pods tends to cause the cluster to break catastrophically, more often than not. The workaround is to delete the whole cluster and redeploy on a new one.

Thankfully I'm not dealing with a production workload, but imagine if I was. I'd be livid. I don't think I would ever choose to deploy a real production workload on AKS, because of this bug.

Is it possible to somehow get the scheduler to prioritise the system pods over the workload pods?
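
On reasonably recent Kubernetes versions the scheduler already does something like this through pod priority and preemption: kube-system add-ons typically run with the built-in system-cluster-critical or system-node-critical priority classes, so they can preempt lower-priority workload pods. Whether a given AKS cluster version supports and uses this is an assumption to verify; a small sketch for checking:

# Built-in priority classes used for critical system pods (Kubernetes 1.11+).
kubectl get priorityclass

# See which priority class, if any, each kube-system pod runs with.
kubectl get pods -n kube-system \
  -o custom-columns=NAME:.metadata.name,PRIORITY_CLASS:.spec.priorityClassName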

agolomoodysaada commented 5 years ago

After lots of back and forth with Azure support, we arrived at this workaround. I have yet to try it, as they fixed the issue on their end, but it might help someone else facing this.

Anyway, here's their message:

This usually means that tunnelfront cannot connect to tunnelend.

1. SSH to the agent node that is running the tunnelfront pod.
2. Get the tunnelfront logs: "docker ps" -> "docker logs <tunnelfront_container_id>".
3. Run "nslookup <ssh-server_fqdn>", where the FQDN can be taken from the command above. If it resolves to an IP, DNS works; go to the following step.
4. Run "ssh -vv azureuser@<ssh-server_fqdn> -p 9000". If the port is reachable, go to the next step.
5. Run "docker exec -it <tunnelfront_container_id> /bin/bash" and type "ping google.com". If there is no response, the tunnelfront pod has no external network; do the following step.
6. Restart kube-proxy using "kubectl delete po <kube-proxy_pod> -n kube-system", choosing the kube-proxy pod that is running on the same node as tunnelfront. You can find it with "kubectl get po -n kube-system -o wide".
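
The same steps, collected into one hedged sketch. The angle-bracket values are placeholders carried over from the support message, and the ping is written non-interactively rather than typed inside the container shell; the docker and ssh commands run on the agent node, the kubectl commands from any machine where kubectl still works:

# On the agent node that hosts the tunnelfront pod:
docker ps | grep tunnelfront                       # find <tunnelfront_container_id>
docker logs <tunnelfront_container_id>             # the logs contain <ssh-server_fqdn>
nslookup <ssh-server_fqdn>                         # DNS check
ssh -vv azureuser@<ssh-server_fqdn> -p 9000        # tunnel port reachability check
docker exec -it <tunnelfront_container_id> /bin/bash -c "ping -c 3 google.com"   # outbound connectivity

# From a machine where kubectl works: restart the kube-proxy pod on the same node as tunnelfront.
kubectl get po -n kube-system -o wide
kubectl delete po <kube-proxy_pod> -n kube-system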

P.S. Dear Azure team,

We should NOT close this issue, as the bug still occurs from time to time. This is not an acceptable workaround; it's a mitigation for those whose clusters are stuck and who cannot access logs, exec, or Helm deployments. We still need a permanent fix designed for failure of either tunnelfront or tunnelend.

It would be nice if you could also explain what tunnelfront and tunnelend are and how they work. Why are we, consumers of AKS, responsible for maintaining Azure's buggy workloads?

Starefossen commented 5 years ago

I created a new cluster after GA and now, all of a sudden, I'm getting a bunch of TLS handshake timeouts from AKS. This does not give the impression that AKS is anywhere near GA quality.

jaredallard commented 5 years ago

Yeah, we run into this frequently; AKS master node availability is terrible. It's constantly going down and timing out requests (nginx-ingress, even some of our applications that talk to k8s)... We don't run into any of these issues with GKE or kops environments. Not sure if this is anywhere near GA.

EDIT As I wrote this, our cluster has been unavailable for the last 20+ minutes saying "TLS handshake timeout". :unamused:

mcobzarenco commented 5 years ago

I set up a cluster with one node and wanted to investigate differences from our GCP deployment, essentially doing a dry run (our production deployment is on Google Cloud's Kubernetes, but we're doing an Azure deployment for a client).

However, it seems like all kubectl operations as well as az browse fail with:

net/http: TLS handshake timeout

az version: 2.0.46, kubectl: 1.9.7

klarose commented 5 years ago

I just had this happen to me. It appeared out of the blue, then went away a few hours later, after I had restarted my nodes a few times and killed most of my deployments. I'm not sure if that's what fixed it, or if whatever was truly causing the issue just went away.

Some notes from my investigation:

blackbaud-brandonstirnaman commented 5 years ago

Brand new cluster today... been online for just a few hours and TLS handshake timeouts. 👎

emirhosseini commented 5 years ago

Any update on this issue? We're still experiencing it

siyangy commented 5 years ago

We've been hitting this for a year, and the explanation earlier was that we were using a preview version of the AKS cluster. Now we've moved to a new cluster (supposedly created after GA) and are still seeing it. I think it's worth bumping the priority, as the issue has been around for a long while and is affecting a lot of folks.

ghost commented 5 years ago

I've found the solution!!!

4c74356b41 commented 5 years ago

and that is? @adamsem

ghost commented 5 years ago

Migrate to AWS :)

jnoller commented 5 years ago

Hi Everyone;

AKS has rolled out a lot of enhancements and improvements to mitigate this, including auto-detection of hung/blocked API servers, kubelets, and proxies. One of the final pieces is scaling up the master components to meet the overall workload against the master APIs.

This GitHub issue contains a lot of cluster-specific reports. As we cannot safely request the data for your accounts to do deeper introspection here on GitHub, I'd ask that you please file Azure technical support issues for diagnosis (these support issues get routed to our back-end on-call team as needed for resolution).

Additionally, the errors displayed can also correlate with underlying service updates in some cases (especially if you are seeing them randomly, for a limited amount of time). This will be helped by the autoscaling (increased master count) being worked on.

For issues that come up after I close this, please file new GitHub issues that include reproduction instructions that apply to any AKS cluster (i.e. general, not tied to your app or cluster). This will help support and engineering debug the problem.

sanojdev89 commented 5 years ago

Running "kubectl get pods --insecure-skip-tls-verify=true" gives the error below:

Unable to connect to the server: net/http: TLS handshake timeout
Build step 'Execute shell' marked build as failure

The command works on the Jenkins server itself but fails when run via a Jenkins job.