
Azure Kubernetes Service
https://azure.github.io/AKS/

Unable to connect to the server: net/http: TLS handshake timeout #14

davidjsanders closed this issue 5 years ago

davidjsanders commented 6 years ago

Hi, when I create an AKS cluster, I'm receiving a timeout on the TLS handshake. The cluster creates okay with the following commands:

az group create --name dsK8S --location westus2

az aks create \
  --resource-group dsK8S \
  --name dsK8SCluster \
  --generate-ssh-keys \
  --dns-name-prefix dasanderk8 \
  --kubernetes-version 1.8.1 \
  --agent-count 2 \
  --agent-vm-size Standard_A2

az aks get-credentials --resource-group dsK8S --name dsK8SCluster
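
For anyone reproducing this, a quick sanity check after get-credentials is to confirm which context was merged and whether the API server answers at all (a sketch; kubectl's global --request-timeout flag turns a hung TLS handshake into a fast, visible failure):

```shell
# Show the context az aks get-credentials just merged into ~/.kube/config
kubectl config current-context

# Probe the API server; a bounded timeout keeps a dead handshake from hanging
kubectl get nodes --request-timeout=30s
```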

The response from the create command is a JSON object:

{
  "id": "/subscriptions/OBFUSCATED/resourcegroups/dsK8S/providers/Microsoft.ContainerService/managedClusters/dsK8SCluster",
  "location": "westus2",
  "name": "dsK8SCluster",
  "properties": {
    "accessProfiles": {
      "clusterAdmin": { "kubeConfig": "OBFUSCATED" },
      "clusterUser": { "kubeConfig": "OBFUSCATED" }
    },
    "agentPoolProfiles": [
      {
        "count": 2,
        "dnsPrefix": null,
        "fqdn": null,
        "name": "agentpool1",
        "osDiskSizeGb": null,
        "osType": "Linux",
        "ports": null,
        "storageProfile": "ManagedDisks",
        "vmSize": "Standard_A2",
        "vnetSubnetId": null
      }
    ],
    "dnsPrefix": "dasanderk8",
    "fqdn": "dasanderk8-d55f0987.hcp.westus2.azmk8s.io",
    "kubernetesVersion": "1.8.1",
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": { "publicKeys": [ { "keyData": "OBFUSCATED" } ] }
    },
    "provisioningState": "Succeeded",
    "servicePrincipalProfile": {
      "clientId": "OBFUSCATED",
      "keyVaultSecretRef": null,
      "secret": null
    }
  },
  "resourceGroup": "dsK8S",
  "tags": null,
  "type": "Microsoft.ContainerService/ManagedClusters"
}

I've now torn down this cluster but this has happened three times today.

Any help?

David

davidjsanders commented 6 years ago

Update 11/03: I'm now able to create clusters successfully in westus2; however, I'm still getting TLS handshake errors:

az aks browse --resource-group *OBFUSCATED* --name *OBFUSCATED*
Merged "*OBFUSCATED*" as current context in /tmp/tmpB988cA
Proxy running on http://127.0.0.1:8001/
Press CTRL+C to close the tunnel...
error: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection refused

Are we still in the realm of capacity issues or is there another underlying issue here? This should work, right?

David

davidjsanders commented 6 years ago

Sometimes I should look before I write :)

I see the problem. The proxy is trying to connect to 10.240.0.4, which is the private IP of one of the agents and won't (and shouldn't) be reachable from the Internet. I'm guessing this is the underlying issue here.
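
For anyone else decoding the same message, the address and port in the dial error can be pulled apart with plain shell; the kubectl line at the end assumes working credentials, which is exactly what is intermittent here:

```shell
# The dial error names an address and port; extract them to see what the
# proxy was trying to reach (pure string handling, safe to run anywhere)
err="error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection refused"
addr=$(echo "$err" | grep -oE '[0-9]+(\.[0-9]+){3}:[0-9]+')
echo "$addr"

# 10.240.0.x is the agents' private VNet range, and 10250 is the kubelet's
# port, so this address is only reachable from inside the cluster's VNet.
# With working credentials, confirm it matches a node's INTERNAL-IP column:
kubectl get nodes -o wide
```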

amazaheri commented 6 years ago

+1. Originally this worked fine; I noticed the issue today when I deleted the cluster and tried to recreate it.

amazaheri commented 6 years ago

I get this regardless of using West US 2 or UK West:

~ amazaheri$ az aks browse -n mtcirvk8s -g mtcirvacs-rg
Merged "mtcirvk8s" as current context in /var/folders/sf/p87ql6z9271_1l7cp6hgt2d40000gp/T/tmpHZ_Er0
Proxy running on http://127.0.0.1:8001/
Press CTRL+C to close the tunnel...
error: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection refused

amazaheri commented 6 years ago

Looks like we are good now, thanks for all the work! QQ: I cannot connect to my cluster with the Cabin app using a token. The app shows the cluster as running, but I can't see any of the nodes, namespaces, etc. It looks like auth fails at some point. Thoughts? https://github.com/bitnami/cabin/issues/75

eirikm commented 6 years ago

I'm having the same problem in West US 2 at the moment:

$ kubectl get pods --all-namespaces
Unable to connect to the server: net/http: TLS handshake timeout

nyuen commented 6 years ago

same issue here on West US 2

krol3 commented 6 years ago

My AKS cluster is in West US 2, and I have the same issue.

kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout

az aks browse --resource-group xxxx-rg --name xxxx
Merged "XXXX" as current context in /tmp/tmpx6o89zj7
Unable to connect to the server: net/http: TLS handshake timeout
Command '['kubectl', 'get', 'pods', '--namespace', 'kube-system', '--output', 'name', '--selector', 'k8s-app=kubernetes-dashboard']' returned non-zero exit status 1.

davidjsanders commented 6 years ago

11/9: I'm still getting issues and have reverted back to an unmanaged cluster using ACS with Kubernetes as the orchestrator. I look forward to AKS becoming a little more stable.

twitchax commented 6 years ago

I am having these same issues!

krol3 commented 6 years ago

@dsandersAzure I did the same; I created it using ACS!

yejason commented 6 years ago

AKS is still in preview. For now, it seems West US 2 is unavailable, but ukwest is okay; we can create AKS clusters in ukwest now.

C:\Users\jason>az group create --name akss --location ukwest
{
  "id": "/subscriptions/xxxxxxx-222b-49c3-xxxx-xxxxx1e29a7b15/resourceGroups/akss",
  "location": "ukwest",
  "managedBy": null,
  "name": "akss",
  "properties": {
    "provisioningState": "Succeeded"
  },
  "tags": null
}

C:\Users\jason>az aks create --resource-group akss --name myK8sCluster --agent-count 1 --generate-ssh-keys
{
  "id": "/subscriptions/xxxxxxxx-222b-49c3-xxxx-0361e29axxxx/resourcegroups/akss/providers/Microsoft.ContainerService/managedClusters/myK8sCluster",
  "location": "ukwest",
  "name": "myK8sCluster",
  "properties": {
    "accessProfiles": {
      "clusterAdmin": {
        "kubeConfig": "YXBpVmVyc2lvbjogdjEKY2x1c3RlcnM6Ci0gY2x1c3RlcjoKICAgIGNlcnRpZmljYXRlLWF1dGhvcml0eS1kYXRhOiBMUzB0TFMxQ1JVZEpUaUJEUlZKVVNVWkpRMEZVUlMwdExTMHRDazFKU1VWNGVrTkRRWEVyWjBGM1NVSkJaMGxSWlhVMGVXRnBOekp3TlhadmNsUjRha2hMTldReGVrRk9RbWRyY1docmFVYzVkekJDUVZGelJrRkVRVTRLVFZGemQwTlJXVVJXVVZGRVJYZEthbGxVUVdWR2R6QjRUbnBGZUUxVVFYZE5WRlV4VFdwS1lVWjNNSGhQVkVWNFRWUkJkMDFVVlRGTmFrcGhUVUV3ZUFwRGVrRktRbWRPVmtKQlRWUkJiVTVvVFVsSlEwbHFRVTVDWjJ0eGFHdHBSemwzTUVKQlVVVkdRVUZQUTBGbk9FRk5TVWxEUTJkTFEwRm5SVUZ6TlRCRENsaGFNSEJCZWtJdlYxWnRjR1ZZTkhwaFRtZzVXRFJIVjIxWWFHTnpaelIyZVRWVGQxaDNVVTB2U1dkMWRGbGFVRzFUTjFCelVUUXJZazluWkZCWGVXSUtaREp6YWxSclJsVXZPRzVMYzJzM0sxcHhPRmxWTURFMFpVWkJXamx2UlRWNUsyRmhLMlZ
eivim commented 6 years ago

I believe capacity issues in ukwest are ongoing; hoping AKS will expand to other locations in Europe soon. I had a 1.7.7 cluster in ukwest that broke a couple of days ago. I attempted to recreate it today, but it is still in a bad state.

$ kubectl get pods -n kube-system
NAME                                    READY     STATUS             RESTARTS   AGE
heapster-b5ff6c4dd-dkkll                2/2       Running            0          46m
kube-dns-v20-6c8f7f988b-cb4cg           3/3       Running            0          46m
kube-dns-v20-6c8f7f988b-ztn5r           3/3       Running            0          46m
kube-proxy-thz9p                        1/1       Running            0          46m
kube-svc-redirect-qhwz6                 0/1       CrashLoopBackOff   13         46m
kubernetes-dashboard-7f7d9489fc-d9x7d   0/1       CrashLoopBackOff   12         46m
tunnelfront-xzjq8                       0/1       CrashLoopBackOff   13         46m

$ kubectl logs kube-svc-redirect-qhwz6 -n kube-system
Error from server: Get https://aks-agentpool1-28161470-0:10250/containerLogs/kube-system/kube-svc-redirect-qhwz6/redirector: dial tcp 10.240.0.4:10250: getsockopt: connection refused
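
When the log fetch fails like this, the broken hop is API server to kubelet, not the API server's own store. A hedged pair of follow-ups (pod name taken from the output above):

```shell
# 'kubectl logs' makes the API server dial the kubelet on the node
# (the 10.240.0.4:10250 in the error), which is what is failing here.
# 'kubectl describe' only reads events already stored in the API server,
# so it often still works and shows why the pod is crash-looping:
kubectl describe pod kube-svc-redirect-qhwz6 -n kube-system

# Once the kubelet is reachable again, --previous shows the output of
# the last crashed container instance:
kubectl logs kube-svc-redirect-qhwz6 -n kube-system --previous
```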

qmfrederik commented 6 years ago

So, provisioning in westuk gives me a cluster with crashing pods; provisioning in westus2 doesn't work at all:

Azure Container Service is unable to provision an AKS cluster in westus2, due to an operational threshold. Please try again later or use an alternate location. For more details please refer to: https://github.com/Azure/AKS/blob/master/preview_regions.md.

acesyde commented 6 years ago

Hi,

Same here today. I created an AKS 1.8.1 cluster on westeurope and it was okay, but one hour later I upgraded to 1.8.2 and since then:

Unable to connect to the server: net/http: TLS handshake timeout

kubectl 1.8.0 and 1.8.4 give the same error.

After that, I can't create a new AKS cluster in the westeurope location; the CLI returns this:

cmd : az aks create -n saceaks -g saceaks --location westeurope --kubernetes-version 1.8.1 --node-vm-size=Standard_DS1_V2 --node-count=2

cli error

Exception in thread AzureOperationPoller(b39cfa6a-a15e-49e4-9684-9cff4a0b579b):
Traceback (most recent call last):
  File "/opt/az/lib/python3.6/site-packages/msrestazure/azure_operation.py", line 377, in _start
    self._poll(update_cmd)
  File "/opt/az/lib/python3.6/site-packages/msrestazure/azure_operation.py", line 464, in _poll
    raise OperationFailed("Operation failed or cancelled")
msrestazure.azure_operation.OperationFailed: Operation failed or cancelled

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/az/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/opt/az/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/az/lib/python3.6/site-packages/msrestazure/azure_operation.py", line 388, in _start
    self._exception = CloudError(self._response)
  File "/opt/az/lib/python3.6/site-packages/msrestazure/azure_exceptions.py", line 148, in __init__
    self._build_error_data(response)
  File "/opt/az/lib/python3.6/site-packages/msrestazure/azure_exceptions.py", line 164, in _build_error_data
    self.error = self.deserializer('CloudErrorRoot', response).error
  File "/opt/az/lib/python3.6/site-packages/msrest/serialization.py", line 992, in __call__
    value = self.deserialize_data(raw_value, attr_desc['type'])
  File "/opt/az/lib/python3.6/site-packages/msrest/serialization.py", line 1143, in deserialize_data
    return self(obj_type, data)
  File "/opt/az/lib/python3.6/site-packages/msrest/serialization.py", line 998, in __call__
    return self._instantiate_model(response, d_attrs)
  File "/opt/az/lib/python3.6/site-packages/msrest/serialization.py", line 1090, in _instantiate_model
    response_obj = response(**kwargs)
  File "/opt/az/lib/python3.6/site-packages/msrestazure/azure_exceptions.py", line 59, in __init__
    self.message = kwargs.get('message')
  File "/opt/az/lib/python3.6/site-packages/msrestazure/azure_exceptions.py", line 105, in message
    value = eval(value)
  File "<string>", line 1, in <module>
NameError: name 'resources' is not defined

{
  "id": null,
  "location": null,
  "name": "e0ecdbcf-dffd-6b43-81fa-85f6517448a6",
  "properties": null,
  "tags": null,
  "type": null
}

kahootali commented 6 years ago

Having the same issue. I have two clusters, one in East US and the other in Central US. The Central US one works fine, but when I switch context to East US, it gives the error: Unable to connect to the server: net/http: TLS handshake timeout

relferreira commented 6 years ago

I'm having the same issue after downscaling my cluster in East US!

hanzenok commented 6 years ago

Hi everyone,

Having the same issue today on westeurope. And when I try to create a new cluster in this location, it gives an error: Deployment failed. Correlation ID: <id>. Azure Container Service is unable to provision an AKS cluster in westeurope, due to an operational threshold. Please try again later or use an alternate location. For more details please refer to: https://github.com/Azure/AKS/blob/master/preview_regions.md.

garystafford commented 6 years ago

Still an issue. Any resolution? This is the third running cluster in East US that I have lost the ability to communicate with. Doing an upgrade or scaling up the nodes does not work properly, which is a complete deal breaker when considering AKS. Either of these commands results in Unable to connect to the server: net/http: TLS handshake timeout. I've tried numerous commands, restarting nodes, etc. Nothing seems to recover cluster access.

Command to create:

az aks create `
  --name AKS-Cluster-VoterDemo `
  --resource-group RG-EastUS-AKS-VoterDemo `
  --node-count 1 `
  --generate-ssh-keys `
  --kubernetes-version 1.8.2

Perfectly healthy.

Command to scale up:

az aks scale `
  --name AKS-Cluster-VoterDemo `
  --resource-group RG-EastUS-AKS-VoterDemo `
  --node-count 3

Result: Unable to connect to the server: net/http: TLS handshake timeout

wtam commented 6 years ago

I encountered the same TLS handshake timeout issue after I manually scaled the node count from 1 to 2! My cluster is in Central US. What's wrong?

slack commented 6 years ago

Thanks for your patience through our preview.

We've had a few bugs in scale and upgrade paths that prevented the api-server from passing its health check after upgrade and/or scale. A number of bug fixes in this area went out over the last few weeks that have made upgrades more reliable.

Last week, for clusters in East US, we had an operational issue that impacted a number of older customer clusters between 12/11 13:00PST and 12/12 16:01PST.

Health and liveness of the api-server is now much better. If you haven't upgraded recently I'd recommend issuing az aks upgrade, even to the same kubernetes-version, as that will push the latest configuration to clusters. This rollout step is currently being automated and should be transparent in the future.
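
Spelled out, the suggested same-version upgrade might look like this (myRG and myCluster are placeholder names; the flags are the preview-era CLI and may differ in newer releases):

```shell
# Read back the version the cluster is currently running
current=$(az aks show --resource-group myRG --name myCluster \
  --query kubernetesVersion --output tsv)

# Re-issuing an upgrade to that same version re-pushes the latest AKS
# configuration to the cluster without changing Kubernetes itself
az aks upgrade --resource-group myRG --name myCluster \
  --kubernetes-version "$current"
```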

acesyde commented 6 years ago

@slack thank you, it works ;)

wtam commented 6 years ago

@slack Confirmed: upgrading the cluster to 1.8.2 gets kubectl connecting again.

aleksen commented 6 years ago

@slack Still having the same problem after upgrading to 1.8.2 in westeurope. Is there a problem in that region?

douglaswaights commented 6 years ago

After downgrading the CLI to 2.0.23 I was able to create the cluster, but after downloading the credentials I also have the same problem in westeurope...

kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout

Doing an az aks upgrade to 1.8.2 failed for me too, incidentally.

jakobli commented 6 years ago

Running into the same issue with a cluster in West Europe; the upgrade to 1.8.2 fails with: Deployment failed. Correlation ID: 858d3cf0-0d4e-417d-a2ee-22f627892e51. Operation failed with status: 200. Details: Resource state Failed

aleksen commented 6 years ago

@jakobli I got that error message when I hit my CPU quota. Are you sure you have extra D2 CPUs available? If I'm not wrong, AKS provisions new VMs before taking down the old ones.
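
The quota theory is easy to check before retrying; a sketch, with the region as an example:

```shell
# Upgrades and scale-ups briefly run old and new agent VMs side by side,
# so the region needs spare core quota; the table shows current use vs. limit
az vm list-usage --location westeurope --output table
```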

jakobli commented 6 years ago

@aleksen Thanks for the tip, but I checked we have loads of quota left for D2.

jakobli commented 6 years ago

So I tried deploying directly with version 1.8.2. The deploy went through without any issues, but kubectl get nodes still gets Unable to connect to the server: net/http: TLS handshake timeout

Karreg commented 6 years ago

The bug has been closed on GitHub but the issue is still there. None of the proposed fixes work in westeurope. Can someone reopen this issue? It has been pending for a month and there's no resolution 😞

D43m0n commented 6 years ago

I'm seeing the same issue too after deploying a new cluster (v1.7.7) in westeurope:

$ kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout

fox1t commented 6 years ago

Hi, same issue here. I just created a cluster using AKS in the westeurope region and I am unable to connect to it.

kubectl get no
Unable to connect to the server: net/http: TLS handshake timeout

Is someone actively looking into this?

tslavik commented 6 years ago

Interesting. I had the same problem with a cluster created yesterday. An hour ago I deleted the old one and created a new one (v1.8.1, westeurope) using the same service principal, and it works.

peskybp commented 6 years ago

The problem definitely still exists. I'm hitting it with a v1.8.6 cluster in eastus. I have seen it across numerous versions, and I can no longer justify simply "recreating the cluster" as a workaround.

$ kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout

Can we please get some visibility into the actual status of properly fixing this? It is very hard to justify using AKS for production when we don't have any idea when it is going to fail. Worse, there are sufficient functional differences between ACS and AKS to make swapping back and forth a non-starter as well...

rafnijs commented 6 years ago

Deployed AKS in the Azure region West Europe and I have the same problem:

.\kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout

tiborb commented 6 years ago

Suddenly having an issue when trying to use the CLI.

kubectl get pods
Unable to connect to the server: net/http: TLS handshake timeout

https://azure.microsoft.com/en-us/status/ status seems to be ok location westeurope

rafnijs commented 6 years ago

Yes, the same behavior. I deleted and recreated the cluster and it works fine now.

tiborb commented 6 years ago

@rafnijs Recreate the cluster? IMHO that solution seems a bit radical to me.

tiborb commented 6 years ago

az login sometimes solves the problem; it seems to be a random issue.

peskybp commented 6 years ago

@tiborb Completely agree. The only "solution" that has been touted is delete and recreate. Which would be fine if there was an easy way to transfer everything, but...

(1) The AKS master API being non-responsive means I can't even dump out the current cluster config, so there's no easy "migration" and surely no way to know when it will randomly fail.

(2) AKS uses a managed resource group, which ALSO GETS DELETED if I delete the existing cluster. That means any Managed Disks that were oh so pleasantly created for me get lost on cluster delete unless I manually migrate EACH ONE beforehand.

It's been about 5 days for me now in this state, where access returns intermittently. For those same 5 days I have had a support ticket open with Azure and have gotten basically nothing back except "I will check with the dev team". It's been nearly 48 hours since their last response, and I am thinking of just ditching the Azure-based implementation as it clearly isn't ready for real-world use.
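
On point (2), the disks can be rescued before deleting the cluster. A sketch, assuming the default MC_ naming for the managed node resource group and placeholder names throughout:

```shell
# AKS keeps the node resources in an auto-created group named
# MC_<resourceGroup>_<clusterName>_<region>; deleting the cluster deletes
# that group too, including any dynamically provisioned Managed Disks
az disk list --resource-group MC_myRG_myCluster_eastus --output table

# Moving the disks to a resource group you own lets them survive the
# cluster's deletion (disk IDs gathered with a JMESPath query)
az resource move \
  --destination-group mySafeRG \
  --ids $(az disk list --resource-group MC_myRG_myCluster_eastus \
            --query "[].id" --output tsv)
```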

fox1t commented 6 years ago

I spoke with MS Azure Italia on the phone 2 days ago and they pointed out that AKS is still in preview. When a service is flagged like that, it is still not production-ready and the support they offer isn't active. For the moment, the best way to spin up a Kubernetes cluster on Azure is ACS with the orchestrator set to Kubernetes. We have used it in production for a year and it has never had any problems at all.

tiborb commented 6 years ago

My orchestrator is Kubernetes version v1.8.1.

I suppose it should be stable.

fox1t commented 6 years ago

It is not a Kubernetes problem, even if it seems like one. It is AKS that is not handling cluster creation and networking correctly. As I said, just use ACS with Kubernetes instead of AKS and it will work like a charm.

fox1t commented 6 years ago

You can find the ACS CLI commands here: https://docs.microsoft.com/en-us/cli/azure/acs?view=azure-cli-latest

az acs create -g MyResourceGroup -n MyContainerService --orchestrator-type kubernetes --generate-ssh-keys

SurushS commented 6 years ago

I have had AKS running in westeurope for about 4 months now, running 1.8.1. Somehow this morning I lost all connectivity with all the running containers and with the dashboard through az aks browse. After doing an az aks upgrade to 1.8.7, everything is running normally again. It's a shame I don't know what caused this issue. Hopefully this stuff doesn't happen once it goes GA.

ghost commented 6 years ago

Hello guys, I have the same problem. It occurs on Windows 10 Pro, Docker CE Edge 18.02.0-ce-win52 (15372), when adding new services and pods to the cluster. After increasing memory for the Docker process, the problem was solved temporarily. Is it possible to add some checks (available memory, CPU, etc.) when adding new pods, deployments, and services to the cluster?

TonyGorman commented 6 years ago

Also happening on my cluster in EUW

Zehelein commented 6 years ago

Same for me - does not work anymore in EUW

alonisser commented 6 years ago

Same for me; it suddenly stopped connecting with the TLS error. West Europe, 1.8.7.

gvanderberg commented 6 years ago

Same here, I cannot connect to my AKS cluster anymore. Tried using PowerShell with the Azure CLI and Bash (Ubuntu) with the Azure CLI.

Managed to resolve the issue by running the below:

az aks upgrade --resource-group removed --name removed --kubernetes-version 1.8.1

It is worth noting that I upgraded to the same version my cluster was already on: 1.8.1 -> 1.8.1.

xxx:~$ kubectl config current-context
companyname
xxx:~$ kubectl proxy
Starting to serve on 127.0.0.1:8001
I0308 18:02:14.211741   18892 logs.go:41] http: proxy error: net/http: TLS handshake timeout
xxx:~$ kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout