Open arsnyder16 opened 2 years ago
I am able to reproduce this issue with 1.23.8, which uses Konnectivity, as well.
After some more investigation, nginx ingress is not actually required.
I have added workloads and a repro.sh script to the following repo https://github.com/arsnyder16/aks-api-issue
In summary,
After this, there seems to be no issue even after running these workloads for a day or more.
To start having issues with the API REST calls, all that needs to be added to the cluster is a Service of type LoadBalancer; it doesn't need to point to any actual backend pods. After adding the service, the original workloads will start having intermittent failures when calling the API server.
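For illustration (a minimal sketch, not necessarily the manifest used in the repro repo), a backend-less Service of type LoadBalancer can be created with:
# Hypothetical repro helper: a LoadBalancer Service whose selector matches no
# pods, so nothing is actually served, but Azure still provisions a public IP.
kubectl create service loadbalancer lb-repro --tcp=80:80
# Watch for the EXTERNAL-IP to be assigned by the cluster load balancer.
kubectl get service lb-repro --watch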
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
@arsnyder16 It would be useful if node-side issues could be eliminated as a possible cause. I notice that your test cluster has only two nodes, and they have only 2 cores each. That makes me wonder if those nodes are getting overloaded (presumably with kube-system stuff) and if that node-side overload is triggering the problems.
Would you be able to do the same test, but with
Docs on the above can be found here, in case you want to refer to them: https://learn.microsoft.com/en-us/azure/aks/use-system-pools
@JohnRusk Hey John, my repro steps are certainly the bare minimum and that's why there are only 2 nodes, but we see this issue in many different production clusters as well that have more nodes. So unfortunately node count does not seem related.
@JohnRusk I should also mention that I worked with Microsoft support at one point and we tried with 4-core machines as well. No luck.
I have yet to try running my workloads in a non-system pool though, so I can test that for you.
Thanks @arsnyder16 . That test will help us be sure that we're looking at the right things.
@arsnyder16 If that test also reproduces the issue, could you please mention that fact here and also email me directly. My GitHub profile says how to construct my email address.
@JohnRusk Yeah, no problem, just setting up everything you mentioned to cover all the bases:
az aks nodepool add --resource-group $rg --cluster-name $aks \
--name nodepoola \
--node-count 3 \
--mode System \
--node-osdisk-size 128 \
--node-osdisk-type Ephemeral \
--node-vm-size Standard_DS3_v2 \
--enable-encryption-at-host
az aks nodepool add --resource-group $rg --cluster-name $aks \
--name nodepoolb \
--node-count 3 \
--mode User \
--node-osdisk-size 128 \
--node-osdisk-type Ephemeral \
--node-vm-size Standard_DS3_v2 \
--enable-encryption-at-host
I am still able to reproduce the issue in both node pools, as described in the previous post. One thing to note: my cluster didn't originally have nginx ingress installed and the workload ran fine for 15 hours; within an hour of adding the ingress (with zero replicas) the issue started. To me this points to the public IP being provisioned, perhaps related to the load balancer for the cluster.
helm upgrade --install nginx ingress-nginx/ingress-nginx \
--create-namespace \
--namespace ingress \
--set controller.replicaCount=0 \
--set controller.service.externalTrafficPolicy=Local \
--wait
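To confirm the load balancer public IP actually gets provisioned, you can watch for the EXTERNAL-IP on the controller Service (service name assumed from the ingress-nginx chart's default naming with release name nginx):
# Wait for an EXTERNAL-IP to appear; provisioning the Azure public IP is the
# event that seems to correlate with the API timeouts starting.
kubectl get service --namespace ingress nginx-ingress-nginx-controller --watch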
Interesting... Thanks for your email. I just sent a reply.
Progress update: we've identified that the delays are happening during the TLS Handshake.
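For anyone else wanting to check whether their slow requests are spending their time in the handshake, curl's timing write-out variables give a rough read from inside a pod (an illustrative probe, not the exact tooling we used):
# time_connect is the TCP connect time; time_appconnect includes the TLS handshake.
# The request may return 401/403 without a token, but the handshake timing is still measured.
curl -sk -o /dev/null \
  -w 'connect=%{time_connect}s tls_done=%{time_appconnect}s total=%{time_total}s\n' \
  https://10.0.0.1:443/version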
@JohnRusk Could you please share some more information on the issue you have identified? We could be bumping into the same issue in one of our own clusters, where several requests towards the AKS API server fail with a TLS handshake timeout, causing frequent failures in services running in the cluster:
F0929 08:31:27.272615 1 config.go:37] Get "https://10.0.0.1:443/api?timeout=32s": net/http: TLS handshake timeout
Update: We were able to also reproduce the error using the Job manifest provided above, so it seems likely that it is indeed the same issue:
I0929 15:40:10.099624 323 round_trippers.go:466] curl -v -XGET -H "Accept: application/json" -H "User-Agent: kubectl/v1.25.2 (linux/amd64) kubernetes/5835544" -H "Authorization: Bearer <masked>" 'https://10.0.0.1:443/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-api&limit=500'
I0929 15:40:10.101313 323 round_trippers.go:510] HTTP Trace: Dial to tcp:10.0.0.1:443 succeed
I0929 15:40:20.102170 323 round_trippers.go:553] GET https://10.0.0.1:443/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-api&limit=500 in 10002 milliseconds
I0929 15:40:20.102204 323 round_trippers.go:570] HTTP Statistics: DNSLookup 0 ms Dial 1 ms TLSHandshake 10000 ms Duration 10002 ms
I0929 15:40:20.102212 323 round_trippers.go:577] Response Headers:
{
"apiVersion": "v1",
"items": [],
"kind": "List",
"metadata": {
"resourceVersion": ""
}
}
I0929 15:40:20.102511 323 helpers.go:264] Connection error: Get https://10.0.0.1:443/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-api&limit=500: net/http: TLS handshake timeout
Unable to connect to the server: net/http: TLS handshake timeout
@klolos We have not found the root cause yet.
FYI, in the examples I've looked at, the issue is transient. E.g. if the same operation is retried immediately after the failure, it will succeed, i.e. it will successfully create a new connection and do a successful handshake. Is that what you see? Or do you see the problem happening over and over again in a short period of time?
@JohnRusk Yes, the error seems transient. We are running recurring Kubeflow pipelines (which use Argo Workflows), and once every few runs a pipeline step will fail because of this. Subsequent runs may succeed, so the error seems to be happening randomly.
Thanks @klolos . Your issue sounds like it could indeed be the same as what we are looking at in this GitHub issue. We have gathered a lot of information in discussion with @arsnyder16. The root cause does not seem to be in the API Server itself (my area of expertise) so I'm reaching out to some colleagues here at AKS. It's a tricky issue, so the work may take some time.
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
We seem to be facing the same issue in one of our AKS clusters in the location westeurope.
@JohnRusk Did you get any more information on this issue?
@arsnyder16 Thank you for such a nice BUG report
@JohnRusk We are experiencing the same issue: running within the cluster, kubectl sometimes bails out with Unable to connect to the server: net/http: TLS handshake timeout after 10 seconds. API server availability is set to 99.5%. There are no logs from the API server that indicate any kind of restart; the same API server pods have been running for 15 days and have produced more than 30 of these errors.
The connection is made to 10.0.0.1, which has a single endpoint: the public IP of the API server. It would be very strange if anything in the cluster could have a decisive effect on establishing a TLS session with an external public IP. Thus it looks like a problem in the part of the AKS API service that connects the API server pods (run by Azure) with the client (kubectl).
AKS: 1.24.6
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:36:36Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.6", GitCommit:"6c23b67c202a4cfa7c76c3e1b370bd5f0e654f30", GitTreeState:"clean", BuildDate:"2022-11-09T17:13:23Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}
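For anyone who wants to verify the single-endpoint claim above, it can be checked from inside the cluster (a minimal sketch; assumes your RBAC role allows reading the default kubernetes Service and its endpoints):
# The default "kubernetes" Service (ClusterIP 10.0.0.1) maps to a single
# endpoint: the API server's address outside the cluster.
kubectl get service kubernetes --namespace default
kubectl get endpoints kubernetes --namespace default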
Our networking experts are looking into this. Due to the nature of the issue, it will take a bit more time, I'm sorry to say.
I'm experiencing the same issue. A support ticket has been created.
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
@JohnRusk Any update on the issue? We are also experiencing it on a regular basis.
@Alexander-Bartosh Are your timeouts specifically related to TLS handshake failures, and are they rare (e.g. less than 1% of all handshakes fail)? If so, then yes, they sound a lot like this issue. If not, they may be something different.
In terms of progress, one of my colleagues has made good progress recently on what has turned out to be an unexpectedly complex issue. I don't have any dates to share at this stage.
cc @arsnyder16 ^
@JohnRusk Thanks for the update. Just for clarity: it sounds as though your colleague has been able to replicate this reliably (maybe not frequently) in their environment in order to work on a potential resolution? If so, that is certainly more progress than previously reported and is a good sign.
They have actually set up a repro which is reliable and frequent, by changing a few things, including some internal parameters (basically, forcing the problem to happen). As far as we can tell, your version of the problem appears to be simply a lower frequency version of what my colleague is now able to reproduce. And he has a solution. It needs internal review, implementation and rollout - which will take some time. I don't have any dates to share, I'm sorry.
It’s great that it seems to have been isolated and solutions have been identified.
Totally understandable on the time frame.
We've been experiencing TLS handshake timeout issues for a few months, with each occurrence lasting a couple of minutes. At first it was quite rare, but around the end of January it became more frequent. We got in touch with GitLab Support, as we were experiencing the issue in our CI/CD pipeline, but were not able to figure it out.
It eventually became less frequent, but it has been picking up again. I just retried a CI job 8 times over the span of half an hour before it succeeded.
So for now we just have to wait for the fix to be ready? It's been quite cumbersome dealing with this, especially for developers who don't quite understand why their CI pipeline fails.
@locomoco28 We had similar problems with our CI jobs, but as mentioned in https://github.com/Azure/AKS/issues/3047#issuecomment-1196764723 it only happens if a LoadBalancer is connected to the cluster. I verified it by deleting our ingress load balancer, and the issue disappeared. We've since moved our CI runners to a cluster without a LoadBalancer, and since then I consider this issue mitigated. This issue is costing us extra money, so I'm looking forward to the fix.
Yes, I saw the comment, but I do not want to provision a whole new cluster solely for the purpose of running CI jobs, as I already have a cluster just for tools. I'd much rather host the CI jobs in the same cluster where other tools like Gitlab are already hosted.
We have always experienced some kind of TLS timeout with the AKS API, but for the last few weeks it has been really painful. I would say that about 10-15% of queries are timing out. And this is from workloads running inside the cluster itself.
net/http: TLS handshake timeout on https://10.0.0.1:443/api/v1
Any news regarding the fix?
@uncycler The rate you mention, of 10 to 15%, is about 100x higher than what I've seen with the TLS handshake bug. I wonder if yours may have a different cause, or else it's just a very severe case of the issue.
One question: what language is your client app written in? (Asking because we've noticed that the Go Kubernetes client has a retry policy for TLS handshakes, but I'm not sure whether other K8s clients have the same thing.)
I see those TLS handshake errors with Python, Go, and Java. Most of those clients retry requests, so the issue is mostly mitigated. For example, I have a Python app running inside the cluster that watches resources (using the official kube client), and the watch has to be restarted about every 5-10 minutes.
But I'm doing deployments with Terraform inside the cluster (for backend storage as well) and Helm (the lookup function), and neither will retry if the API is not accessible.
Worst of all are the gitlab-runners, as others mentioned. The runner does not retry requests after k8s API errors, so the whole job needs to be retried. And for the last few weeks, quite often, the retried job just hangs (not sure whether that is only related to the API timeouts).
So the 10-15% comes from there, since we need to constantly retry jobs. And it's way worse than it used to be.
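As a stopgap for callers that don't retry on their own, a crude wrapper can mask the transient handshake failures; a minimal sketch in bash (illustrative only, not what any of the tools above actually do):
# Retry a kubectl call a few times; the handshake failures are transient,
# so an immediate retry almost always succeeds.
for attempt in 1 2 3; do
  kubectl get pods --namespace default && break
  echo "attempt ${attempt} failed, retrying..." >&2
  sleep 2
done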
We are encountering TLS handshake errors and "unexpected EOF" errors when making calls to the Kubernetes API. We have implemented the Kubernetes Retry Policy in our apps, which has mitigated the TLS handshake issue. However, despite adding a retry pattern to the GitLab Runner (which is also written in Golang), we are still experiencing "unexpected EOF" errors.
ERROR: Job failed (system failure): prepare environment: error sending request: Post "https://10.244.0.1:443/api/v1/namespaces/gitlab/pods/runner-bzyjak2a-project-102-concurrent-16c49v5/attach?container=helper&stdin=true": unexpected EOF.
ERROR: Job failed (system failure): prepare environment: error sending request: Post "https://10.244.0.1:443/api/v1/namespaces/gitlab/pods/runner-5x5g6zpk-project-96-concurrent-59p72qh/exec?command=sh&command=-c&command=if+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbusybox%2Fsh+%5D%3B+then%0A%09exec+%2Fbusybox%2Fsh+%0Aelse%0A%09echo+shell+not+found%0A%09exit+1%0Afi%0A%0A&container=helper&container=helper&stderr=true&stdin=true&stdout=true": unexpected EOF. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
Job failed (system failure): prepare environment: setting up trapping scripts on emptyDir: error sending request: Post "https://10.244.0.1:443/api/v1/namespaces/gitlab/pods/runner-21sqfvz-project-318-concurrent-1ftx2g/exec?command=sh&command=-c&command=if+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbusybox%2Fsh+%5D%3B+then%0A%09exec+%2Fbusybox%2Fsh+%0Aelse%0A%09echo+shell+not+found%0A%09exit+1%0Afi%0A%0A&container=build&container=build&stderr=true&stdin=true&stdout=true": unexpected EOF. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
Could you please provide an estimated time frame for the resolution of this issue? The intermittent failure of the pipeline is causing significant inconvenience and confusion for developers.
@uncycler: the constantly hanging jobs might be this bug in GitLab when you have the retry keyword in the pipeline.
Hi @JohnRusk,
Would you be able to share any updates on the fix and an estimated timeline for its availability? We have an open support case related to this issue, and we'd be more than willing to test and provide feedback on the solution once it's ready.
Thanks for your assistance!
@marcelhintermann I'm following up with a colleague regarding your question.
@marcelhintermann I've heard back from my colleague. Work is underway. There are several different parts to the work. We don't have exact timelines to share at this stage, I'm sorry. I will check again with my colleague later this month.
We are also seeing this same issue on our AKS cluster. Lots of ERROR: Job failed (system failure): error sending request: Post "https://10.2.0.1:443/api/....": unexpected EOF type failures. Our nodes are on v1.24.10 and we use multiple pools.
@nat45928 I'm not certain that what you're seeing is the same issue. This thread is focussed on TLS handshake errors; your error example is "unexpected EOF". That could be something after the handshake stage; it could in fact be an EOF while awaiting a result from the operation. If that's the case, it almost certainly has a different root cause than what we are looking at here, I'm sorry.
On the other hand, if you have some evidence that the EOF is happening during the TLS handshake phase, then in that case it might be what we're looking at here.
To bump this issue: we are experiencing a similar issue after upgrading AKS to 1.24.9:
ERROR: Job failed (system failure): error sending request: Post "https://10.xx.0.1:443/api
or
ERROR: Job failed (system failure): prepare environment: setting up build pod: Get "https://10.xx.0.1:443/version": http2: client connection lost.
around 100 events in 24h in our logging system
Hi @JohnRusk, is there an update on when a fix for this can be expected? We still have these problems and they still impact our customers. Thanks in advance.
Marcel, I haven't been directly involved in our work on this recently. I've asked one of our key folks to update this thread when there's news to share.
Hi @JohnRusk, sorry to bother you with this again: is there anyone else who can help with this issue? Our customers are frustrated because of it and we need a solution soon.
Maybe this can give some insight: we are using node-problem-detector (https://github.com/kubernetes/node-problem-detector) in our clusters, and it is quite verbose about these timeouts. All our clusters show the same pattern:
E0426 04:33:10.268093 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 04:48:22.268583 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 05:13:34.268437 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 05:53:49.268865 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 06:29:04.272185 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 07:04:17.268565 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 07:44:31.269006 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 07:59:43.267919 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 11:10:15.268451 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 11:30:28.269038 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 14:00:56.268577 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 16:21:19.268940 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 19:11:47.269095 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 19:16:58.269021 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 19:42:11.268324 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 20:52:27.273455 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 21:02:39.268515 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 21:47:55.269623 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 23:18:17.269093 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 23:38:29.269256 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
E0426 23:48:40.268001 1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout
@marcelhintermann I'm no longer involved in this myself, as mentioned above. I'll see if I can get someone else to reply here.
Describe the bug
Requests from cluster workloads to the Kubernetes API server will intermittently time out or take minutes to complete, depending on the workload's request settings.
To Reproduce
Steps to reproduce the behavior:
Provision a new SLA-enabled cluster.
This may be optional but might help reproduce the problem: install nginx ingress; you can just leave the replicas at 0 to avoid adding any more noise to the cluster.
Deploy a simple workload that just uses kubectl to list the pods in a namespace (see the sketch below); these jobs will fail once they detect the issue.
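For illustration only (not the exact manifest from the repro repo), such a workload can be as simple as a pod that loops over a verbose kubectl list:
# Hypothetical in-cluster probe: list pods repeatedly with verbose client logging
# so TLS handshake timing shows up in the round_trippers output.
while true; do
  kubectl get pods --namespace default -v=7 || echo "API call failed at $(date)" >&2
  sleep 10
done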
With the kubectl example above, the issue manifests as a timeout trying to do the TLS handshake. What is strange about the kubectl log output is that it does seem to have the response body, but it is shown as a header.
Here is an example of a successful run
Here is an example of one that fails just a few minutes later
I experience different behaviors with different clients. For example, I have a simple Node.js app that does the same thing by just listing the pods through the k8s SDK. In that environment I get situations where the requests take upwards of 5 minutes to complete.
Expected behavior
Requests should be able to complete in a reasonable amount of time. I see this across many clusters, sometimes every few minutes. To eliminate all cluster-specific variables, this is a bare-bones replication of the issue, so it should not suffer from user workloads affecting performance.
Environment (please complete the following information):
Additional context
This seemed to start once we upgraded clusters from 1.20 to 1.21. I first opened a request ticket with support in January, but it has since been in the support death spiral, has gotten nowhere, and has yet to reach a team able to diagnose or even attempt to reproduce with the simple steps above. I have sent tcpdumps, kubelet logs, etc.
This is not specific to any particular request; we see it across many different requests. We have various workloads that may monitor the cluster using the API or dynamically create or modify workloads through the API.
I have yet to be able to reproduce this outside of the cluster; it seems to be very specific to cluster-to-control-plane communication.
This only seems to be a problem on SLA-enabled clusters. Openvpn/aks-link issue? I don't see any recycling of aks-link or anything useful in the logs.
I am really curious whether Konnectivity resolves the problem, but I have yet to see it make it to any of my various clusters, which are spread across many different data centers.