Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] Cluster workloads to the kubernetes API server will intermittently timeout or takes minutes to complete #3047

Open arsnyder16 opened 2 years ago

arsnyder16 commented 2 years ago

Describe the bug: Requests from cluster workloads to the Kubernetes API server will intermittently time out or take minutes to complete, depending on the workload's request settings.

To Reproduce: Steps to reproduce the behavior. Provision a new SLA-enabled cluster:

rg=            # resource group name
acr=           # name of an existing Azure Container Registry to attach
aks=           # AKS cluster name
workspaceId=   # Log Analytics workspace resource ID (for the monitoring addon)
location=northcentralus
aksVersion=1.22.6
az aks create \
  --resource-group $rg \
  --name $aks \
  --vm-set-type VirtualMachineScaleSets \
  --node-count 2 \
  --node-vm-size Standard_DS2_v2 \
  --node-osdisk-size 64 \
  --node-osdisk-type Ephemeral \
  --generate-ssh-keys \
  --kubernetes-version $aksVersion \
  --attach-acr $acr \
  --load-balancer-sku standard \
  --location $location \
  --enable-managed-identity \
  --uptime-sla \
  --enable-encryption-at-host \
  --enable-addons monitoring \
  --workspace-resource-id $workspaceId

This step may be optional but might help reproduce the problem: install nginx ingress. You can leave the replica count at 0 to avoid adding any more noise to the cluster.

helm upgrade --install nginx ingress-nginx/ingress-nginx \
  --create-namespace \
  --namespace ingress \
  --set controller.replicaCount=0 \
  --set controller.service.externalTrafficPolicy=Local \
  --wait

Deploy a simple workload that just uses kubectl to list the pods in a namespace; these jobs will fail once they detect the issue.

kubectl create namespace api-issue
kubectl apply -f - << EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: slowrequest
  namespace: api-issue
spec:
  parallelism: 5
  completions: 5
  backoffLimit: 2
  template:  
    metadata:
      labels:
        app: slowrequest
    spec:
      restartPolicy: Never
      containers:
      - name: slowrequest
        image: bitnami/kubectl
        imagePullPolicy: IfNotPresent   
        command: 
          - /bin/sh 
        args: 
          - -c
          - set -e; while true; do kubectl get pods -n=default --selector=app=my-api --v=9 --output=json; sleep 260s; done
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: internal-kubectl-role
rules:
- apiGroups: [""]
  resources: ["pods", "pods/status"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: internal-kubectl-role-binding
subjects:
- kind: ServiceAccount
  name: default
  namespace: api-issue
roleRef:
  kind: ClusterRole
  name: internal-kubectl-role
  apiGroup: rbac.authorization.k8s.io
EOF

With the kubectl example above, the issue manifests as a timeout during the TLS handshake. What is strange about the kubectl log output is that it does seem to have a response body, but it is shown under the Response Headers line.

Here is an example of a successful run:

I0503 15:02:34.479797       7 round_trippers.go:466] curl -v -XGET  -H "Accept: application/json" -H "User-Agent: kubectl/v1.23.6 (linux/amd64) kubernetes/ad33385" -H "Authorization: Bearer <masked>" 'https://10.0.0.1:443/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-api&limit=500'
I0503 15:02:34.504364       7 round_trippers.go:570] HTTP Statistics: GetConnection 0 ms ServerProcessing 24 ms Duration 24 ms
I0503 15:02:34.504383       7 round_trippers.go:577] Response Headers:
I0503 15:02:34.504390       7 round_trippers.go:580]     Date: Tue, 03 May 2022 15:02:34 GMT
I0503 15:02:34.508215       7 round_trippers.go:580]     Audit-Id: 0a8166f8-d538-4b02-8e1b-4fd050760999
I0503 15:02:34.508239       7 round_trippers.go:580]     Cache-Control: no-cache, private
I0503 15:02:34.508298       7 round_trippers.go:580]     Content-Type: application/json
I0503 15:02:34.508314       7 round_trippers.go:580]     X-Kubernetes-Pf-Flowschema-Uid: fae6b7ca-e682-4669-94eb-85163e201928
I0503 15:02:34.508319       7 round_trippers.go:580]     X-Kubernetes-Pf-Prioritylevel-Uid: 2a307f0a-f367-4f1d-ba5c-1bc4c330d0f1
I0503 15:02:34.508325       7 round_trippers.go:580]     Content-Length: 91
I0503 15:02:34.508374       7 request.go:1181] Response Body: {"kind":"PodList","apiVersion":"v1","metadata":{"resourceVersion":"251974943"},"items":[]}
{
    "apiVersion": "v1",
    "items": [],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}

Here is an example of a run that fails just a few minutes later:

I0503 15:06:54.560888      14 round_trippers.go:466] curl -v -XGET  -H "Accept: application/json" -H "User-Agent: kubectl/v1.23.6 (linux/amd64) kubernetes/ad33385" -H "Authorization: Bearer <masked>" 'https://10.0.0.1:443/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-api&limit=500'
I0503 15:06:54.562933      14 round_trippers.go:510] HTTP Trace: Dial to tcp:10.0.0.1:443 succeed
I0503 15:07:04.564088      14 round_trippers.go:570] HTTP Statistics: DNSLookup 0 ms Dial 1 ms TLSHandshake 10000 ms Duration 10003 ms
I0503 15:07:04.564131      14 round_trippers.go:577] Response Headers:
{
    "apiVersion": "v1",
    "items": [],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}
I0503 15:07:04.565727      14 helpers.go:237] Connection error: Get https://10.0.0.1:443/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-api&limit=500: net/http: TLS handshake timeout
F0503 15:07:04.566119      14 helpers.go:118] Unable to connect to the server: net/http: TLS handshake timeout
goroutine 1 [running]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0x1)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1038 +0x8a
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).output(0x3080040, 0x3, 0x0, 0xc000620000, 0x2, {0x25f2ec7, 0x10}, 0xc00005c400, 0x0)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:987 +0x5fd
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).printDepth(0xc0002e1a40, 0x40, 0x0, {0x0, 0x0}, 0x54, {0xc00032b470, 0x1, 0x1})
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:735 +0x1ae
k8s.io/kubernetes/vendor/k8s.io/klog/v2.FatalDepth(...)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1518
k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util.fatal({0xc0002e1a40, 0x40}, 0xc0007664e0)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util/helpers.go:96 +0xc5
k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util.checkErr({0x1fed760, 0xc0007664e0}, 0x1e797d0)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util/helpers.go:191 +0x7d7
k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util.CheckErr(...)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util/helpers.go:118
k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/get.NewCmdGet.func2(0xc0003e8780, {0xc000494460, 0x1, 0x5})
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/get/get.go:181 +0xc8
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute(0xc0003e8780, {0xc000494410, 0x5, 0x5})
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:860 +0x5f8
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc00056cf00)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:974 +0x3bc
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute(...)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/vendor/k8s.io/component-base/cli.run(0xc00056cf00)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/component-base/cli/run.go:146 +0x325
k8s.io/kubernetes/vendor/k8s.io/component-base/cli.RunNoErrOutput(...)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/component-base/cli/run.go:84
main.main()
    _output/dockerized/go/src/k8s.io/kubernetes/cmd/kubectl/kubectl.go:30 +0x1e

I experience different behaviors with different clients. For example, I have a simple Node.js app that does the same thing by just listing the pods through the Kubernetes SDK. In this environment I will get situations where the requests take upwards of 5 minutes to complete:

'use strict';
const process = require('process');
const k8s = require('@kubernetes/client-node');
const kc = new k8s.KubeConfig();
kc.loadFromCluster();
const k8Core = kc.makeApiClient(k8s.CoreV1Api);
let startTime; // custom log offset to help correlate with tcp dump
const log = msg => console.log(`${(new Date() - startTime) / 1000.0} ${msg}`);
let running;
let interval;
const listPods = async ()=>{
  if (running) {
    return;
  }
  running = true;
  log('Listing pods...');
  const listStart = new Date();
  const { body: { items } } = await k8Core.listNamespacedPod('default');
  const seconds = (new Date() - listStart) / 1000.0;
  log(`Found ${items.length} pods in ${seconds} seconds`);
  if(seconds > 60) {
    log(`Closing because this seems excessive`);
    process.exitCode = -1;
    clearInterval(interval);
    return;
  }
  running = false;
};
setTimeout(()=>{
  startTime = new Date();
  listPods();
  interval = setInterval(listPods, 215 * 1000);
}, 1000);

Example output when the issue occurs:

2022-06-28T02:04:30.490394199Z 36564.257 Listing pods...
2022-06-28T02:04:30.525501046Z 36564.292 Found 0 pods in 0.035 seconds
2022-06-28T02:08:05.550325353Z 36779.317 Listing pods...
2022-06-28T02:13:37.847998556Z 37111.614 Found 0 pods in 332.297 seconds  <--- Not great
2022-06-28T02:13:37.848028156Z 37111.614 Closing because this seems excessive

Expected behavior: Requests should complete in a reasonable amount of time. I see this across many clusters, sometimes every few minutes. To eliminate all cluster-specific variables, this is a bare-bones reproduction of the issue, so it should not suffer from user workloads affecting performance.

Environment (please complete the following information):

Additional context: This seemed to start once we upgraded clusters from 1.20 to 1.21. I first opened a request ticket with support in January, but it has since been stuck in the support death spiral and has gone nowhere; it has yet to reach a team able to diagnose the issue or even attempt to reproduce it with the simple steps above. I have sent tcpdumps, kubelet logs, etc.

This is not specific to any particular request; we see it across many different requests. We have various workloads that monitor the cluster using the API or dynamically create or modify workloads through the API.

I have yet to reproduce this outside of the cluster; it seems to be very specific to cluster-to-control-plane communication.

This only seems to be a problem on SLA-enabled clusters. Could it be an OpenVPN or aks-link issue? I don't see any recycling of aks-link or anything useful in the logs.
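
For anyone wanting to rule that out on their own cluster, the tunnel components run in kube-system, so restart counts can be spot-checked with something like the following (the component names vary by cluster configuration, so the grep pattern is just a guess at the usual suspects):

kubectl get pods -n kube-system -o wide | grep -E 'aks-link|tunnelfront|konnectivity'
kubectl get events -n kube-system --sort-by=.lastTimestamp | tail -n 20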

I am really curious whether Konnectivity resolves the problem, but I have yet to see it roll out to any of my various clusters, which are spread across many different data centers.

arsnyder16 commented 2 years ago

I am able to reproduce this issue with 1.23.8, which uses Konnectivity, as well.

arsnyder16 commented 2 years ago

After some more investigation, nginx ingress is not actually required.

I have added workloads and a repro.sh script to the following repo https://github.com/arsnyder16/aks-api-issue

In summary:

- With just these basic workloads running, there seems to be no issue, even after running them for a day or more.
- To start seeing issues with the API REST calls, all that needs to be added to the cluster is a Service of type LoadBalancer; it does not need to point to any actual backend pods. After adding the Service, the original workloads will start having intermittent failures on their calls to the API server.
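
For reference, a Service of roughly this shape is enough; the name and selector below are illustrative placeholders rather than anything taken from the linked repo:

kubectl apply -n api-issue -f - << EOF
apiVersion: v1
kind: Service
metadata:
  name: lb-trigger            # placeholder name
spec:
  type: LoadBalancer          # provisions a frontend on the cluster's Azure load balancer plus a public IP
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: does-not-exist       # intentionally matches no pods; no backend is required to trigger the issue
EOF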

ghost commented 2 years ago

Action required from @Azure/aks-pm

ghost commented 2 years ago

Issue needing attention of @Azure/aks-leads

JohnRusk commented 2 years ago

@arsnyder16 It would be useful if node-side issues could be eliminated as a possible cause. I notice that your test cluster has only two nodes, and they have only 2 cores each. That makes me wonder if those nodes are getting overloaded (presumably with kube-system stuff) and if that node-side overload is triggering the problems.

Would you be able to do the same test, but with

Docs on the above can be found here, in case you want to refer to them: https://learn.microsoft.com/en-us/azure/aks/use-system-pools

arsnyder16 commented 2 years ago

@JohnRusk Hey John, my repro steps are certainly the bare minimum and that's why there are only 2 nodes, but we see this issue on many different production clusters as well that have more nodes. So unfortunately node count does not seem related.

arsnyder16 commented 2 years ago

@JohnRusk I should also mention that I worked with Microsoft support at one point and we tried with 4-core machines as well. No luck.

I have yet to try running my workloads in a non-system pool though, so I can test that for you.

JohnRusk commented 2 years ago

Thanks @arsnyder16 . That test will help us be sure that we're looking at the right things.

JohnRusk commented 2 years ago

@arsnyder16 If that test also reproduces the issue, could you please mention that fact here and also email me directly. My GitHub profile says how to construct my email address.

arsnyder16 commented 2 years ago

@JohnRusk Yeah, no problem. Just setting up everything you mentioned to cover all the bases:

az aks nodepool add --resource-group $rg --cluster-name $aks \
  --name nodepoola \
  --node-count 3 \
  --mode System \
  --node-osdisk-size 128 \
  --node-osdisk-type Ephemeral \
  --node-vm-size Standard_DS3_v2  \
  --enable-encryption-at-host 

az aks nodepool add --resource-group $rg --cluster-name $aks \
  --name nodepoolb \
  --node-count 3 \
  --mode User \
  --node-osdisk-size 128 \
  --node-osdisk-type Ephemeral \
  --node-vm-size Standard_DS3_v2  \
  --enable-encryption-at-host 
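
To steer the repro Job onto one pool or the other, a nodeSelector on the AKS agentpool node label can be added under .spec.template.spec of the Job manifest from the original report (pool names as created above):

      nodeSelector:
        agentpool: nodepoolb   # or nodepoola to target the new system pool

Pod placement can then be confirmed with kubectl get pods -n api-issue -o wide.
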
arsnyder16 commented 2 years ago

I am still able to reproduce the issue in both node pools as described in the previous post. One thing to note: my cluster didn't have nginx ingress installed originally, and the workload ran fine for 15 hours; within an hour of adding the ingress (with zero replicas) the issue started. This seems to me to point to the public IP being provisioned, maybe related to the load balancer for the cluster?

helm upgrade --install nginx ingress-nginx/ingress-nginx \
  --create-namespace \
  --namespace ingress \
  --set controller.replicaCount=0 \
  --set controller.service.externalTrafficPolicy=Local \
  --wait
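
The load balancer and public IP that get provisioned for this Service live in the cluster's node resource group, so they can be inspected with something like the following (variable names as in the create command above):

nodeRg=$(az aks show --resource-group $rg --name $aks --query nodeResourceGroup -o tsv)
kubectl get svc -n ingress                                   # shows the EXTERNAL-IP assigned to the controller Service
az network public-ip list --resource-group $nodeRg -o table
az network lb list --resource-group $nodeRg -o table
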
JohnRusk commented 2 years ago

Interesting... Thanks for your email. I just sent a reply.

JohnRusk commented 2 years ago

Progress update: we've identified that the delays are happening during the TLS Handshake.

klolos commented 2 years ago

Progress update: we've identified that the delays are happening during the TLS Handshake.

@JohnRusk Could you please share some more information on the issue you have identified? We could be bumping into the same issue in one of our own clusters, where several requests towards the AKS API server fail with a TLS handshake timeout, causing frequent failures in services running in the cluster:

F0929 08:31:27.272615       1 config.go:37] Get "https://10.0.0.1:443/api?timeout=32s": net/http: TLS handshake timeout

Update: We were also able to reproduce the error using the Job manifest provided above, so it seems likely that it is indeed the same issue:

I0929 15:40:10.099624     323 round_trippers.go:466] curl -v -XGET  -H "Accept: application/json" -H "User-Agent: kubectl/v1.25.2 (linux/amd64) kubernetes/5835544" -H "Authorization: Bearer <masked>" 'https://10.0.0.1:443/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-api&limit=500'
I0929 15:40:10.101313     323 round_trippers.go:510] HTTP Trace: Dial to tcp:10.0.0.1:443 succeed
I0929 15:40:20.102170     323 round_trippers.go:553] GET https://10.0.0.1:443/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-api&limit=500  in 10002 milliseconds
I0929 15:40:20.102204     323 round_trippers.go:570] HTTP Statistics: DNSLookup 0 ms Dial 1 ms TLSHandshake 10000 ms Duration 10002 ms
I0929 15:40:20.102212     323 round_trippers.go:577] Response Headers:
{
    "apiVersion": "v1",
    "items": [],
    "kind": "List",
    "metadata": {
        "resourceVersion": ""
    }
}
I0929 15:40:20.102511     323 helpers.go:264] Connection error: Get https://10.0.0.1:443/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-api&limit=500: net/http: TLS handshake timeout
Unable to connect to the server: net/http: TLS handshake timeout
JohnRusk commented 2 years ago

@klolos We have not found the root cause yet.

FYI, in the examples I've looked at, the issue is transient. E.g. if the same operation is retried immediately after the failure it will succeed, i.e. it will successfully create a new connection and do a successful handshake. Is that what you see? Or do you see the problem happening over and over again in a short period of time?

klolos commented 2 years ago

@JohnRusk Yes, the error seems transient. We are running recurring Kubeflow pipelines (which use Argo Workflows), and once every few runs a pipeline step will fail because of this. Subsequent runs may succeed, so the error seems to be happening randomly.
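
For steps that have no client-side retry of their own (the kubectl loop in the repro Job above, for instance), a shell-level retry around the call is usually enough to paper over a transient handshake failure. A minimal sketch of the pattern, not an official mitigation:

# retry the API call a few times; per the discussion above, an immediate retry typically succeeds
for attempt in 1 2 3; do
  if kubectl get pods -n default --selector=app=my-api --output=json; then
    break
  fi
  echo "attempt $attempt failed, retrying..." >&2
  sleep 2
done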

JohnRusk commented 2 years ago

Thanks @klolos . Your issue sounds like it could indeed be the same as what we are looking at in this GitHub issue. We have gathered a lot of information in discussion with @arsnyder16. The root cause does not seem to be in the API Server itself (my area of expertise) so I'm reaching out to some colleagues here at AKS. It's a tricky issue, so the work may take some time.

ghost commented 2 years ago

Action required from @Azure/aks-pm

ghost commented 2 years ago

Issue needing attention of @Azure/aks-leads

rbnis commented 2 years ago

We seem to be facing the same issue in one of our AKS clusters in the westeurope location.

@JohnRusk Did you get any more information on this issue?

mllab-nl commented 2 years ago

@arsnyder16 Thank you for such a nice BUG report

@JohnRusk We are experiencing the same issue: running within the cluster, kubectl sometimes bails out with Unable to connect to the server: net/http: TLS handshake timeout after 10 seconds. API server availability is set to 99.5%. There are no logs from the API server that indicate any kind of restart; in fact, the same API server pods have been running for 15 days and have produced more than 30 of these errors.

The connection is made to 10.0.0.1, which has a single endpoint: the public IP of the API server. It would be very strange if anything in the cluster had a decisive effect on establishing a TLS session with an external public IP. Thus it looks like a problem in the part of the AKS API service that connects the API server pods (run by Azure) with the client (kubectl).

AKS: 1.24.6
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:36:36Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.6", GitCommit:"6c23b67c202a4cfa7c76c3e1b370bd5f0e654f30", GitTreeState:"clean", BuildDate:"2022-11-09T17:13:23Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}

JohnRusk commented 1 year ago

Our networking experts are looking into this. Due to the nature of the issue, it will take a bit more time, I'm sorry to say.

magnus-longva-bouvet commented 1 year ago

I'm experiencing the same issue. A support ticket has been created.

ghost commented 1 year ago

Action required from @Azure/aks-pm

ghost commented 1 year ago

Issue needing attention of @Azure/aks-leads

Alexander-Bartosh commented 1 year ago

@JohnRusk Any update on the issue? We are also experiencing it on a regular basis.

JohnRusk commented 1 year ago

@Alexander-Bartosh Are your timeouts specifically related to TLS handshake failures, and are they rare (e.g. less than 1% of all handshakes fail)? If so, then yes, they sound a lot like this issue. If not, they may be something different.

In terms of progress, one of my colleagues has made good progress recently on what has turned out to be an unexpectedly complex issue. I don't have any dates to share at this stage.

JohnRusk commented 1 year ago

cc @arsnyder16 ^

arsnyder16 commented 1 year ago

@JohnRusk Thanks for the update. Just for clarity: it seems as though your colleague has been able to reproduce this reliably (maybe not frequently) in their environment in order to work on a potential resolution? If so, that is certainly more progress than previously reported and is a good sign.

JohnRusk commented 1 year ago

They have actually set up a repro which is reliable and frequent, by changing a few things, including some internal parameters (basically, forcing the problem to happen). As far as we can tell, your version of the problem appears to be simply a lower frequency version of what my colleague is now able to reproduce. And he has a solution. It needs internal review, implementation and rollout - which will take some time. I don't have any dates to share, I'm sorry.

arsnyder16 commented 1 year ago

It’s great that it seems to have been isolated and solutions have been identified.

Totally understandable on the time frame.

locomoco28 commented 1 year ago

We've been experiencing TLS handshake timeout issues for a few months, each incident lasting for a couple of minutes. At first it was quite rare, but around the end of January it became more frequent. We got in touch with GitLab Support, as we were experiencing the issue in our CI/CD pipelines, but were not able to figure out the issue.

It eventually became less frequent, but it has been getting worse again. I just retried a CI job 8 times over the span of half an hour before it succeeded.

So for now we just have to wait for the fix to be ready? It's been quite cumbersome dealing with this, especially for developers who don't quite understand why their CI pipeline fails.

CVanF5 commented 1 year ago

@locomoco28 We had similar problems with our CI jobs, but as mentioned in https://github.com/Azure/AKS/issues/3047#issuecomment-1196764723 it only happens if a LoadBalancer is connected to the cluster. I verified this by deleting our ingress load balancer, and the issue disappeared. We've since moved our CI runners to a cluster without a LoadBalancer, and since then I consider this issue mitigated. This issue is costing us extra money, so I'm looking forward to the fix.
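
For the setup described earlier in this thread, that amounts to removing the ingress release (and with it the LoadBalancer Service), or deleting just the controller Service; the release and namespace names below are the ones used in the repro above:

helm uninstall nginx --namespace ingress    # removes the controller Service and frees the load balancer frontend / public IP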

locomoco28 commented 1 year ago

Yes, I saw the comment, but I do not want to provision a whole new cluster solely for the purpose of running CI jobs, as I already have a cluster just for tools. I'd much rather host the CI jobs in the same cluster where other tools like GitLab are already hosted.

uncycler commented 1 year ago

We have always experienced some kind of TLS timeout with the AKS API, but for the past few weeks it has been really painful. I would say that about 10-15% of queries are timing out, and this is from workloads running inside the cluster itself.

net/http: TLS handshake timeout on https://10.0.0.1:443/api/v1

Any news regarding the fix?

JohnRusk commented 1 year ago

@uncycler The rate you mention, of 10 to 15%, is about 100x higher than what I've seen with the TLS handshake bug. I wonder if yours may have a different cause..... Or else it's just a very severe case of the issue.

One question, what language is your client app written in? (Asking because we've noticed that the Go Kubernetes client has a retry policy for TLS handshakes, but I'm not sure if other K8s clients have the same thing).

uncycler commented 1 year ago

I see those TLS handshake errors with Python, Go, and Java clients. Most of them retry requests, so the issue is mostly mitigated. For example, I have a Python app running inside the cluster that watches resources (using the official Kubernetes client), and the watch must be restarted about every 5-10 minutes.

But I'm also doing deployments with Terraform from inside the cluster (which also uses it for backend storage) and Helm (the lookup function), and neither will retry if the API is not accessible.

Worst of all are the gitlab-runners, as others have mentioned. The runner does not retry requests after Kubernetes API errors, so the whole job needs to be retried. And for the past few weeks, quite often, the retried job just hangs (not sure that is only related to the API timeouts).

So the 10-15% comes from there, since we constantly need to retry jobs. And it's way worse than it used to be.

marcelhintermann commented 1 year ago

We are encountering TLS handshake errors and "unexpected EOF" errors when making calls to the Kubernetes API. We have implemented the Kubernetes Retry Policy in our apps, which has mitigated the TLS handshake issue. However, despite adding a retry pattern to the GitLab Runner (which is also written in Golang), we are still experiencing "unexpected EOF" errors.

ERROR: Job failed (system failure): prepare environment: error sending request: Post "https://10.244.0.1:443/api/v1/namespaces/gitlab/pods/runner-bzyjak2a-project-102-concurrent-16c49v5/attach?container=helper&stdin=true": unexpected EOF.
ERROR: Job failed (system failure): prepare environment: error sending request: Post "https://10.244.0.1:443/api/v1/namespaces/gitlab/pods/runner-5x5g6zpk-project-96-concurrent-59p72qh/exec?command=sh&command=-c&command=if+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbusybox%2Fsh+%5D%3B+then%0A%09exec+%2Fbusybox%2Fsh+%0Aelse%0A%09echo+shell+not+found%0A%09exit+1%0Afi%0A%0A&container=helper&container=helper&stderr=true&stdin=true&stdout=true": unexpected EOF. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
Job failed (system failure): prepare environment: setting up trapping scripts on emptyDir: error sending request: Post "https://10.244.0.1:443/api/v1/namespaces/gitlab/pods/runner-21sqfvz-project-318-concurrent-1ftx2g/exec?command=sh&command=-c&command=if+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fbin%2Fbash+%5D%3B+then%0A%09exec+%2Fbin%2Fbash+%0Aelif+%5B+-x+%2Fusr%2Flocal%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Flocal%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fusr%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fusr%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbin%2Fsh+%5D%3B+then%0A%09exec+%2Fbin%2Fsh+%0Aelif+%5B+-x+%2Fbusybox%2Fsh+%5D%3B+then%0A%09exec+%2Fbusybox%2Fsh+%0Aelse%0A%09echo+shell+not+found%0A%09exit+1%0Afi%0A%0A&container=build&container=build&stderr=true&stdin=true&stdout=true": unexpected EOF. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information

Could you please provide an estimated time frame for the resolution of this issue? The intermittent failure of the pipeline is causing significant inconvenience and confusion for developers.

marcelhintermann commented 1 year ago

@uncycler: the constantly hanging jobs might be this bug in GitLab when you have the retry keyword in the pipeline.

marcelhintermann commented 1 year ago

Hi @JohnRusk,

Would you be able to share any updates on the fix and an estimated timeline for its availability? We have an open support case related to this issue, and we'd be more than willing to test and provide feedback on the solution once it's ready.

Thanks for your assistance!

JohnRusk commented 1 year ago

@marcelhintermann I'm following up with a colleague regarding your question.

JohnRusk commented 1 year ago

@marcelhintermann I've heard back from my colleague. Work is underway. There are several different parts to the work. We don't have exact timelines to share at this stage, I'm sorry. I will check again with my colleague later this month.

nat45928 commented 1 year ago

We are also seeing this same issue on our AKS cluster. Lots of ERROR: Job failed (system failure): error sending request: Post "https://10.2.0.1:443/api/....": unexpected EOF type failures. Our nodes are on v1.24.10 and we use multiple pools.

JohnRusk commented 1 year ago

@nat45928 I'm not certain that what you're seeing is the same issue. This thread is focussed on TLS handshake errors. Your error example is "unexpected EOF". That could be something after the handshake stage; it could in fact be an EOF while awaiting a result from the operation. If that's the case, it almost certainly has a different root cause than what we are looking at here, I'm sorry.

On the other hand, if you have some evidence that the EOF is happening during the TLS handshake phase, then in that case it might be what we're looking at here.

djgrubson commented 1 year ago

To bump this issue: we are experiencing a similar issue after upgrading AKS to 1.24.9, e.g. ERROR: Job failed (system failure): error sending request: Post "https://10.xx.0.1:443/api or ERROR: Job failed (system failure): prepare environment: setting up build pod: Get "https://10.xx.0.1:443/version": http2: client connection lost. Around 100 events in 24h in our logging system.

marcelhintermann commented 1 year ago

Hi @JohnRusk Is there an update on the timeline for when a fix can be expected? We still have these problems and they still impact our customers. Thanks in advance.

JohnRusk commented 1 year ago

Marcel, I haven't been directly involved in our work on this recently. I've asked one of our key folks to update this thread when there's news to share.

marcelhintermann commented 1 year ago

Hi @JohnRusk Sorry to bother you with this again: is there anyone else who can help with this issue? Our customers are frustrated by it and we need a solution soon.

uncycler commented 1 year ago

Maybe this can give some insight: we are using node-problem-detector (npd, https://github.com/kubernetes/node-problem-detector) in our clusters, and it is quite verbose about those timeouts. All our clusters show the same pattern:

E0426 04:33:10.268093       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 04:48:22.268583       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 05:13:34.268437       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 05:53:49.268865       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 06:29:04.272185       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 07:04:17.268565       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 07:44:31.269006       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 07:59:43.267919       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 11:10:15.268451       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 11:30:28.269038       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 14:00:56.268577       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 16:21:19.268940       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 19:11:47.269095       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 19:16:58.269021       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 19:42:11.268324       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 20:52:27.273455       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 21:02:39.268515       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 21:47:55.269623       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 23:18:17.269093       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 23:38:29.269256       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout                                                                                                                                                  
E0426 23:48:40.268001       1 manager.go:162] failed to update node conditions: Patch "https://10.0.0.1:443/api/v1/nodes/aks-default0777-79917988-vmss0000mf/status": net/http: TLS handshake timeout

JohnRusk commented 1 year ago

@marcelhintermann I'm no longer involved in this myself, as mentioned above. I'll see if I can get someone else to reply here.