arsnyder16 opened this issue 2 years ago
@phealy Any updates on this issue? We are seeing this all the time in our GitLab runner cluster in AKS.
Yes indeed. Any updates would be appreciated. Has the issue been received yet? Is it on the backlog? Is it planned in an epic? Has it been prioritized? Do you need any more contributions from stakeholders (or customers)?
We have especially noticed it in our GitLab runner cluster as well, probably because it scales up and down so frequently.
Our workaround is an hourly "primer" CI job during office hours that keeps the cluster active and prevents the issue. This helps to a degree, but not always.
It also happens in other clusters running normal workloads, but there it seems more random and sporadic. Christian.
@phealy I am sure we would all appreciate an update, even if it is not a very positive one. Thank you!
I would also like an update even if it's not promising. I just need to know if it will be worth my time doing the workaround of creating a new Kubernetes cluster without a load balancer for my GitLab runners. If a fix is a few weeks away then I'd rather not bother and just wait it out.
@phealy Any update would be appreciated
I have some updates on this matter. We've been in contact with Microsoft regarding the networking issues. A fix has been rolled out in the West Central US region - we created a cluster there and ran several thousand jobs without encountering this error. The rollout has also started in Northern and Western Europe (as well as possibly other regions, but I can't confirm that it has been fully resolved there).
We've also collaborated with the GitLab Runner team who have implemented a retry mechanism for failed calls to the K8s API. This solution addresses the issue of failed jobs and another problem where a GitLab Job hangs (until it times out - default: 60 minutes). The relevant merge request can be found here: https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/4143 I anticipate that it will be merged and released soon.
On one hand: great, much appreciated!
On the other hand, this issue is more than a year old. Living with this problem was terrible, and I suspect most users either moved to another CI/CD system :) or applied a workaround (as I did, for example):
@locomoco28 We had similar problems with our CI jobs, but as mentioned in #3047 (comment), it only happens if a LoadBalancer is connected to the cluster. I verified this by deleting our ingress load balancer, and the issue disappeared. We've since moved our CI runners to a cluster without a LoadBalancer, and since then I consider the issue mitigated. It is still costing us extra money, so I'm looking forward to the fix.
Anyway, thanks for fixing it at least now.
I understand that this was a difficult issue to reproduce, and on top of that there was some shuffling of ownership of this issue, but thank you so much for the update and for working to fix the issue. It's also cool that the effort was made to work directly with GitLab since a lot of us who were affected were using AKS for hosting GitLab services!
I do hope that future issues that have this level of impact will be able to be resolved more quickly, but I am still grateful for the progress made here even though it took longer.
Thanks for the heads up. I will probably be leaving my company by the time the fix has been rolled out in Germany West Central, but it's good to know that I can let my colleagues know this issue will be resolved soon. It has been quite cumbersome to deal with.
I have been tracking clusters in North Central US and Germany West Central and have yet to see any change in behavior. Here is the frequency of these events per hour:
Unfortunately, today we experienced two more timeouts in West US 2 with GitLab Runner 16.2, which contains the retry mechanism mentioned above. While they don't appear frequently, they still occur, and no retry appears to have been attempted for this particular timeout:
ERROR: Job failed (system failure): prepare environment: setting up trapping scripts on emptyDir: error sending request: Post "https://10.0.0.1:443/api/v1/namespaces
@marcelhintermann Any update on your end?
Seems as though Microsoft has gone mute on us in this thread, @phealy.
So, the latest on my end is that I haven't seen this occur for about three weeks. I can think of only two reasons why it no longer occurs.
westus2
For the record, along with these timeouts, our cluster was regularly getting "CPU Pressure" alerts on a GitLab runner node pool, despite CPU limits being set. I learned this month that "CPU Pressure" alerts occur when node CPU usage reaches >= 95% of the node's allocatable CPU capacity. The CPU limits I had set were too high, so I took the allocatable capacity, multiplied it by 0.94, subtracted the default per-node pod CPU requests (such as kube-proxy), and used the resulting value for the CPU limits (a rough sketch of that calculation follows).
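A rough sketch of that calculation, assuming a placeholder node name and that kubectl reports the values in millicores (neither is taken from the original comment):

```bash
# Placeholder node name; 'kubectl get nodes' shows the real ones.
NODE=aks-runners-12345678-vmss000000

# Allocatable CPU for the node, assumed to be reported in millicores (e.g. "3860m").
# If your cluster reports whole cores (e.g. "4"), convert before doing the math.
ALLOC_M=$(kubectl get node "$NODE" -o jsonpath='{.status.allocatable.cpu}' | sed 's/m$//')

# CPU requested by per-node system pods that always run alongside the jobs.
# 100m for kube-proxy is an assumption; check the actual requests in your cluster.
SYSTEM_M=100

# Suggested limit = 94% of allocatable minus the per-node system requests.
echo "suggested CPU limit: $(( ALLOC_M * 94 / 100 - SYSTEM_M ))m"
```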
Since then, I have no more CPU pressure alerts, and coincidentally no more timeouts. It would be great to hear from others on this thread if their issues are resolved as well.
I suspect the issue is with SNAT.
We create hundreds of AKS clusters per day, and the exact same code that runs fine on EKS/Kind/RKE2 experiences these random net/http: TLS handshake timeout errors.
What we observed:
This points to SNAT because, in Azure, whether a VM has a public IP or sits behind a load balancer affects how SNAT works. See https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-outbound-connections
We then suspected SNAT port exhaustion, but we observed the issue in clusters where the metric showed no SNAT exhaustion at all.
Not having a load balancer is not practical in our case, so we opted to add a public IP to each node. We haven't seen the issue since.
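For anyone who wants to try the same workaround, a minimal az CLI sketch with placeholder resource group, cluster, and pool names (not the commenter's actual setup):

```bash
# Adds a node pool whose VM scale set instances each get their own public IP,
# so outbound traffic from those nodes no longer relies on the SLB outbound SNAT rules.
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name runnerspip \
  --node-count 3 \
  --enable-node-public-ip
```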
@phealy Any update you could share?
@phealy Hello, any updates on this issue?
We are experiencing this issue as well running Argo Events on AKS.
I may be mistaken, but the issue seemed resolved for a long time; recently, though, the timeouts and the long delays for the autoscaler to add a node have returned in our gitlab-runner node pool on AKS v1.26. We are upgrading to 1.27 soon; I'll let you know.
Christian.
@phealy Any update here? I still experience this, and we are closing in on two years since we first started seeing it in our clusters. I opened a Microsoft Support ticket in January 2022 and got nowhere; I was only able to get some traction after opening this issue in June 2022.
@JohnRusk Are you able to get any updates? Back in January 2023 it sounded like Microsoft was able to reproduce the problem and was working on a fix.
The community has not received any update since the issue was assigned to @phealy around September 2022.
@arsnyder16 Apologies for not getting back on this - this turned out to be a very difficult networking bug to nail down. We kept working on it as more information came in, and we were finally able to get a very solid reproduction about 8 weeks ago. That let us capture traces, and the issue has been found; the fix is currently being completed and will start rolling out early next year.
Thank you so much for confirming the resolution for this bug! I'm glad to hear it and anxiously await the deployment of a solution.
Thank you to all of you who've contributed to this and for reminding Azure to give us feedback. :) Christian.
Thanks @phealy! Can you share any details on how this fix will roll out?
Is this something customers will need to take action on (AKS upgrade, node image upgrade, etc.), or is it internal to Microsoft infrastructure?
It's an internal component in the network stack - no customer action will be needed.
@phealy Curious whether this is being tracked publicly somewhere else that we should follow instead; more specifically, are there any other GitHub issues in another repo?
Hi! We hit this issue yesterday during a production upgrade. It seems the problem is still there somewhere.
Do you have any news about this issue? Has it been fixed for all AKS clusters?
Thanks a lot for the help on this subject.
Best regards,
I am still seeing it in our clusters. I am not aware of Microsoft rolling out any fix yet, although the last update was that it might roll out sometime early this year. No definitive date has been communicated.
We are seeing it in the cluster we just upgraded to 1.27; our old one running 1.21 is not impacted. It would be good to hear back and get an update on the progress of the fix.
We may be affected by this as well; however, the errors we've been seeing have not mentioned anything about a TLS handshake so far. We have two different clusters in East US whose pods started reporting timeouts while talking to the API within the past week or so.
Here is a log entry from one of them:
2024-02-05 21:54:49 +0000 [error]: config error file="main.conf" error_class=Fluent::ConfigError error="Invalid Kubernetes API v1 endpoint https://10.0.0.1:443/api: Timed out connecting to server"
I've got a support case open with Azure and have them looking at it now. We are on 1.23.8 and 1.26.6 FWIW.
We are seeing it a lot in our cluster since we upgraded Kubernetes from 1.26 to 1.27; previously it was much less frequent. @phealy when will the fix be rolled out?
This is becoming really serious, as everybody uses GitOps tools such as Argo CD; AKS is just broken!
Please, how can we improve the situation?
The 14th of March 2024 is here and the bug is still with us... Unbelievable.
I would also add that Azure Level 1 customer support is not even able to recognize this problem, which makes engaging with them a waste of time; almost two years later, that is also unbelievable.
Have you tried what I mentioned in https://github.com/Azure/AKS/issues/3047#issuecomment-1721877251? Does this happen when you give a public IP to each node on the cluster? Since that change, we haven't observed it once, and we create/delete hundreds of clusters per day. It is a straightforward workaround until the proper fix arrives.
Hi,
Exposing nodes by assigning a public IP violates security best practices; I don't think it can be considered a legitimate workaround.
By the way, this feature (accessing the control plane by private IP) also solves the problem:
https://learn.microsoft.com/en-us/azure/aks/api-server-vnet-integration
But the feature is in preview and not recommended for production.
It's another Azure shame: private access to the control-plane API has been available by default in GCP and AWS for at least three years, but it is still unavailable in Azure.
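For reference, a sketch of how that preview feature is enabled at cluster creation, assuming the aks-preview CLI extension and the flag name shown in the linked doc (both may change while the feature is in preview, so treat them as assumptions and check the documentation):

```bash
# API Server VNet Integration (preview): the API server is projected into a
# subnet of the cluster VNet, so in-cluster traffic reaches it privately.
az extension add --name aks-preview
az aks create \
  --resource-group my-rg \
  --name my-aks \
  --network-plugin azure \
  --enable-apiserver-vnet-integration
```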
Using a private k8s API solves this issue!
@ebuildy Can you elaborate, please?
Well, I don't work at Azure, but it seems switching AKS between public and private changes a lot of the networking. I don't know what happens under the hood (I think it's network routing), but it works.
Hi, @phealy. Any update on this issue?
Thanks.
Hi @phealy, same problem here. Any news on this issue? Thank you.
Hello,
Some time ago (maybe a year or more) we suffered badly from this problem. It was a nightmare to use GitLab runners on clusters with a load balancer IP, so I mitigated the issue by moving all GitLab runners to a separate cluster that does not have such IPs.
A few days ago I decided to test whether the issue had been fixed, so I updated the runners to the latest version (v16.11.0) and the AKS cluster to v1.28, and ran a few hundred parallel tests (a simple 60-second stress test) - and everything is green!
I did this on 3 clusters (which have services of type LoadBalancer) and the number of failed or stuck tests was 0 out of 1000 jobs. Nice news, so you can give it a try.
Hi @Dima-Diachenko, thanks for your reply. We will upgrade our AKS to v1.28 and check; we'll let you know. Thanks, everybody!
Upgrading to Kubernetes v1.28 doesn't solve this infrastructure networking issue. It comes down to a conntrack explosion, because Azure uses a lot of NAT magic, and they are good at that ^^
The only solution is to use a fully private AKS cluster.
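For completeness, a minimal sketch of creating such a cluster, with placeholder names; a private API server is a create-time choice, so this applies to new clusters:

```bash
# The API server gets a private endpoint inside the cluster network and no
# public endpoint, which per the comments above avoids this failure mode.
az aks create \
  --resource-group my-rg \
  --name my-private-aks \
  --enable-private-cluster
```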
We've had pretty good success with a public AKS cluster (with API IP restrictions). The GitLab runners have come a long way and can tolerate the sorts of errors described in this issue. These runners are on a dedicated AKS cluster with no ingress controller and no LoadBalancer services. Having no LoadBalancer services greatly reduces the frequency of these errors, but they do still occur occasionally.
Here's a log example of our most recent occurrence of this bug (April 26th); the GitLab runner retried the request and recovered. Note the warning is on the runner orchestrator pod, not the job pod.
WARNING: Error streaming logs k8s-amd64-xlarge-runner/runner-00000000-project-00000000-concurrent-5-9kqj0er6/helper:/logs-000000000-000000000/output.log:
error sending request:
Post "https://10.0.0.1:443/api/v1/namespaces/k8s-amd64-xlarge-runner/pods/runner-mxjjyskks-project-0000000-concurrent-5-9kqj0er6/exec?command=gitlab-runner-helper&command=read-logs&command=--path&command=%2Flogs-00000000-0000000000%2Foutput.log&command=--offset&command=4945&command=--wait-file-timeout&command=1m0s&container=helper&container=helper&stderr=true&stdout=true":
dial tcp 10.0.0.1:443: connect: connection refused. Retrying... job=0000000000 project=00000000 runner=0000000
So at least in the GitLab runner use case, I can confirm this bug still occurs, but the GitLab runner can now tolerate these errors and recover. The issue still exists, but for us it has been completely mitigated and no longer impacts production.
You are right: without a load balancer, the Azure networking is different. No public LB is what gets called "private" ^^
Hi, @phealy. Could you help us with this issue? I think there are quite a few customers with this problem. Anyway, what workaround do you recommend among those mentioned in this thread?
Thanks for your support.
There are a few possible mitigations for this until the bug fix rolls out, which (as of my last update) is on track for the August/September timeframe at this point.
The bug occurs only when you have a client behind SLB outbound rules talking to a service behind SLB, both in the same region. Changing part of that equation will prevent the issue from occurring.
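One way to change the client side of that equation would be to move cluster egress off SLB outbound rules, for example onto a managed NAT gateway. A minimal sketch at cluster creation time, with placeholder names; whether this avoids the bug is an inference from the description above, not an official recommendation:

```bash
# Node egress uses a managed NAT gateway instead of SLB outbound rules,
# so workloads are no longer "a client behind SLB outbound rules".
az aks create \
  --resource-group my-rg \
  --name my-aks \
  --outbound-type managedNATGateway
```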
Thanks for your response, @phealy. We are considering API Server VNet Integration, but we are not sure about it since it's still in preview. Do you think it's safe to use this option? (It's a one-way migration.) When is it expected to reach GA?
@phealy Is it possible to provide a bit more detail on the bug itself?
@phealy Is there anywhere for the community to track this progress of this more specifically?
The current status we have is:
"is on track for the August/September timeframe at this point."
Back in December 2023 we had:
"the fix is currently being completed and will start rolling out early next year."
It seems more appropriate to track progress closer to the actual work, since this appears to be an issue outside of AKS that is exposed by how the AKS infrastructure is set up.
@phealy Is this still on track to be fixed August/September ?
Describe the bug: Requests from cluster workloads to the Kubernetes API server intermittently time out or take minutes to complete, depending on the workload's request settings.
To Reproduce. Steps to reproduce the behavior: Provision a new SLA-enabled cluster.
Optionally (this may help reproduce the problem), install the nginx ingress controller; you can leave its replicas at 0 to avoid adding any more noise to the cluster.
Deploy a simple workload that just uses kubectl to list the pods in a namespace; these jobs fail once they detect the issue (a stand-in sketch appears below).
With the kubectl example above, the issue manifests as a timeout during the TLS handshake. What is strange about the kubectl log output is that it does seem to contain the response body, but it is shown as a header.
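The original repro job is not reproduced in this thread; a minimal stand-in of the same shape, assuming a plain kubectl loop with a hard client-side timeout and verbose logging, might look like this:

```bash
#!/usr/bin/env bash
# Keep listing pods until a request hangs or fails, then dump the verbose
# kubectl output so the TLS handshake timeout is visible.
set -u
while true; do
  if ! kubectl get pods -n default --request-timeout=30s -v=6 > /tmp/kubectl-out 2>&1; then
    echo "API request failed or timed out at $(date -u +%FT%TZ)"
    cat /tmp/kubectl-out
    exit 1
  fi
  sleep 5
done
```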
Here is an example of a successful run:
Here is an example of one that fails just a few minutes later:
I experience different behavior with different clients. For example, I have a simple Node.js app that does the same thing, just listing the pods through the Kubernetes SDK. In that environment, I see situations where requests take upwards of 5 minutes to complete.
Expected behavior: Requests should complete in a reasonable amount of time. I see this across many clusters, sometimes every few minutes. To eliminate cluster-specific variables, this is a bare-bones replication of the issue, so it should not suffer from user workloads affecting performance.
Environment (please complete the following information):
Additional context: This seemed to start once we upgraded clusters from 1.20 to 1.21. I first opened a support ticket in January, but it has since been stuck in the support death spiral, gotten nowhere, and has yet to reach a team able to diagnose the issue or even attempt to reproduce it with the simple steps above. I have sent tcpdumps, kubelet logs, etc.
This is not specific to any particular request; we see it across many different requests. We have various workloads that monitor the cluster using the API or dynamically create and modify workloads through the API.
I have not yet been able to reproduce this from outside the cluster; it seems very specific to cluster-to-control-plane communication.
This only seems to be a problem on SLA-enabled clusters. An OpenVPN or aks-link issue? I don't see any recycling of aks-link or anything useful in the logs.
I am really curious whether Konnectivity resolves the problem, but I have yet to see it make it to any of my clusters, which are spread across many different data centers.