arsnyder16 opened this issue 2 years ago
@phealy Any updates on this issue? We are seeing this all the time in our GitLab runner cluster in AKS.
Yes indeed. Any updates would be appreciated. Has the issue been received yet? Is it on the backlog? Is it planned in an epic? Has it been prioritized? Do you need any more contributions from stakeholders (or customers)?
We have especially noticed it in our GitLab runner cluster as well, probably because it scales up and down so frequently.
Our workaround is an hourly "primer" CI job during office hours that keeps the cluster active and prevents the issue. This helps to a degree, but not always.
It also happens in other clusters running normal workloads, but there it seems more random and sporadic. Christian.
@phealy I am sure we would all appreciate an update, even if it is not a very positive one. Thank you!
I would also like an update even if it's not promising. I just need to know if it will be worth my time doing the workaround of creating a new Kubernetes cluster without a load balancer for my GitLab runners. If a fix is a few weeks away then I'd rather not bother and just wait it out.
@phealy Any update would be appreciated
I have some updates on this matter. We've been in contact with Microsoft regarding the networking issues. A fix has been rolled out in the West Central US region - we created a cluster there and ran several thousand jobs without encountering this error. The rollout has also started in Northern and Western Europe (as well as possibly other regions, but I can't confirm that it has been fully resolved there).
We've also collaborated with the GitLab Runner team who have implemented a retry mechanism for failed calls to the K8s API. This solution addresses the issue of failed jobs and another problem where a GitLab Job hangs (until it times out - default: 60 minutes). The relevant merge request can be found here: https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/4143 I anticipate that it will be merged and released soon.
On one hand: great, much appreciated!
On the other hand, this issue is more than a year old. Living with this problem was terrible, and I suspect most users either moved to another CI/CD system :) or applied a workaround (as I did, for example):
@locomoco28 We had similar problems with our CI jobs, but as mentioned in #3047 (comment), it only happens if a LoadBalancer is connected to the cluster. I verified this by deleting our ingress load balancer, and the issue disappeared. We've since moved our CI runners to a cluster without a LoadBalancer, and since then I consider the issue mitigated. It is still costing us extra money, so I'm looking forward to the fix.
Anyway, thanks for fixing it at least now.
I understand that this was a difficult issue to reproduce, and on top of that there was some shuffling of ownership of this issue, but thank you so much for the update and for working to fix the issue. It's also cool that the effort was made to work directly with GitLab since a lot of us who were affected were using AKS for hosting GitLab services!
I do hope that future issues that have this level of impact will be able to be resolved more quickly, but I am still grateful for the progress made here even though it took longer.
Thanks for the heads up. I will probably be leaving my company by the time the fix has been rolled out in Germany West Central, but it's good to know that I can let my colleagues know this issue will be resolved soon. It has been quite cumbersome to deal with.
I have been tracking clusters in North Central US and Germany West Central and have yet to see any change in behavior. Here is the frequency of these events per hour:
Unfortunately, today we experienced two more timeouts in West US 2 with GitLab Runner 16.2, which contains the retry mechanism mentioned above. While they don't appear frequently, they still occur, and no retry appears to have been attempted for this particular timeout:
ERROR: Job failed (system failure): prepare environment: setting up trapping scripts on emptyDir: error sending request: Post "https://10.0.0.1:443/api/v1/namespaces
@marcelhintermann Any update on your end?
Seems as though Microsoft has gone mute on us in this thread, @phealy.
So, the latest on my end is that I haven't seen this occur for about three weeks. I can think of only two reasons why it no longer occurs.
westus2
For the record, along with these timeouts, our cluster was regularly getting "CPU Pressure" alerts on a GitLab runner node pool, despite CPU limits being set. I learned this month that "CPU Pressure" alerts occur when node CPU usage reaches >= 95% of the node's allocatable CPU capacity. The CPU limits I had set were too high, so I took the allocatable capacity, multiplied it by 0.94, subtracted the default per-node pod CPU requests (such as kube-proxy), and used the resulting value for the CPU limits (a rough sketch of that calculation follows).
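A rough sketch of that calculation, assuming a placeholder node name and that kubectl reports the values in millicores (neither is taken from the original comment):

```bash
# Placeholder node name; 'kubectl get nodes' shows the real ones.
NODE=aks-runners-12345678-vmss000000

# Allocatable CPU for the node, assumed to be reported in millicores (e.g. "3860m").
# If your cluster reports whole cores (e.g. "4"), convert before doing the math.
ALLOC_M=$(kubectl get node "$NODE" -o jsonpath='{.status.allocatable.cpu}' | sed 's/m$//')

# CPU requested by per-node system pods that always run alongside the jobs.
# 100m for kube-proxy is an assumption; check the actual requests in your cluster.
SYSTEM_M=100

# Suggested limit = 94% of allocatable minus the per-node system requests.
echo "suggested CPU limit: $(( ALLOC_M * 94 / 100 - SYSTEM_M ))m"
```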
Since then, I have no more CPU pressure alerts, and coincidentally no more timeouts. It would be great to hear from others on this thread if their issues are resolved as well.
I suspect the issue is with SNAT.
We create hundreds of AKS clusters per day, and the exact same code that runs fine on EKS/Kind/RKE2 experiences these random net/http: TLS handshake timeout errors.
What we observed:
This points to SNAT because, in Azure, whether a VM has a public IP or sits behind a load balancer affects how SNAT works. See https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-outbound-connections
We then suspected SNAT port exhaustion, but we observed the issue in clusters where the metric showed no SNAT exhaustion at all.
Not having a load balancer is not practical in our case, so we opted to add a public IP to each node. We haven't seen the issue since.
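For anyone who wants to try the same workaround, a minimal az CLI sketch with placeholder resource group, cluster, and pool names (not the commenter's actual setup):

```bash
# Adds a node pool whose VM scale set instances each get their own public IP,
# so outbound traffic from those nodes no longer relies on the SLB outbound SNAT rules.
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name runnerspip \
  --node-count 3 \
  --enable-node-public-ip
```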
@phealy Any update you could share?
@phealy Hello, any updates on this issue?
We are experiencing this issue as well running Argo Events on AKS.
I may be mistaken, but the issue seemed resolved for a long time; recently, though, the timeouts and the long delays for the autoscaler to add a node have returned in our gitlab-runner node pool on AKS v1.26. We are upgrading to 1.27 soon; I'll let you know.
Christian.
@phealy Any update here? I still experience this, and we are closing in on two years since we first started seeing it in our clusters. I opened a Microsoft Support ticket in January 2022 and got nowhere; I was only able to get some traction after opening this issue in June 2022.
@JohnRusk Are you able to get any updates? Back in January 2023 it sounded like Microsoft was able to reproduce the problem and was working on a fix.
The community has not received any update since the issue was assigned to @phealy around September 2022.
@arsnyder16 Apologies for not getting back on this - this turned out to be a very difficult networking bug to nail down. We kept working on it as more information came in, and we were finally able to get a very solid reproduction about 8 weeks ago. That let us capture traces, and the issue has been found; the fix is currently being completed and will start rolling out early next year.
Thank you so much for confirming the resolution for this bug! I'm glad to hear it and anxiously await the deployment of a solution.
Thank you to all of you who've contributed to this and for reminding Azure to give us feedback. :) Christian.
Thanks @phealy! Can you share any details on how this fix will roll out?
Is this something customers will need to take action on (AKS upgrade, node image upgrade, etc.), or is it internal to Microsoft infrastructure?
It's an internal component in the network stack - no customer action will be needed.
@phealy Curious whether this is being tracked publicly somewhere else that we should follow instead; more specifically, are there any other GitHub issues in another repo?
Hi! We hit this issue yesterday during a production upgrade. It seems the problem is still there somewhere.
Do you have any news about this issue? Has it been fixed for all AKS clusters?
Thanks a lot for the help on this subject.
Best regards,
I am still seeing it in our clusters. I am not aware of Microsoft rolling out any fix yet, although the last update was that it might roll out sometime early this year. No definitive date has been communicated.
We are seeing it in the cluster we just upgraded to 1.27; our old one running 1.21 is not impacted. It would be good to hear back and get an update on the progress of the fix.
We may be affected by this as well; however, the errors we've been seeing have not mentioned anything about a TLS handshake so far. We have two different clusters in East US whose pods started reporting timeouts while talking to the API within the past week or so.
Here is a log entry from one of them:
2024-02-05 21:54:49 +0000 [error]: config error file="main.conf" error_class=Fluent::ConfigError error="Invalid Kubernetes API v1 endpoint https://10.0.0.1:443/api: Timed out connecting to server"
I've got a support case open with Azure and have them looking at it now. We are on 1.23.8 and 1.26.6 FWIW.
We are seeing it a lot in our cluster since we upgraded Kubernetes from 1.26 to 1.27; previously it was much less frequent. @phealy when will the fix be rolled out?
This is becoming really serious, as everybody uses GitOps tools such as Argo CD; AKS is just broken!
Please, how can we improve the situation?
The 14th of March 2024 is here and the bug is still with us... Unbelievable.
I would also add that Azure Level 1 customer support is not even able to recognize this problem, which makes engaging with them a waste of time; almost two years later, that is also unbelievable.
Have you tried what I mentioned in https://github.com/Azure/AKS/issues/3047#issuecomment-1721877251? Does this happen when you give a public IP to each node on the cluster? Since that change, we haven't observed it once, and we create/delete hundreds of clusters per day. It is a straightforward workaround until the proper fix arrives.
Hi,
Exposing nodes by assigning a public IP violates security best practices; I don't think it can be considered a legitimate workaround.
By the way, this feature (accessing the control plane by private IP) also solves the problem:
https://learn.microsoft.com/en-us/azure/aks/api-server-vnet-integration
But the feature is in preview and not recommended for production.
It's another Azure shame: private access to the control-plane API has been available by default in GCP and AWS for at least three years, but it is still unavailable in Azure.
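For reference, a sketch of how that preview feature is enabled at cluster creation, assuming the aks-preview CLI extension and the flag name shown in the linked doc (both may change while the feature is in preview, so treat them as assumptions and check the documentation):

```bash
# API Server VNet Integration (preview): the API server is projected into a
# subnet of the cluster VNet, so in-cluster traffic reaches it privately.
az extension add --name aks-preview
az aks create \
  --resource-group my-rg \
  --name my-aks \
  --network-plugin azure \
  --enable-apiserver-vnet-integration
```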
Using a private k8s API solves this issue!
@ebuildy Can you elaborate, please?
Well, I don't work at Azure, but it seems switching AKS between public and private changes a lot of the networking. I don't know what happens under the hood (I think it's network routing), but it works.
Hi, @phealy. Any update on this issue?
Thanks.
Hi @phealy, same problem here. Any news on this issue? Thank you.
Hello,
Some time ago (maybe a year or more) we suffered badly from this problem. It was a nightmare to use GitLab runners on clusters with a load balancer IP, so I mitigated the issue by moving all GitLab runners to a separate cluster that does not have such IPs.
A few days ago I decided to test whether the issue had been fixed, so I updated the runners to the latest version (v16.11.0) and the AKS cluster to v1.28, and ran a few hundred parallel tests (a simple 60-second stress test) - and everything is green!
I did this on 3 clusters (which have services of type LoadBalancer) and the number of failed or stuck tests was 0 out of 1000 jobs. Nice news, so you can give it a try.
Hi @Dima-Diachenko, thanks for your reply. We will upgrade our AKS to v1.28 and check; we'll let you know. Thanks, everybody!
Upgrading to Kubernetes v1.28 doesn't solve this infrastructure networking issue. It comes down to a conntrack explosion, because Azure uses a lot of NAT magic, and they are good at that ^^
The only solution is to use a fully private AKS cluster.
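For completeness, a minimal sketch of creating such a cluster, with placeholder names; a private API server is a create-time choice, so this applies to new clusters:

```bash
# The API server gets a private endpoint inside the cluster network and no
# public endpoint, which per the comments above avoids this failure mode.
az aks create \
  --resource-group my-rg \
  --name my-private-aks \
  --enable-private-cluster
```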
We've had pretty good success with a public AKS cluster (with API IP restrictions). The GitLab runners have come a long way and can tolerate the sorts of errors described in this issue. These runners are on a dedicated AKS cluster with no ingress controller and no LoadBalancer services. Having no LoadBalancer services greatly reduces the frequency of these errors, but they do still occur occasionally.
Here's a log example of our most recent occurrence of this bug (April 26th); the GitLab runner retried the request and recovered. Note the warning is on the runner orchestrator pod, not the job pod.
WARNING: Error streaming logs k8s-amd64-xlarge-runner/runner-00000000-project-00000000-concurrent-5-9kqj0er6/helper:/logs-000000000-000000000/output.log:
error sending request:
Post "https://10.0.0.1:443/api/v1/namespaces/k8s-amd64-xlarge-runner/pods/runner-mxjjyskks-project-0000000-concurrent-5-9kqj0er6/exec?command=gitlab-runner-helper&command=read-logs&command=--path&command=%2Flogs-00000000-0000000000%2Foutput.log&command=--offset&command=4945&command=--wait-file-timeout&command=1m0s&container=helper&container=helper&stderr=true&stdout=true":
dial tcp 10.0.0.1:443: connect: connection refused. Retrying... job=0000000000 project=00000000 runner=0000000
So at least in the GitLab runner use case, I can confirm this bug still occurs, but the GitLab runner can now tolerate these errors and recover. The issue still exists, but for us it has been completely mitigated and no longer impacts production.
You are right: without a load balancer, the Azure networking is different. No public LB is what gets called "private" ^^
Hi, @phealy. Could you help us with this issue? I think there are quite a few customers with this problem. Anyway, what workaround do you recommend among those mentioned in this thread?
Thanks for your support.
There are a few possible mitigations for this until the bug fix rolls out, which (as of my last update) is on track for the August/September timeframe at this point.
The bug occurs only when you have a client behind SLB outbound rules talking to a service behind SLB, both in the same region. Changing part of that equation will prevent the issue from occurring.
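One way to change the client side of that equation would be to move cluster egress off SLB outbound rules, for example onto a managed NAT gateway. A minimal sketch at cluster creation time, with placeholder names; whether this avoids the bug is an inference from the description above, not an official recommendation:

```bash
# Node egress uses a managed NAT gateway instead of SLB outbound rules,
# so workloads are no longer "a client behind SLB outbound rules".
az aks create \
  --resource-group my-rg \
  --name my-aks \
  --outbound-type managedNATGateway
```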
Thanks for your response, @phealy. We are considering API Server VNet Integration, but we are not sure about it since it's still in preview. Do you think it's safe to use this option? (It's a one-way migration.) When is it expected to reach GA?
@phealy Is it possible to provide a bit more detail on the bug itself?
@phealy Is there anywhere for the community to track this progress of this more specifically?
The current status we have is:
"is on track for the August/September timeframe at this point."
Back in December 2023 we had:
"the fix is currently being completed and will start rolling out early next year."
It seems more appropriate to track progress closer to the actual work, since this appears to be an issue outside of AKS that is exposed by how the AKS infrastructure is set up.
@phealy Is this still on track to be fixed August/September ?
Describe the bug: Requests from cluster workloads to the Kubernetes API server intermittently time out or take minutes to complete, depending on the workload's request settings.
To Reproduce. Steps to reproduce the behavior: Provision a new SLA-enabled cluster.
Optionally (this may help reproduce the problem), install the nginx ingress controller; you can leave its replicas at 0 to avoid adding any more noise to the cluster.
Deploy a simple workload that just uses kubectl to list the pods in a namespace; these jobs fail once they detect the issue (a stand-in sketch appears below).
With the kubectl example above, the issue manifests as a timeout during the TLS handshake. What is strange about the kubectl log output is that it does seem to contain the response body, but it is shown as a header.
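The original repro job is not reproduced in this thread; a minimal stand-in of the same shape, assuming a plain kubectl loop with a hard client-side timeout and verbose logging, might look like this:

```bash
#!/usr/bin/env bash
# Keep listing pods until a request hangs or fails, then dump the verbose
# kubectl output so the TLS handshake timeout is visible.
set -u
while true; do
  if ! kubectl get pods -n default --request-timeout=30s -v=6 > /tmp/kubectl-out 2>&1; then
    echo "API request failed or timed out at $(date -u +%FT%TZ)"
    cat /tmp/kubectl-out
    exit 1
  fi
  sleep 5
done
```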
Here is an example of a successful run:
Here is an example of one that fails just a few minutes later:
I experience different behavior with different clients. For example, I have a simple Node.js app that does the same thing, just listing the pods through the Kubernetes SDK. In that environment, I see situations where requests take upwards of 5 minutes to complete.
Expected behavior: Requests should complete in a reasonable amount of time. I see this across many clusters, sometimes every few minutes. To eliminate cluster-specific variables, this is a bare-bones replication of the issue, so it should not suffer from user workloads affecting performance.
Environment (please complete the following information):
Additional context: This seemed to start once we upgraded clusters from 1.20 to 1.21. I first opened a support ticket in January, but it has since been stuck in the support death spiral, gotten nowhere, and has yet to reach a team able to diagnose the issue or even attempt to reproduce it with the simple steps above. I have sent tcpdumps, kubelet logs, etc.
This is not specific to any particular request; we see it across many different requests. We have various workloads that monitor the cluster using the API or dynamically create and modify workloads through the API.
I have not yet been able to reproduce this from outside the cluster; it seems very specific to cluster-to-control-plane communication.
This only seems to be a problem on SLA-enabled clusters. An OpenVPN or aks-link issue? I don't see any recycling of aks-link or anything useful in the logs.
I am really curious whether Konnectivity resolves the problem, but I have yet to see it make it to any of my clusters, which are spread across many different data centers.