Closed: liamgib closed this issue 6 months ago.
Hi liamgib, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.
I might be just a bot, but I'm told my suggestions are normally quite good, as such:
1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, check whether it is already covered by the AKS Troubleshooting guides or AKS Diagnostics.
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!
To clarify: most operations to the control plane do succeed, but these errors are causing a percentage of requests to fail intermittently.
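To put a rough number on that failure rate, here is a minimal probe sketch (hypothetical, not part of the original report): it polls the in-cluster API server with the pod's service account token and tallies timeouts and 5xx responses. It assumes it runs inside a pod with the default service account mounted.

```python
# apiserver_failure_probe.py -- hypothetical probe, not part of the original report.
# Polls the in-cluster API server endpoint and tallies timeouts / 5xx responses,
# to estimate what percentage of requests fail.
# Assumes it runs inside a pod with the default service account token mounted.
import time
import requests

API = "https://kubernetes.default.svc"  # standard in-cluster API server address
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"


def probe(session, headers):
    """Classify one GET /version request as ok, http-<code>, or timeout/error."""
    try:
        resp = session.get(f"{API}/version", headers=headers, verify=CA_PATH, timeout=5)
        return "ok" if resp.status_code == 200 else f"http-{resp.status_code}"
    except requests.RequestException:
        return "timeout/error"


def main(samples=600, interval=1.0):
    with open(TOKEN_PATH) as f:
        headers = {"Authorization": f"Bearer {f.read().strip()}"}
    counts = {}
    with requests.Session() as session:
        for _ in range(samples):
            outcome = probe(session, headers)
            counts[outcome] = counts.get(outcome, 0) + 1
            time.sleep(interval)
    total = sum(counts.values())
    for outcome, n in sorted(counts.items()):
        print(f"{outcome}: {n} ({100.0 * n / total:.1f}%)")


if __name__ == "__main__":
    main()
```

Letting it run for ten minutes or so gives a rough failure percentage to attach to a support ticket.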
Triage required from @Azure/aks-pm
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
Any updates on this?
We are seeing similar issues in our GitLab Runner running on AKS
2023-10-03T14:14:56+02:00 WARNING: Retrying... error=error dialing backend: read unix @->/tunnel-uds/proxysocket: read: connection reset by peer job=654466 project=400 runner=n9AZ8sJG
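For what it's worth, a rough reproduction sketch (hypothetical, not from the original thread): exec, attach, and port-forward requests all traverse the apiserver-to-node tunnel, so the "error dialing backend ... connection reset by peer" resets show up as exceptions when a no-op command is exec'd in a loop. This assumes the official kubernetes Python client; pod and namespace names are placeholders.

```python
# exec_tunnel_probe.py -- hypothetical reproduction sketch, not from the original thread.
# Repeatedly execs a no-op command in an existing pod; exec traffic goes through
# the apiserver tunnel, so intermittent tunnel resets surface here as exceptions.
# Assumes the official `kubernetes` Python client and a reachable kubeconfig.
from kubernetes import client, config
from kubernetes.stream import stream

POD = "example-pod"      # placeholder: any running pod in the cluster
NAMESPACE = "default"    # placeholder namespace


def main(attempts=100):
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    failures = 0
    for i in range(attempts):
        try:
            stream(
                v1.connect_get_namespaced_pod_exec,
                POD,
                NAMESPACE,
                command=["/bin/true"],
                stdout=True, stderr=True, stdin=False, tty=False,
            )
        except Exception as exc:  # tunnel resets surface as API/websocket errors
            failures += 1
            print(f"attempt {i}: {exc}")
    print(f"{failures}/{attempts} exec attempts failed")


if __name__ == "__main__":
    main()
```

Seeing the same resets here would line up with the GitLab Runner warnings above, since the runner's Kubernetes executor uses exec/attach through the same tunnel.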
This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.
Issue needing attention of @Azure/aks-leads
This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.
Action required from @merooney, @bmoore-msft.
Triage required from @Azure/aks-pm @merooney, @bmoore-msft
This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.
This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. liamgib, feel free to comment again in the next 7 days to reopen it, or open a new issue after that time if you still have a question, issue, or suggestion.
What happened: Hi, in two of our clusters we have seen concerning errors in our API server diagnostic logs, and we are also observing operations to the control plane from within the cluster time out or respond with a 503 Service Unavailable error.
In one of our mission critical clusters, we have seen errors of this nature for 100+ days. We've opened multiple Sev A tickets, but the support engineers are unable to provide support and have stopped responding.
and
Note: The above logs are different snippets, as there are thousands of these errors an hour.
We've also seen the following errors when trying to port-forward services.
What you expected to happen: There should be no errors in our control planes, and operations to the API servers should succeed.
How to reproduce it (as minimally and precisely as possible): Unknown
Anything else we need to know?: This is impacting management operations such as kubectl port-forward as well as API server requests from both inside and outside the cluster. When we see an increase in API server errors, we often see the following services crash because they can no longer reach the API server they depend on.
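As a hedged mitigation sketch only, assuming the dependent services use the official Kubernetes Python client (all names below are illustrative): bounded exponential backoff around API calls lets a workload log and retry transient 503s and connection resets instead of crashing outright.

```python
# retry_api_calls.py -- hypothetical mitigation sketch, not from the original thread.
# Bounded exponential backoff around a Kubernetes API call, so a dependent workload
# retries transient 503s / connection resets instead of crashing outright.
# Assumes the official `kubernetes` Python client; names are illustrative only.
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException


def list_pods_with_retry(v1, namespace, attempts=5, base_delay=1.0):
    """Retry list_namespaced_pod on 5xx / transport errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return v1.list_namespaced_pod(namespace)
        except ApiException as exc:
            if exc.status and exc.status < 500:
                raise  # client-side errors (4xx) are not worth retrying
            print(f"API server 5xx on attempt {attempt + 1}: {exc.status}")
        except Exception as exc:  # timeouts, connection resets
            print(f"transport error on attempt {attempt + 1}: {exc}")
        time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("API server still unreachable after retries")


if __name__ == "__main__":
    config.load_incluster_config()  # assumes running inside the cluster
    pods = list_pods_with_retry(client.CoreV1Api(), "default")
    print(f"listed {len(pods.items)} pods")
```

This only softens the symptom on the client side; it does not address the underlying tunnel/API server errors.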
Environment:
Kubernetes version (use kubectl version):
Size of cluster (how many worker nodes are in the cluster?): 30-40 nodes
General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.) Linkerd service mesh, with NodeJs, PHP & Golang microservices talking to Azure Service Bus and CosmosDB.
Others: