thomastvedt opened 1 year ago
Action required from @Azure/aks-pm
What kind of network plugin are you using? This can happen on scale down if there are any connections on the node being scaled down that are being forwarded by kube-proxy to some other node; those connections will be reset. Is that what you're seeing? It might be helpful to get a ticket going so we can take a look
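As an aside (this mitigation is not suggested anywhere in the thread, so treat it as an assumption to verify): the cross-node forwarding described above can be avoided by keeping traffic on the node that receives it, roughly:

```yaml
spec:
  type: LoadBalancer
  # With "Local", only nodes that run a ready backend pod pass the
  # load balancer's health probe, so kube-proxy never forwards a
  # connection to a different node (and never resets it on scale down).
  externalTrafficPolicy: Local
```

The trade-off is that load is no longer spread evenly across nodes, and the client source IP is preserved.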
What kind of network plugin are you using?
Network type (plugin): Azure CNI
Kubernetes version: 1.25.5
This can happen on scale down if there are any connections on the node being scaled down that are being forwarded by kube-proxy to some other node; those connections will be reset. Is that what you're seeing?
I'm not sure how I can tell if this is what I'm seeing 😅
It might be helpful to get a ticket going so we can take a look
That would be great; however, I'll have to check with management if we can purchase a support plan first ;)
In the meantime, please let me know if there is any more information from me that could be helpful 🙏
Issue needing attention of @Azure/aks-leads
We disabled node autoscaling because of this issue. We are still very interested in a fix, since we could run on two nodes most of the time and three nodes during work hours.
There were some changes in the probe setup at the cloud-provider-azure level; check out https://cloud-provider-azure.sigs.k8s.io/topics/loadbalancer/#custom-load-balancer-health-probe You might want to explicitly specify the default path for the probe.
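For example (a sketch based on the linked cloud-provider-azure docs; check the annotation names and values against the documentation for your cluster's version before relying on them):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    # Probe over HTTP against an explicit path instead of the default probe
    service.beta.kubernetes.io/azure-load-balancer-health-probe-protocol: "http"
    service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/healthz"
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8080
```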
Describe the bug
When the cluster is autoscaling, a service exposed on the internal load balancer stops responding for a short while. It looks like the internal load balancer becomes unstable while the cluster is autoscaling.
More details:
We enabled autoscaling on our AKS cluster. Normally this runs at 2 nodes; when traffic increases it is scaled up to 3 for a short while, and then back down to 2 nodes.
We have a Redis service hosted in our cluster, exposed as a LoadBalancer service, using annotations to make it use an internal load balancer:
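Roughly like this (a reconstructed sketch — the original manifest isn't shown in the thread; the internal-LB annotation is documented by Azure, while the names, selector, and ports are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: redis
  annotations:
    # Provision an internal (vnet-only) Azure load balancer instead of a public one
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: redis
  ports:
    - port: 6379
      targetPort: 6379
```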
This makes our Redis service available outside the cluster, inside our vnet, reachable by a different service that currently runs on a separate virtual machine.
When the cluster auto-scales up / down, we observe Redis timeout exceptions in our service running outside the cluster.
We observe a drop/spike in traffic to our Redis service:
![image](https://user-images.githubusercontent.com/4746639/235874852-efce89a2-6eca-4646-b000-8849b5ebf981.png)
The Redis pod/instance is not killed when autoscaling.
There is a drop in "health probe status" on the underlying load balancer resource in the Azure portal:
And data path availability:
We also observe an event in our cluster when this happens: "Updated load balancer with new hosts".
Expected behavior
I expected the internal load balancer to remain fully available while the cluster is autoscaling.
Environment (please complete the following information):
How can I troubleshoot this further? Can I view logs from the internal load balancer, and where can I find them?