Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] Internal load balancer unstable when autoscaling cluster #3634

Open thomastvedt opened 1 year ago

thomastvedt commented 1 year ago

Describe the bug
When the cluster autoscales, a service exposed on the internal load balancer stops responding for a short while. The internal load balancer appears to become unstable while the cluster is autoscaling.

More details:
We enabled autoscaling on our AKS cluster. Normally it runs on 2 nodes; when traffic increases it scales up to 3 nodes for a short while, then back down to 2.

We have a Redis service hosted in our cluster, exposed as a LoadBalancer service, with annotations that place it on an internal load balancer:

  service:
    type: LoadBalancer
    ports:
      redis: 6379
    externalTrafficPolicy: Cluster
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: "true"
      service.beta.kubernetes.io/azure-load-balancer-ipv4: 10.111.16.123

This makes our Redis service available outside the cluster, inside our vnet, reachable by a different service that currently runs on a separate virtual machine.

When the cluster auto-scales up / down, we observe Redis timeout exceptions in our service running outside the cluster.

We observe a drop/spike in traffic to our Redis service: image

The Redis pod/instance is not killed when autoscaling.

There is a drop in "health probe status" on the underlying load balancer resource in the Azure portal:

image

And data path availability:

image

We also observe an event in our cluster when this happens: "Updated load balancer with new hosts".

Expected behavior
I expected the internal load balancer to remain fully available while the cluster is autoscaling.

How can I troubleshoot this further? Can I view logs from the internal load balancer? Where can I find logs?

ghost commented 1 year ago

Action required from @Azure/aks-pm

palma21 commented 1 year ago

What kind of network plugin are you using? This can happen on scale down if any connections on the node being scaled down are being forwarded by kube-proxy to some other node; those connections will be reset. Is that what you're seeing? It might be helpful to get a ticket going so we can take a look.
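
To illustrate the forwarding hop described here: with externalTrafficPolicy: Cluster, the load balancer may send a connection to any node, and kube-proxy then forwards it to the node that runs the pod. With externalTrafficPolicy: Local, the load balancer sends traffic only to nodes that host a ready pod, so connections do not pass through an unrelated node that may later be scaled down. Below is a sketch that reuses the values from the original report and changes only the policy. Whether this removes the timeouts reported in this issue is an assumption; nothing in this thread confirms it.

```yaml
service:
  type: LoadBalancer
  ports:
    redis: 6379
  # Changed from Cluster. Assumption: a possible mitigation, not a confirmed fix.
  # Local delivers traffic only to nodes running a ready Redis pod,
  # which avoids the extra kube-proxy hop through another node.
  externalTrafficPolicy: Local
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-ipv4: 10.111.16.123
```

The trade-off is that only nodes hosting the pod pass the health probe. If the Redis pod is rescheduled to another node, traffic is interrupted until the probe marks the new node healthy.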

thomastvedt commented 1 year ago

> What kind of network plugin are you using?

Network type (plugin): Azure CNI
Kubernetes version: 1.25.5

> This can happen on scale down if any connections on the node being scaled down are being forwarded by kube-proxy to some other node; those connections will be reset. Is that what you're seeing?

I'm not sure how I can tell if this is what I'm seeing 😅

> It might be helpful to get a ticket going so we can take a look.

That would be great. However, I'll have to check with management whether we can purchase a support plan first ;)

image

In the meantime, please let me know if there is any more information I can provide that would be helpful 🙏

ghost commented 1 year ago

Issue needing attention of @Azure/aks-leads

thomastvedt commented 4 months ago

We disabled node autoscaling because of this issue. We are still very interested in a fix, since with autoscaling we could run on 2 nodes most of the time and 3 nodes during work hours.

aslafy-z commented 3 months ago

There were some changes to the probe setup at the cloud-provider-azure level; check out https://cloud-provider-azure.sigs.k8s.io/topics/loadbalancer/#custom-load-balancer-health-probe for details. You might want to specify the default path for the probe.
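
The linked page describes per-port health probe annotations. Below is a minimal sketch applied to the values from the original report. It assumes the annotation names documented on that page (service.beta.kubernetes.io/port_{port}_health-probe_protocol and its siblings) are supported by the cloud-provider-azure version running in the cluster, so check the page against your version. Redis speaks plain TCP, so no request path is set; a path only applies to HTTP or HTTPS probes. The interval and probe count are illustrative values, not recommendations from this thread.

```yaml
service:
  type: LoadBalancer
  ports:
    redis: 6379
  externalTrafficPolicy: Cluster
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-ipv4: 10.111.16.123
    # Assumption: explicit TCP probe for the Redis port (6379).
    service.beta.kubernetes.io/port_6379_health-probe_protocol: "Tcp"
    # Illustrative values: probe every 5 seconds, 2 probes to change state.
    service.beta.kubernetes.io/port_6379_health-probe_interval: "5"
    service.beta.kubernetes.io/port_6379_health-probe_num-of-probe: "2"
```

For a service that exposes an HTTP endpoint, the corresponding port_{port}_health-probe_request-path annotation described on the same page would be the place to set the probe path.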
