[BUG] Internal load balancer unstable when autoscaling cluster

thomastvedt commented 1 year ago

Describe the bug: When the cluster is autoscaling, a service exposed on the internal load balancer stops responding for a short while. It looks like the internal load balancer becomes unstable while the cluster is autoscaling.

More details:
We enabled autoscaling on our AKS cluster. It normally runs with 2 nodes; when traffic increases it scales up to 3 for a short while, and then back down to 2.

We have a Redis service hosted in our cluster, exposed as a LoadBalancer service with annotations that request an internal load balancer:

  service:
    type: LoadBalancer
    ports:
      redis: 6379
    externalTrafficPolicy: Cluster
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: "true"
      service.beta.kubernetes.io/azure-load-balancer-ipv4: 10.111.16.123

This makes our Redis service available outside the cluster, inside our vnet, reachable by a different service that currently runs on a separate virtual machine.

When the cluster auto-scales up / down, we observe Redis timeout exceptions in our service running outside the cluster.

We observe a drop/spike in traffic to our Redis service.

The Redis pod/instance is not killed when autoscaling.

There is a drop in "health probe status" on the underlying load balancer resource in the Azure control panel, and also in data path availability.

We also observe an event in our cluster when this happens: "Updated load balancer with new hosts".
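
One way to check whether the timeouts line up with the "Updated load balancer with new hosts" events is to compare timestamps. The following is a minimal Python sketch, assuming the event times (for example from `kubectl get events`) and the client-side timeout times have already been collected as `datetime` objects. The function name, the 60-second window, and the sample timestamps are illustrative assumptions, not data from this issue:

```python
from datetime import datetime, timedelta


def failures_near_events(failure_times, event_times, window_seconds=60):
    """Return the failures that occurred within window_seconds of any event."""
    window = timedelta(seconds=window_seconds)
    return [
        f for f in failure_times
        if any(abs(f - e) <= window for e in event_times)
    ]


# Hypothetical timestamps, for illustration only (not taken from the issue).
events = [datetime(2023, 1, 1, 12, 0, 0)]
failures = [
    datetime(2023, 1, 1, 12, 0, 20),
    datetime(2023, 1, 1, 12, 0, 45),
    datetime(2023, 1, 1, 15, 30, 0),
]
matched = failures_near_events(failures, events)
print(f"{len(matched)} of {len(failures)} failures fall within 60 s of an event")
```

If most failures cluster around the events, that points at the load balancer backend update rather than at Redis itself.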

Expected behavior: I expected the internal load balancer to keep working without interruption while the cluster is autoscaling.

Environment:

Kubernetes version: 1.25.5

How can I troubleshoot this further? Can I view logs from the internal load balancer? Where can I find logs?
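
One possible way to gather more evidence from outside the cluster is a small TCP probe that runs on the VM and records when connections to the load balancer IP fail. This is only a sketch: `probe_once` and `run_probe` are hypothetical helper names, the timeout and interval values are arbitrary, and a successful TCP connect only shows that the path through the load balancer is open, not that Redis answers commands.

```python
import socket
import time


def probe_once(host, port, timeout=1.0):
    """Try one TCP connection; return (ok, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False, time.monotonic() - start


def run_probe(host, port, count, interval=1.0, timeout=1.0):
    """Probe repeatedly; return a list of (unix_timestamp, ok, elapsed)."""
    results = []
    for i in range(count):
        ok, elapsed = probe_once(host, port, timeout)
        results.append((time.time(), ok, elapsed))
        if not ok:
            print(f"{time.strftime('%H:%M:%S')} connect failed after {elapsed:.3f}s")
        if i < count - 1:
            time.sleep(interval)
    return results
```

For example, calling `run_probe("10.111.16.123", 6379, count=3600)` on the VM would probe the address from the service definition above once per second for about an hour, and the printed failure times can then be compared with the autoscaler activity.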

ghost commented 1 year ago

Action required from @Azure/aks-pm

palma21 commented 1 year ago

What kind of network plugin are you using? This can happen on scale down if there are connections on the node being scaled down that kube-proxy is forwarding to another node; those connections will be reset. Is that what you're seeing? It might be helpful to get a ticket going so we can take a look.
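
To illustrate the scale-down case described above, here is a deliberately simplified Python model. It assumes that with `externalTrafficPolicy: Cluster` the load balancer may deliver a connection to any node, and that kube-proxy on that node forwards it to the node running the pod. The `Connection` class, the node names, and the client names are hypothetical, and the model ignores everything else that happens during a real node removal:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    client: str
    entry_node: str  # node the load balancer delivered the connection to
    pod_node: str    # node where the backend pod (here, Redis) runs


def reset_on_removal(connections, removed_node):
    """Connections that traverse removed_node, as entry node or pod node."""
    return [
        c for c in connections
        if removed_node in (c.entry_node, c.pod_node)
    ]


# Hypothetical 3-node cluster with Redis on node-1.
conns = [
    Connection("vm-a", entry_node="node-1", pod_node="node-1"),
    Connection("vm-b", entry_node="node-2", pod_node="node-1"),
    Connection("vm-c", entry_node="node-3", pod_node="node-1"),
]
for c in reset_on_removal(conns, "node-3"):
    print(f"{c.client}: reset (entered via {c.entry_node}, pod on {c.pod_node})")
```

In this model a connection can be reset even when the pod is not on the removed node, because it entered the cluster through that node. The model says nothing about scale-up.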

thomastvedt commented 1 year ago

What kind of network plugin are you using?

Network type (plugin): Azure CNI
Kubernetes version: 1.25.5

This can happen on scale down if there are connections on the node being scaled down that kube-proxy is forwarding to another node; those connections will be reset. Is that what you're seeing?

I'm not sure how I can tell if this is what I'm seeing 😅

This happens when scaling down from 3 to 2 nodes, even if Redis was not on the node that was removed.
This also happens when scaling up from 2 to 3 nodes, even if Redis was not touched or moved.

It might be helpful to get a ticket going so we can take a look.

That would be great, however, I'll have to check with management if we can purchase a support plan first ;)

In the meantime, please let me know if there is any more information I can provide that would be helpful 🙏
