aerospike / aerospike-kubernetes-operator

Kubernetes operator for the Aerospike database
https://docs.aerospike.com/cloud/kubernetes/operator
Apache License 2.0

Karpenter scaling with k8sNodeBlockList throws errors #305

Open mateusmuller opened 2 months ago

mateusmuller commented 2 months ago

Folks,

I updated a static AerospikeCluster manifest, adding a bunch of EKS nodes to k8sNodeBlockList (the change is sketched after the pod list below). This triggered an update as expected:

NAME                     READY   STATUS    RESTARTS   AGE
aerospike-1-0   2/2     Running   0          7h17m
aerospike-1-1   2/2     Running   0          7h16m
aerospike-1-2   0/2     Pending   0          82s
aerospike-2-0   2/2     Running   0          4h17m
aerospike-2-1   2/2     Running   0          4h17m
aerospike-2-2   2/2     Running   0          4h17m
aerospike-3-0   2/2     Running   0          27h
aerospike-3-1   2/2     Running   0          28h
aerospike-3-2   2/2     Running   0          27h
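
The change itself was just adding node names to spec.k8sNodeBlockList, roughly like this (a sketch with placeholder node names, not our real ones):

spec:
  k8sNodeBlockList:
    - ip-10-0-1-23.ec2.internal    # placeholder EKS node name
    - ip-10-0-2-45.ec2.internal    # placeholder EKS node name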

However, pod aerospike-1-2 stays Pending forever. This is the error message from Karpenter:

2024-07-24T20:46:27.479Z    DEBUG   controller.provisioner  ignoring pod, label kubernetes.io/hostname is restricted; specify a well known label: [karpenter.k8s.aws/instance-category karpenter.k8s.aws/instance-cpu karpenter.k8s.aws/instance-encryption-in-transit-supported karpenter.k8s.aws/instance-family karpenter.k8s.aws/instance-generation karpenter.k8s.aws/instance-gpu-count karpenter.k8s.aws/instance-gpu-manufacturer karpenter.k8s.aws/instance-gpu-memory karpenter.k8s.aws/instance-gpu-name karpenter.k8s.aws/instance-hypervisor karpenter.k8s.aws/instance-local-nvme karpenter.k8s.aws/instance-memory karpenter.k8s.aws/instance-network-bandwidth karpenter.k8s.aws/instance-pods karpenter.k8s.aws/instance-size karpenter.sh/capacity-type karpenter.sh/provisioner-name kubernetes.io/arch kubernetes.io/os node.kubernetes.io/instance-type topology.kubernetes.io/region topology.kubernetes.io/zone], or a custom label that does not use a restricted domain: [k8s.io karpenter.k8s.aws karpenter.sh kubernetes.io] {"commit": "dc3af1a", "pod": "datastore-shared/aerospike-1-2"}

Basically, Karpenter doesn't allow kubernetes.io/hostname in node affinity. This is what the operator generates when k8sNodeBlockList is set:

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
          - key: kubernetes.io/hostname
            operator: NotIn
            values:
            - <list of nodes>

I found an issue in the Karpenter repo describing the same problem, where they say this usage is not supported.

Can you please share your thoughts on whether this can be improved somehow? Thanks.

sud82 commented 2 months ago

@mateusmuller Thanks for reporting. We are looking into this.

mateusmuller commented 2 months ago

Hey @sud82, thanks for looking into this.

Sorry to be pushy, but would you have any updates about this issue?

Karpenter is our official tool for compute autoscaling (as it is for pretty much every EKS user). This might be a go/no-go decision for us.

Thanks!

sud82 commented 2 months ago

Hi @mateusmuller, we have looked into this. Here are our findings:

When a user populates the k8sNodeBlockList, AKO sets node affinity with the key kubernetes.io/hostname to prevent pods from being scheduled on the nodes listed in the k8sNodeBlockList.

However, Karpenter restricts the use of the kubernetes.io/hostname label as it may interfere with its internal scheduling mechanisms. You can find the relevant code reference here: karpenter/pkg/apis/v1beta1/labels.go at d5660acf4472db796d5f4fac58a147d14b320451 · kubernetes-sigs/karpenter

This issue does not occur when using the Kubernetes Cluster Autoscaler.

Recovery:

Once nodes are removed from the k8sNodeBlockList, the Karpenter autoscaler resumes normal operation and can scale the pods as expected.

Suggestion:

After migrating Aerospike pods from all nodes listed in the k8sNodeBlockList to other nodes, users should clear the k8sNodeBlockList from the spec.
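
A rough sketch of that flow in the spec (placeholder node names; when the migration is actually complete depends on your cluster):

# Step 1: block the nodes; AKO migrates the Aerospike pods off them
spec:
  k8sNodeBlockList:
    - ip-10-0-1-23.ec2.internal    # placeholder
    - ip-10-0-2-45.ec2.internal    # placeholder
---
# Step 2: once all Aerospike pods have moved off these nodes, clear the list
# so the generated affinity no longer uses kubernetes.io/hostname and
# Karpenter can provision capacity for new pods again
spec:
  k8sNodeBlockList: []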

I know the suggestion seems a bit inconvenient, but the Karpenter and k8sNodeBlockList features have conflicting requirements, so our options are very limited.

Can you please explain your use case where you want to use k8sNodeBlockList along with auto-scaling, what kind of node storage you have, and so on?

mateusmuller commented 2 months ago

Hello @sud82,

Can you please explain your use case where you want to use k8sNodeBlockList along with auto-scaling, what kind of node storage you have, and so on?

The use case for k8sNodeBlockList is the same as described in your docs:

List of Kubernetes nodes that are disallowed for scheduling the Aerospike pods. Pods are not scheduled on these nodes and migrated from these nodes if already present.

When would I use this? When I want to rotate the nodes. Upon rotation, Karpenter can pull the latest AMI from AWS with security patches and/or new features. AFAIK, that's basic usage of the Kubernetes ecosystem.

We use ebs-csi for /opt/aerospike and local-static to expose the EC2 instance store NVMe drives; a rough sketch of that part of the spec is below. That said, the underlying storage doesn't seem relevant here, unless I misunderstood something.
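
Roughly what that looks like in the AerospikeCluster spec (a sketch; the storage class names, sizes, and device path are stand-ins for our real values):

storage:
  volumes:
    - name: workdir
      aerospike:
        path: /opt/aerospike
      source:
        persistentVolume:
          storageClass: ebs-csi-gp3      # EBS CSI backed class (stand-in name)
          volumeMode: Filesystem
          size: 5Gi
    - name: ns-data
      aerospike:
        path: /dev/nvme-data             # stand-in device path for the namespace storage
      source:
        persistentVolume:
          storageClass: local-static     # local static provisioner class exposing instance store NVMe
          volumeMode: Block
          size: 300Gi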

To be clear on how to move forward: there will be no changes on the AKO side to be compatible with Karpenter, is that correct?

If so, I would recommend removing Karpenter from your autoscaling doc here, since it's not fully compatible with your features, and keeping only Cluster Autoscaler.

sud82 commented 2 months ago

Thanks for all the details @mateusmuller.

I wanted to say that at present there is no workaround or quick fix for this. We learned about it when you reported this issue, but we will try to find a solution in the future. We definitely want to support Karpenter.

The main issue is that we use the kubernetes.io/hostname label to ensure that pods are not scheduled on the given k8s nodes (which is also your requirement), but Karpenter doesn't allow using that label. That's why this feature is not working.

We need to find a new way to disallow pods on the given k8s nodes, and that will take some time. We will also reach out to the Karpenter team to get their perspective; at present, this looks like a sweeping check on Karpenter's side.
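
For illustration only, not a committed plan: the Karpenter error message itself says that custom labels outside the restricted domains are allowed, so one direction we could explore is labeling blocked nodes with a custom label and excluding them through that label instead of kubernetes.io/hostname. The generated affinity could then look roughly like this (the label key is hypothetical):

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
          - key: aerospike.com/node-blocked   # hypothetical custom label set on blocked nodes
            operator: DoesNotExist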