admiraltyio / admiralty

A system of Kubernetes controllers that intelligently schedules workloads across clusters.
https://admiralty.io
Apache License 2.0

Cloud Burst Configuration -> Pod does not trigger auto-scale #202

Open maaft opened 9 months ago

maaft commented 9 months ago

I have two clusters:

  1. on-prem: cluster with 2 GPU nodes
  2. cloud: cluster with 0-5 GPU nodes (auto-scaling is active)

I tried to use the proposed solution for "cloud bursting":

  1. I schedule two GPU workloads on on-prem with the multicluster.admiralty.io/elect: "" label (a minimal manifest sketch follows this list)
  2. both workloads are scheduled by Admiralty and run in my on-prem cluster as expected
  3. I schedule another GPU workload. It now stays Pending forever, since Admiralty doesn't even try to schedule it on my cloud cluster because of apparently missing resources (Insufficient nvidia.com/gpu).
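
A minimal sketch of such an elected GPU workload, for reference (the Job name and CUDA image are placeholders; the only Admiralty-specific part is the multicluster.admiralty.io/elect label on the pod template):

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-burst-example
spec:
  template:
    metadata:
      labels:
        multicluster.admiralty.io/elect: ""   # mark the pod for Admiralty multi-cluster scheduling
    spec:
      restartPolicy: Never
      containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder GPU image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1                 # request one GPU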

How can I tell Admiralty to schedule onto my cloud cluster even though there are currently no free resources there, so that the scheduled Pod can trigger auto-scaling?

If this is not possible, I don't see the point of "cloud bursting" here, since I'd need to keep my cloud resources always on and pay for them.

maaft commented 9 months ago

I also noticed that even when my cloud cluster has nvidia.com/gpu = 1, this information is not propagated to the virtual node in my on-prem cluster.
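
A quick way to compare just the resource totals on the real node and the virtual node (a sketch; the --context names are placeholders for however the two clusters are reached):

kubectl --context cloud get node aks-gpu-10181809-vmss000000 -o jsonpath='{.status.allocatable}'
kubectl --context on-prem get node admiralty-aks101cluster -o jsonpath='{.status.allocatable}'

The full describe output for both nodes is below.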

cloud cluster

kubectl describe node aks-gpu-10181809-vmss000000

Name:               aks-gpu-10181809-vmss000000
Roles:              agent
Labels:             accelerator=nvidia
Unschedulable:      false
Lease:
  HolderIdentity:  aks-gpu-10181809-vmss000000
  AcquireTime:     <unset>
  RenewTime:       Tue, 21 Nov 2023 13:44:43 +0100

Capacity:
  cpu:                4
  ephemeral-storage:  129886128Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             28736348Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                3860m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             24487772Ki
  nvidia.com/gpu:     1
  pods:               110
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                670m (17%)   2700m (69%)
  memory             1240Mi (5%)  5660Mi (23%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
  nvidia.com/gpu     0            0

on-prem cluster

kubectl describe node admiralty-aks101cluster

Name:               admiralty-aks101cluster
Roles:              agent
Labels:             alpha.service-controller.kubernetes.io/exclude-balancer=true
                    kubernetes.io/role=agent
                    multicluster.admiralty.io/cluster-target-name=aks101cluster
                    node-role.kubernetes.io/agent=
                    node.kubernetes.io/exclude-from-external-load-balancers=true
                    type=virtual-kubelet
                    virtual-kubelet.io/provider=admiralty
Taints:             virtual-kubelet.io/provider=admiralty:NoSchedule

Non-terminated Pods:          (0 in total)
  Namespace                   Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)

Is this (another) bug?

adrienjt commented 8 months ago

The virtual node's capacity and allocatable should include nvidia.com/gpu; that aggregation is implemented here: https://github.com/admiraltyio/admiralty/tree/master/pkg/controllers/resources

So I suspect a configuration issue. Are you able to run the quick start on these clusters?
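
A minimal check along the lines of the quick start (a sketch; the --context names and the nginx image are placeholders, and it assumes the Source/Target objects from the installation are already in place):

# source cluster: run an elected pod and see where it lands
kubectl --context on-prem run burst-test \
  --image=nginx \
  --labels='multicluster.admiralty.io/elect=' \
  --restart=Never

# the proxy pod should be scheduled on the virtual node in the source cluster
kubectl --context on-prem get pods -o wide

# the delegate pod should appear in the target cluster
kubectl --context cloud get pods

# and the virtual node itself should exist and report the target's resources
kubectl --context on-prem get nodes -l virtual-kubelet.io/provider=admiralty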