This was requested by @jmunroe and @julianpistorius.
Thank you for this @yuvipanda! I'm sure we'll fill any remaining gaps soon. I think this is enough for us to go on for now.
yw, @julianpistorius :)
@yuvipanda How negotiable is the 'scale to 0' requirement? Does it stem from a need to save communities money when they use commercial cloud, or is there some other critical reason for it?

Background: the cluster autoscaler for Jetstream2's managed Kubernetes service can't scale to 0 (yet). However, Jetstream2 resources are provided without charge to qualifying US-based researchers, so hopefully that makes this less of an issue.
@julianpistorius It's primarily because we try to offer multiple machine size options via different node pools; if those pools can't scale to 0, we don't have just 1 node running empty but 3-4. So we'd have to change how we offer spawn options to folks. There are also energy usage concerns with leaving unused machines running, which is particularly significant for GPU instances.
So I'd say the requirement is negotiable in principle, but relaxing it would mean rethinking how we present machine size options. I hope that helps answer the question!
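For context, here is a minimal sketch of how those machine size options get exposed in a z2jh-style JupyterHub config. Each profile pins spawned pods to a different node pool via a node selector, which is why every pool (not just one) sits idle if pools can't scale to 0. The profile names, label key, and instance types below are illustrative assumptions, not our actual config:

```yaml
# Illustrative sketch only: each user-facing profile targets a different
# node pool via a node selector. With 3-4 pools like this, pools that
# cannot scale to 0 leave 3-4 idle nodes running even when unused.
# Instance type values are assumed Jetstream2-style flavor names.
singleuser:
  profileList:
    - display_name: "Small (4 CPU, 32 GB RAM)"
      kubespawner_override:
        node_selector:
          node.kubernetes.io/instance-type: "m3.quad"
    - display_name: "Large (16 CPU, 128 GB RAM)"
      kubespawner_override:
        node_selector:
          node.kubernetes.io/instance-type: "m3.xl"
    - display_name: "GPU (NVIDIA)"
      kubespawner_override:
        node_selector:
          node.kubernetes.io/instance-type: "g3.large"
```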
Thank you for explaining the rationale @yuvipanda! That helps answer my question, and makes a lot of sense.
Even though the OpenStack Cluster API provider doesn't explicitly support scaling from zero (https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1328), it might still be possible according to https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/autoscaling.html#scale-from-zero-support:
> If your Cluster API provider does not have support for scaling from zero, you may still use this feature through the capacity annotations. You may add these annotations to your MachineDeployments, or MachineSets if you are not using MachineDeployments (it is not needed on both), to instruct the cluster autoscaler about the sizing of the nodes in the node group. At the minimum, you must specify the CPU and memory annotations, these annotations should match the expected capacity of the nodes created from the infrastructure.
>
> For example, if my MachineDeployment will create nodes that have "16000m" CPU, "128G" memory, "100Gi" ephemeral disk storage, 2 NVidia GPUs, and can support 200 max pods, the following annotations will instruct the autoscaler how to expand the node group from zero replicas:
```yaml
apiVersion: cluster.x-k8s.io/v1alpha4
kind: MachineDeployment
metadata:
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
    capacity.cluster-autoscaler.kubernetes.io/memory: "128G"
    capacity.cluster-autoscaler.kubernetes.io/cpu: "16"
    capacity.cluster-autoscaler.kubernetes.io/ephemeral-disk: "100Gi"
    capacity.cluster-autoscaler.kubernetes.io/gpu-type: "nvidia.com/gpu"
    capacity.cluster-autoscaler.kubernetes.io/gpu-count: "2"
    capacity.cluster-autoscaler.kubernetes.io/maxPods: "200"
```
I'll work with @sd109 and @mkjpryor from @StackHPC to see what's possible on Jetstream2.
As part of our Project Pythia grant (https://github.com/2i2c-org/meta/issues/769 has more information), we keep an eye on how we can better support running infrastructure on Jetstream2.

As part of this, we have been asked to describe our needs with respect to managed Kubernetes. This issue tracks those questions and collects the answers in a central location.
What we need from a managed Kubernetes offering includes (both items are sketched below):

- A StorageClass with a dynamic provisioner
- nginx-ingress, which runs inside the cluster and needs a single IP / CNAME we can point DNS records to

While not absolutely complete, this is a start!
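To make those two requirements concrete, here is a hedged sketch of the objects we'd expect to work on the cluster. The Cinder CSI provisioner name is what an OpenStack cloud like Jetstream2 would typically use, and the Service shown is a simplified stand-in for what the ingress-nginx chart creates; names and namespaces are assumptions:

```yaml
# A StorageClass backed by a dynamic provisioner. On an OpenStack cloud
# such as Jetstream2 this would typically be the Cinder CSI driver; the
# exact provisioner name and defaults depend on the cluster setup.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: cinder.csi.openstack.org
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
# A LoadBalancer Service in front of nginx-ingress, simplified from what
# the ingress-nginx helm chart creates. The cloud must give this Service
# a single stable IP (or hostname) that DNS records can point to.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443
```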