gardener/autoscaler

Customised fork of cluster-autoscaler to support machine-controller-manager
Apache License 2.0

Allow a way to specify extended resources for scale-from-zero scenario #132

Open himanshu-kun opened 2 years ago

himanshu-kun commented 2 years ago

What would you like to be added: There should be a mechanism in our autoscaler for the user to specify any extended resources their nodes have, so that the autoscaler becomes aware of them and can consider them when scaling out a node group from zero.

Why is this needed: It has been noticed that the autoscaler currently cannot scale a node group from zero if a pod requests an extended resource. This happens because the nodeTemplate the autoscaler creates does not have the extended resources specified. It can, however, scale from one, because there the autoscaler can form the nodeTemplate from an existing node. The AWS and Azure implementations of the autoscaler already offer ways to specify such resources.
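For illustration, a pod like the sketch below is enough to block scale-from-zero today, because the nodeTemplate the autoscaler synthesizes carries no entry for the extended resource. The resource name example.com/dongle, the pod name, and the image are all hypothetical:

    import (
        apiv1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // A pending pod requesting one unit of an extended resource; Kubernetes
    // requires extended resources to be set with equal requests and limits.
    var podNeedingDongle = &apiv1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: "needs-dongle"},
        Spec: apiv1.PodSpec{
            Containers: []apiv1.Container{{
                Name:  "app",
                Image: "registry.example/app:1.0",
                Resources: apiv1.ResourceRequirements{
                    Requests: apiv1.ResourceList{"example.com/dongle": resource.MustParse("1")},
                    Limits:   apiv1.ResourceList{"example.com/dongle": resource.MustParse("1")},
                },
            }},
        },
    }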

himanshu-kun commented 2 years ago

I have thought of some possible solutions:

1. We could enhance the logic of using an existing node in the worker group (introduced in this PR) to also include extended resources. However, this will not solve the case where we don't have any node in any zone of the worker group.
2. We could enhance the nodeTemplate passing through shoot YAML feature (enabled for AWS and Azure as of now) to also pass extended resources. However, this approach comes with certain drawbacks.

cc @unmarshall @ashwani2k

himanshu-kun commented 2 years ago

Upstream issue worth noting: https://github.com/kubernetes/autoscaler/issues/1869

himanshu-kun commented 1 year ago

Post-grooming decision

Specify the node template in the providerConfig section of the worker pool. The corresponding provider extension will pick it up and populate the worker config, which contains the NodeTemplate; this in turn is used to generate the machine class. At the moment, the CA code does not consider ephemeral storage in the scale-from-zero case.
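As an illustration (not taken from the issue itself), the NodeTemplate that ends up in the machine class could then carry ephemeral storage and extended resources in its capacity. The field names below follow machine-controller-manager's machine.sapcloud.io/v1alpha1 NodeTemplate; the instance type, region, zone, resource values, and the extended-resource name example.com/dongle are all hypothetical:

    import (
        apiv1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"

        "github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
    )

    // A NodeTemplate whose Capacity lists ephemeral storage and an extended
    // resource, so the autoscaler can build an accurate template node even
    // when the node group is at zero.
    var nodeTemplate = v1alpha1.NodeTemplate{
        InstanceType: "m5.large",
        Region:       "eu-west-1",
        Zone:         "eu-west-1a",
        Capacity: apiv1.ResourceList{
            apiv1.ResourceCPU:              resource.MustParse("2"),
            apiv1.ResourceMemory:           resource.MustParse("8Gi"),
            apiv1.ResourceEphemeralStorage: resource.MustParse("50Gi"),
            "example.com/dongle":           resource.MustParse("1"), // hypothetical extended resource
        },
    }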

Inside GetMachineDeploymentNodeTemplate:

    if len(filteredNodes) > 0 {
        klog.V(1).Infof("Nodes already existing in the worker pool %s", workerPool)
        baseNode := filteredNodes[0]
        klog.V(1).Infof("Worker pool node used to form template is %s and its capacity is cpu: %s, memory:%s", baseNode.Name, baseNode.Status.Capacity.Cpu().String(), baseNode.Status.Capacity.Memory().String())
        instance = instanceType{
            VCPU:             baseNode.Status.Capacity[apiv1.ResourceCPU],
            Memory:           baseNode.Status.Capacity[apiv1.ResourceMemory],
            GPU:              baseNode.Status.Capacity[gpu.ResourceNvidiaGPU],
            EphemeralStorage: baseNode.Status.Capacity[apiv1.ResourceEphemeralStorage],
            PodCount:         baseNode.Status.Capacity[apiv1.ResourcePods],
        }
    } else {
        klog.V(1).Infof("Generating node template only using nodeTemplate from MachineClass %s: template resources-> cpu: %s,memory: %s", machineClass.Name, nodeTemplateAttributes.Capacity.Cpu().String(), nodeTemplateAttributes.Capacity.Memory().String())
        instance = instanceType{
            VCPU:   nodeTemplateAttributes.Capacity[apiv1.ResourceCPU],
            Memory: nodeTemplateAttributes.Capacity[apiv1.ResourceMemory],
            GPU:    nodeTemplateAttributes.Capacity["gpu"],
            // Note: EphemeralStorage is not set here; this is the gap for scale from zero.
            // The number of pods per node depends on the CNI used and the kubelet maxPods config; the default is often 110.
            PodCount: resource.MustParse("110"),
        }
    }

We need to fix this part to consider ephemeral storage in the else branch as well. We also need to fix the validation of the NodeTemplate in the Gardener provider extension so that ephemeral storage can be specified explicitly without also specifying CPU, GPU, or memory.
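A minimal sketch of what the else branch could compute instead, assuming the surrounding package's instanceType and imports, and assuming the NodeTemplate capacity carries an ephemeral-storage entry (the helper name is hypothetical, not the actual implementation):

    // instanceTypeFromNodeTemplate derives the scale-from-zero instanceType
    // from the MachineClass NodeTemplate capacity, now including ephemeral
    // storage so pods requesting it can trigger a scale-up from zero.
    func instanceTypeFromNodeTemplate(capacity apiv1.ResourceList) instanceType {
        return instanceType{
            VCPU:             capacity[apiv1.ResourceCPU],
            Memory:           capacity[apiv1.ResourceMemory],
            GPU:              capacity["gpu"],
            EphemeralStorage: capacity[apiv1.ResourceEphemeralStorage],
            // Pods per node depend on the CNI and the kubelet maxPods config;
            // 110 is a common default.
            PodCount: resource.MustParse("110"),
        }
    }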