@adrienjt curious what your thoughts are, or whether there's a better way you see to support this use case.
Thank you for this PR.
I don't think that the NodeResourcesFit plugin fails at the PreFilter step, because it appears to only save data and cannot return any error.
However, the order of the Filter plugins matters, and it's possible that the NodeResourcesFit plugin runs before our plugin.
So we could reorder the plugins, giving our plugin a chance to send a candidate, and the NodeResourcesFit Filter plugin would eventually succeed when resources are reconciled on the virtual node (candidates survive scheduling cycles). We could also disable the NodeResourcesFit Filter step (not very useful as you noted), but I wouldn't want to disable the whole plugin, because we actually need the Score step to implement the LeastAllocated/MostAllocated bin-packing strategies.
Indeed, we need to keep a lot of the default plugins, so the scheduler config needs to be crafted carefully, ideally without repeating the default config, to reduce maintenance cost.
I think (not tested) that to make the proxy plugin Filter step run first, the config would look like this:
```yaml
profiles:
- schedulerName: admiralty-proxy
  plugins:
    multiPoint:
      enabled:
      - name: proxy
    filter:
      enabled:
      - name: proxy
```
And to disable the NodeResourcesFit Filter step, the config would look like this:
```yaml
profiles:
- schedulerName: admiralty-proxy
  plugins:
    multiPoint:
      enabled:
      - name: proxy
    filter:
      disabled:
      - name: NodeResourcesFit
```
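For context, either snippet would live inside a full KubeSchedulerConfiguration. A minimal, untested sketch of the second variant with the surrounding boilerplate added; the apiVersion is an assumption and depends on the scheduler version being run:

```yaml
# Untested sketch: keep the default multiPoint plugins (so the Score-based
# bin-packing strategies still run), enable the proxy plugin, and disable
# only the NodeResourcesFit Filter step.
apiVersion: kubescheduler.config.k8s.io/v1   # assumption: scheduler recent enough to serve v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: admiralty-proxy
  plugins:
    multiPoint:
      enabled:
      - name: proxy
    filter:
      disabled:
      - name: NodeResourcesFit
```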
This came up while testing a setup with a target cluster that autoscales a GPU pool from zero and a pod requesting GPU resources. The pod remains pending.
This is because the virtual node representing the target cluster doesn't have the GPU capacity populated (it can't fetch it anyway, because the target cluster has 0 nodes). The expectation was that the proxy scheduler would still create the pod chaperon, the autoscaler would kick in, and the rest of the binding process would happen, but that wasn't the case.
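For illustration, a pod along these lines reproduces the scenario; the elect annotation, image, and GPU resource name are assumptions about the test setup, not details taken from this PR:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
  annotations:
    # Assumption: the standard Admiralty annotation that routes the pod
    # through the proxy scheduler / pod chaperon machinery.
    multicluster.admiralty.io/elect: ""
spec:
  containers:
  - name: main
    image: registry.example.com/gpu-workload:latest  # placeholder image
    resources:
      limits:
        # The target cluster has 0 GPU nodes until its autoscaler scales up,
        # so the virtual node reports no nvidia.com/gpu allocatable.
        nvidia.com/gpu: 1
```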
On investigation, it seems that the Filter extension point in the proxy scheduler never gets to execute. The reason is that the scheduler configuration uses the multiPoint extension point which, per the docs, means that all of NodePorts, PodTopologySpread, VolumeBinding, NodeResourcesFit, and the other default plugins are enabled. The NodeResourcesFit plugin then rejects the pod at the PreFilter step.

There are a few things I attempted here:

1. Implementing a PreFilter extension point in the custom scheduler and returning nil, framework.Success (see the sketch after this list). This won't work because all PreFilter plugins must return success or the pod gets rejected, so if any of the default plugins fail, the pod is rejected. It seems we need to explicitly disable the failing plugin.
2. Instead of the multiPoint config, explicitly define the extension points we implement in the plugin (preFilter, filter, reserve, preBind, score) and enable the proxy plugin for them. That way, the default plugins won't be added. The only downside is that for every new extension point we add, we'd need to modify the config, but I don't foresee these changing often.
3. Disable NodeResourcesFit explicitly (and potentially other default plugins). This PR just does it for NodeResourcesFit to enable scaling from zero, but I think we could expand the list to cover all of them.

I'm indifferent between 2 and 3. To me, the purpose of the proxy scheduler is to handle the chaperoning and report the status back from the real schedulers in the target cluster, so 2 might also make sense here if we don't care about any of the default scheduler plugins.
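To make option 1 concrete, here is a minimal, untested sketch of a no-op PreFilter on the proxy plugin, assuming a scheduler-framework version where PreFilter returns (*PreFilterResult, *Status). As noted above, this alone doesn't help, because every enabled PreFilter plugin must succeed for the pod to proceed:

```go
package proxy

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Proxy is a stand-in for the proxy scheduler plugin.
type Proxy struct{}

var _ framework.PreFilterPlugin = &Proxy{}

func (p *Proxy) Name() string { return "proxy" }

// PreFilter always succeeds. Other enabled PreFilter plugins
// (e.g. NodeResourcesFit) can still reject the pod at this point,
// which is why this approach is insufficient on its own.
func (p *Proxy) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
	return nil, framework.NewStatus(framework.Success)
}

// PreFilterExtensions is required by the interface; we have no
// AddPod/RemovePod logic, so return nil.
func (p *Proxy) PreFilterExtensions() framework.PreFilterExtensions {
	return nil
}
```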
I believe this is also what https://github.com/admiraltyio/admiralty/issues/202 is seeing: since they're attempting to scale from 0, the virtual node never registers the GPU allocatable, and the PreFilter step fails on NodeResourcesFit in the proxy scheduler.