@adrienjt curious what your thoughts are, or whether there's a better way you see to support this use case.
Thank you for this PR.
I don't think that the NodeResourcesFit plugin fails at the PreFilter step, because it appears to only save data and cannot return any error.
However, the order of the Filter plugins matters, and it's possible that the NodeResourcesFit plugin runs before our plugin.
So we could reorder the plugins, giving our plugin a chance to send a candidate, and the NodeResourcesFit Filter plugin would eventually succeed when resources are reconciled on the virtual node (candidates survive scheduling cycles). We could also disable the NodeResourcesFit Filter step (not very useful as you noted), but I wouldn't want to disable the whole plugin, because we actually need the Score step to implement the LeastAllocated/MostAllocated bin-packing strategies.
Indeed, we need to keep a lot of the default plugins, so the scheduler config needs to be crafted carefully, ideally without repeating the default config, to reduce maintenance cost.
I think (not tested) that to make the proxy plugin Filter step run first, the config would look like this:
```yaml
profiles:
- schedulerName: admiralty-proxy
  plugins:
    multiPoint:
      enabled:
      - name: proxy
    filter:
      enabled:
      - name: proxy
```
And to disable the NodeResourcesFit Filter step, the config would look like this:
```yaml
profiles:
- schedulerName: admiralty-proxy
  plugins:
    multiPoint:
      enabled:
      - name: proxy
    filter:
      disabled:
      - name: NodeResourcesFit
```
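For context, either snippet would live inside a full KubeSchedulerConfiguration. A minimal, untested sketch of the second variant with the surrounding boilerplate added; the apiVersion is an assumption and depends on the scheduler version being run:

```yaml
# Untested sketch: keep the default multiPoint plugins (so the Score-based
# bin-packing strategies still run), enable the proxy plugin, and disable
# only the NodeResourcesFit Filter step.
apiVersion: kubescheduler.config.k8s.io/v1   # assumption: scheduler recent enough to serve v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: admiralty-proxy
  plugins:
    multiPoint:
      enabled:
      - name: proxy
    filter:
      disabled:
      - name: NodeResourcesFit
```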
This came up while testing a setup with a target cluster that autoscales a GPU pool from zero and a pod requesting GPU resources. The pod remains pending.
This is because the virtual node representing the target cluster doesn't have the GPU capacity populated (it can't fetch it anyway, because the target cluster has 0 nodes). The expectation was that the proxy scheduler would still create the pod chaperon, the autoscaler would kick in, and the rest of the binding process would happen, but that wasn't the case.
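For illustration, a pod along these lines reproduces the scenario; the elect annotation, image, and GPU resource name are assumptions about the test setup, not details taken from this PR:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
  annotations:
    # Assumption: the standard Admiralty annotation that routes the pod
    # through the proxy scheduler / pod chaperon machinery.
    multicluster.admiralty.io/elect: ""
spec:
  containers:
  - name: main
    image: registry.example.com/gpu-workload:latest  # placeholder image
    resources:
      limits:
        # The target cluster has 0 GPU nodes until its autoscaler scales up,
        # so the virtual node reports no nvidia.com/gpu allocatable.
        nvidia.com/gpu: 1
```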
On investigation, it seems that the Filter extension point in the proxy scheduler never gets to execute. The reason is that the scheduler configuration uses the multiPoint extension point which, per the docs, means that all of NodePorts, PodTopologySpread, VolumeBinding, NodeResourcesFit, and the other default plugins are enabled. The NodeResourcesFit plugin then rejects the pod at the PreFilter step.

There are a few things I attempted here:

1. Implementing a PreFilter extension point in the custom scheduler and returning nil, framework.Success (see the sketch after this list). This won't work because all PreFilter plugins must return success or the pod gets rejected, so if any of the default plugins fail, the pod is rejected. It seems we need to explicitly disable the failing plugin.
2. Instead of the multiPoint config, explicitly define the extension points we implement in the plugin (preFilter, filter, reserve, preBind, score) and enable the proxy plugin for them. That way, the default plugins won't be added. The only downside is that for every new extension point we add, we'd need to modify the config, but I don't foresee these changing often.
3. Disable NodeResourcesFit explicitly (and potentially other default plugins). This PR just does it for NodeResourcesFit to enable scaling from zero, but I think we could expand the list to cover all of them.

I'm indifferent between 2 and 3. To me, the purpose of the proxy scheduler is to handle the chaperoning and report the status back from the real schedulers in the target cluster, so 2 might also make sense here if we don't care about any of the default scheduler plugins.
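To make option 1 concrete, here is a minimal, untested sketch of a no-op PreFilter on the proxy plugin, assuming a scheduler-framework version where PreFilter returns (*PreFilterResult, *Status). As noted above, this alone doesn't help, because every enabled PreFilter plugin must succeed for the pod to proceed:

```go
package proxy

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Proxy is a stand-in for the proxy scheduler plugin.
type Proxy struct{}

var _ framework.PreFilterPlugin = &Proxy{}

func (p *Proxy) Name() string { return "proxy" }

// PreFilter always succeeds. Other enabled PreFilter plugins
// (e.g. NodeResourcesFit) can still reject the pod at this point,
// which is why this approach is insufficient on its own.
func (p *Proxy) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
	return nil, framework.NewStatus(framework.Success)
}

// PreFilterExtensions is required by the interface; we have no
// AddPod/RemovePod logic, so return nil.
func (p *Proxy) PreFilterExtensions() framework.PreFilterExtensions {
	return nil
}
```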
I believe this is also what https://github.com/admiraltyio/admiralty/issues/202 is seeing: since they're attempting to scale from 0, the virtual node never registers the GPU allocatable, and the PreFilter step fails on NodeResourcesFit in the proxy scheduler.