admiraltyio / admiralty

A system of Kubernetes controllers that intelligently schedules workloads across clusters.
https://admiralty.io
Apache License 2.0

fix: avoid early return when waiting for binding in the proxy plugin #208

Closed · marwanad closed this pull request 1 week ago

marwanad commented 3 months ago

candidateIsBound is polled in PreBind; returning an error would exit the polling loop after the first evaluation and reject the pod early. This change fixes the method to match the desired polling behaviour.
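For context, here's a minimal standalone sketch of the polling semantics involved, assuming a `wait.PollImmediate`-style loop (illustrative only, not the plugin's actual code): a condition that returns a non-nil error ends the wait after the first evaluation, whereas returning `false, nil` keeps polling until success or timeout.

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Returning an error from the condition aborts the wait on the first
	// evaluation: this is the "early return" the PR title refers to.
	attempts := 0
	err := wait.PollImmediate(100*time.Millisecond, time.Second, func() (bool, error) {
		attempts++
		return false, errors.New("candidate not bound")
	})
	fmt.Println(attempts, err) // 1 attempt, "candidate not bound"

	// Returning (false, nil) means "not yet": the loop keeps polling until
	// the condition is true or the timeout elapses.
	attempts = 0
	err = wait.PollImmediate(100*time.Millisecond, time.Second, func() (bool, error) {
		attempts++
		return attempts >= 3, nil // pretend the candidate binds on the third check
	})
	fmt.Println(attempts, err) // 3 attempts, <nil>
}
```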

adrienjt commented 3 months ago

Could you please describe a bug that this fixes?

AFAIU, if a pod has the PodScheduled condition, and it is not True, it means that binding failed, so we shouldn't wait.

Before that, the pod doesn't have the PodScheduled condition, so the first iteration returns false, nil already.
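In code terms, roughly (a sketch of the states described above, not the actual candidateIsBound implementation):

```go
package main

import (
	"errors"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// bindingState sketches the states discussed above; it is not the actual
// candidateIsBound code.
func bindingState(pod *corev1.Pod) (bool, error) {
	for _, cond := range pod.Status.Conditions {
		if cond.Type != corev1.PodScheduled {
			continue
		}
		if cond.Status == corev1.ConditionTrue {
			return true, nil // bound: stop waiting and proceed
		}
		// Condition present but not True: binding failed, so don't wait.
		return false, errors.New("binding failed")
	}
	// No PodScheduled condition yet: the first iteration already returns
	// (false, nil), i.e. keep polling.
	return false, nil
}

func main() {
	pod := &corev1.Pod{}           // no conditions yet
	fmt.Println(bindingState(pod)) // false <nil>

	pod.Status.Conditions = []corev1.PodCondition{{
		Type:   corev1.PodScheduled,
		Status: corev1.ConditionFalse,
	}}
	fmt.Println(bindingState(pod)) // false, binding failed
}
```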

marwanad commented 3 months ago

I'll try to get you a concrete repro, but roughly, I believe it was a race in the binding cycle with many pods in the scheduling queue. It's some variant of what I describe below. We ran into this case a long time ago, so the details are a bit hazy :) but I can get a repro in place before we accept this.

Assume a pod that was admitted (reserved and allowed) is marked unschedulable in that first cycle, i.e. there's a PodScheduled condition set to False.

  1. Some event, let's say autoscaling, means that this pod can be scheduled again.
  2. The plugin extension points in both schedulers all run fine till we get to PreBind.
  3. Now both binding cycles can potentially happen at the same time. If the PodScheduled condition wasn't set yet, that's fine. But in the case where the proxy's PreBind runs before the candidate's PreBind, it will see PodScheduled as False and fail its PreBind step, adding the pod to the list of unreserved pods, and it could be a few iterations before it is bound to the virtual node. The real delegate pod is still able to get scheduled, though (rough sketch of the intended behaviour below).
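Here's a rough sketch of the behaviour I'm after (simplified, not the actual candidateIsBound code): treat a stale PodScheduled=False as "not bound yet", so a later poll iteration can observe the bind once the candidate's binding cycle catches up.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// candidateIsBoundCondition is a simplified sketch of the polled condition,
// not the actual plugin code.
func candidateIsBoundCondition(pod *corev1.Pod) (bool, error) {
	for _, cond := range pod.Status.Conditions {
		if cond.Type != corev1.PodScheduled {
			continue
		}
		if cond.Status == corev1.ConditionTrue {
			return true, nil // candidate bound: PreBind can proceed
		}
		// A stale PodScheduled=False from the earlier unschedulable cycle
		// lands here. Returning an error would fail PreBind and unreserve
		// the pod; returning (false, nil) keeps polling so a later
		// iteration can see the condition flip to True.
		return false, nil
	}
	return false, nil // no condition yet: keep polling
}
```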
adrienjt commented 3 months ago

> a race in the binding cycle with many pods in the scheduling queue

What was the consequence?

> I can get a repro in place before we accept this

Yes, please, not only to better understand the problem, but to make it an e2e test case if possible.

I think that your proposal would make use cases without candidate schedulers (using the no-reservation annotation) take too long, because they'd poll for the full one-minute timeout for each virtual node. Is there a better way?

marwanad commented 1 month ago

@adrienjt I finally spent some time trying to repro the slowness we saw. It turns out the root cause is that a pod chaperon in a target cluster ends up with no Conditions set, so the polling loop in the Filter step runs for the full wait duration (30 seconds). It was particularly bad in our case because the blocking pod was of higher priority than the rest of the pods in the queue, so the scheduling cycle was super slow.

https://github.com/admiraltyio/admiralty/blob/b704bb7625090af0aacdcf4c1e444d72ce960b8d/pkg/scheduler_plugins/proxy/plugin.go#L125

https://github.com/admiraltyio/admiralty/blob/b704bb7625090af0aacdcf4c1e444d72ce960b8d/pkg/scheduler_plugins/proxy/plugin.go#L150
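As a standalone illustration of what that amounts to (my stand-in code, not the linked plugin code, with the wait shortened from 30s to 3s for the demo): with an empty Conditions slice there is nothing to decide on, so every iteration returns `false, nil` and the only way out is the timeout.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Stand-in for a pod chaperon whose controller never set any conditions.
	var conditions []corev1.PodCondition

	start := time.Now()
	err := wait.PollImmediate(200*time.Millisecond, 3*time.Second, func() (bool, error) {
		for _, cond := range conditions {
			if cond.Type == corev1.PodScheduled {
				return cond.Status == corev1.ConditionTrue, nil
			}
		}
		return false, nil // nothing to decide on yet: keep waiting
	})
	// Prints roughly "3s timed out waiting for the condition".
	fmt.Println(time.Since(start).Round(time.Second), err)
}
```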

This can happen when the chaperon controller fails to create the pods. For example, if the service account is missing in the target cluster, no pods will be created and the reconcile loop will terminate early before setting the status. In our case, we intentionally didn't create the pod dependencies in that target cluster because the workload was never intended to run on it (we can totally avoid this from happening in the first place via the proxy-pod-scheduling-constraints annotation to limit where the chaperon lands), but I think we should still have a terminal status for failed/missing pod creations on the chaperon to avoid such cases.

https://github.com/admiraltyio/admiralty/blob/b704bb7625090af0aacdcf4c1e444d72ce960b8d/pkg/controllers/chaperon/controller.go#L118-L123
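Roughly what I have in mind, with hypothetical names (the real PodChaperon status and condition types may differ):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// podChaperonStatus is a hypothetical stand-in for the PodChaperon status,
// which mirrors a pod status; the names here are illustrative only.
type podChaperonStatus struct {
	Conditions []corev1.PodCondition
}

// markPodCreationFailed records a terminal PodScheduled=False condition when
// the delegate pod can't be created (e.g. missing service account), so the
// proxy plugin could stop waiting right away instead of polling to the full
// timeout.
func markPodCreationFailed(status *podChaperonStatus, createErr error) {
	status.Conditions = append(status.Conditions, corev1.PodCondition{
		Type:    corev1.PodScheduled,
		Status:  corev1.ConditionFalse,
		Reason:  "PodCreationFailed", // illustrative reason string
		Message: createErr.Error(),
	})
}
```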

I think we can close this, and I'll follow up with a proper fix in the chaperon controller if that sounds okay.