Closed marwanad closed 1 week ago
Could you please describe the bug that this fixes?
AFAIU, if a pod has the PodScheduled condition and it is not True, it means that binding failed, so we shouldn't wait.
Before that, the pod doesn't have the PodScheduled condition, so the first iteration already returns false, nil.
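The check being discussed can be sketched roughly as follows. This is a minimal illustration with simplified local types standing in for the real k8s.io/api/core/v1 types; the function name checkScheduled is hypothetical, not the actual method in the codebase.

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types (assumption: the real code
// uses k8s.io/api/core/v1 Pod, PodCondition, and the PodScheduled type).
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type PodCondition struct {
	Type   string
	Status ConditionStatus
}

type PodStatus struct {
	Conditions []PodCondition
}

// checkScheduled mirrors the reading above: no PodScheduled condition yet
// means "keep waiting" (false, nil); PodScheduled=False is read as a failed
// binding (false, error); PodScheduled=True means the pod is bound.
func checkScheduled(status PodStatus) (bool, error) {
	for _, c := range status.Conditions {
		if c.Type != "PodScheduled" {
			continue
		}
		if c.Status == ConditionTrue {
			return true, nil
		}
		return false, fmt.Errorf("binding failed: PodScheduled is %s", c.Status)
	}
	// Condition not set yet: the first iterations return false, nil.
	return false, nil
}

func main() {
	fmt.Println(checkScheduled(PodStatus{}))
}
```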
I'll try and get you a concrete repro, but roughly I believe it was a race in the binding cycle with many pods in the scheduling queue. It's some variant of what I describe below. We ran into this case a long time ago, so the details are a bit hazy :) but I can get a repro in place before we accept this.
Assume a pod that was admitted (reserved and allowed) is marked unschedulable in that first cycle, i.e. there's a PodScheduled condition set to False. In PreBind, if the PodScheduled condition wasn't set yet, that's fine. But in the case where the proxy's PreBind runs before the candidate's PreBind, it will see PodScheduled as False, fail its PreBind step, and add the pod to the list of unreserved, and it could be a few iterations before it is bound to the virtual node. The real delegate pod is still able to get scheduled, though.
What was the consequence?
I can get a repro in place before we accept this
Yes, please, not only to better understand the problem, but to make it an e2e test case if possible.
I think that your proposal would make use cases without candidate schedulers (using the no-reservation annotation) take too long, because they'd poll for the full one minute for each virtual node. Is there a better way?
@adrienjt finally spent some time trying to repro the slowness we saw. It turns out the root cause is that a pod chaperon in a target cluster ends up with no Conditions set, so the polling loop in the Filter step runs for the full wait duration (30 seconds). It was particularly bad in our case because the blocking pod was of higher priority than the rest of the pods in the queue, so the scheduling cycle was very slow.
This can happen when the chaperon controller fails to create the pods. For example, if the service account is missing in the target cluster, no pods will be created and the reconcile loop will terminate early before setting the status. In our case, we intentionally didn't create the pod dependencies in that target cluster because the workload was never intended to run on it (we can avoid this from happening in the first place via the proxy-pod-scheduling-constraints annotation, to limit where the chaperon lands), but I think we should still have a terminal status for failed/missing pod creations on the chaperon to avoid such cases.
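The proposed terminal status could look roughly like this. All names here (the PodFailedCreate condition type, markPodCreateFailed, hasTerminalFailure) are hypothetical; the real PodChaperon CRD has its own status schema, and this sketch only shows the idea of recording a terminal condition so the scheduler side can fail fast instead of polling.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified chaperon status types (assumption: stand-ins for the real
// PodChaperon status struct).
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

type ChaperonStatus struct {
	Conditions []Condition
}

// markPodCreateFailed records a terminal condition instead of leaving
// Conditions empty, so a Filter step could reject immediately rather than
// polling for the full wait duration.
func markPodCreateFailed(status *ChaperonStatus, reason, message string) {
	status.Conditions = append(status.Conditions, Condition{
		Type:               "PodFailedCreate", // hypothetical condition type
		Status:             "True",
		Reason:             reason,
		Message:            message,
		LastTransitionTime: time.Now(),
	})
}

// hasTerminalFailure is what the scheduler side could check before waiting.
func hasTerminalFailure(status ChaperonStatus) bool {
	for _, c := range status.Conditions {
		if c.Type == "PodFailedCreate" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	var status ChaperonStatus
	markPodCreateFailed(&status, "MissingServiceAccount",
		"service account not found in target cluster")
	fmt.Println(hasTerminalFailure(status))
}
```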
I think we can close this and I'll follow-up with a proper fix in the chaperon controller if that sounds okay.
candidateIsBound is polled in PreBind; returning an error would exit the polling loop after the first evaluation and reject the pod early. This fixes the method to match the desired polling behaviour.