buildkite / agent-stack-k8s

Spin up an autoscaling stack of Buildkite Agents on Kubernetes

Controller stops accepting jobs from the cluster queue #302

Open aressem opened 5 months ago

aressem commented 5 months ago

We have agent-stack-k8s up and running, and it works fine for a while. However, it suddenly stops accepting new jobs, and the last thing it outputs (with debug logging turned on) is:

2024-04-08T11:38:23.100Z    DEBUG   limiter scheduler/limiter.go:77 max-in-flight reached   {"in-flight": 25}

We currently only have a single pipeline, single cluster and single queue. When this happens there are no jobs or pods named buildkite-${UUID} in the k8s cluster. Executing kubectl -n buildkite rollout restart deployment agent-stack-k8s makes the controller happy again and it starts jobs from the queue.

I suspect that there is something that should decrement the in-flight number, but fails to do so. We are now running a test where this number is set to 0 to see if that works around the problem.
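
To illustrate the suspected failure mode, here is a minimal sketch in Go. It is not the code in scheduler/limiter.go: the Limiter type and its TryAcquire, Release, and Reconcile methods are hypothetical names invented for this example, and treating a limit of 0 as "unlimited" is an assumption based on the workaround described above. It shows how a single missed release can leave a limiter stuck at its maximum with nothing actually running, and how comparing the tracked set against the jobs that really exist would unstick it.

```go
package main

import (
	"fmt"
	"sync"
)

// Limiter is a hypothetical in-flight limiter used only for illustration.
// It tracks job IDs rather than a bare counter so stale entries can be found.
type Limiter struct {
	mu       sync.Mutex
	max      int // assumption: 0 means "no limit"
	inFlight map[string]struct{}
}

// NewLimiter returns a limiter that admits at most max jobs at a time.
func NewLimiter(max int) *Limiter {
	return &Limiter{max: max, inFlight: make(map[string]struct{})}
}

// TryAcquire reports whether the job may start, and records it if so.
func (l *Limiter) TryAcquire(jobID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, ok := l.inFlight[jobID]; ok {
		return true // already counted; do not count it twice
	}
	if l.max > 0 && len(l.inFlight) >= l.max {
		return false // the "max-in-flight reached" case
	}
	l.inFlight[jobID] = struct{}{}
	return true
}

// Release must run when a job finishes. If this call is ever skipped
// (for example, a completion event is missed), the slot is never freed.
func (l *Limiter) Release(jobID string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.inFlight, jobID)
}

// Reconcile drops tracked jobs that are absent from live (the set of jobs
// that actually exist in the cluster) and returns how many were dropped.
func (l *Limiter) Reconcile(live map[string]bool) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	dropped := 0
	for id := range l.inFlight {
		if !live[id] {
			delete(l.inFlight, id)
			dropped++
		}
	}
	return dropped
}

func main() {
	l := NewLimiter(1)
	fmt.Println(l.TryAcquire("job-a")) // admitted
	// job-a finishes, but Release("job-a") is never called.
	fmt.Println(l.TryAcquire("job-b")) // refused although nothing is running
	fmt.Println(l.Reconcile(map[string]bool{}))
	fmt.Println(l.TryAcquire("job-b")) // admitted after reconciliation
}
```

In this sketch, restarting the process has the same effect as Reconcile with an empty set, which would be consistent with the rollout restart making the controller pick up jobs again.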

DrJosh9000 commented 4 months ago

Hi @aressem, did you discover anything with your tests where the number is set to 0?

aressem commented 4 months ago

@DrJosh9000, the pipeline works as expected with max-in-flight set to 0. I don't know what the in-flight count is now, but I suspect it is steadily increasing :)

artem-zinnatullin commented 3 months ago

Same issue when testing with max-in-flight: 1 on v0.11.0. At some point the controller stops taking new jobs, even though there are no jobs or pods running in the namespace besides the controller itself.

2024-05-21T21:31:57.923Z    DEBUG   limiter scheduler/limiter.go:79 max-in-flight reached   {"in-flight": 1}