buildkite / agent-stack-k8s

Spin up an autoscaling stack of Buildkite Agents on Kubernetes
MIT License

Controller does not Delete Pending Jobs from K8S if they're cancelled on Buildkite — flooding the cluster #392

Open artem-zinnatullin opened 4 days ago

artem-zinnatullin commented 4 days ago

While testing a fix for #382 with the controller:0.15.0-14-g68932d3 build, I found that buildkite/agent-stack-k8s apparently does not have any (?) logic to delete Pending Jobs/Pods for cancelled jobs/builds!

We rely heavily on the Cancel Intermediate Builds setting in Buildkite (see docs), which cancels in-flight builds on the same branch when a new commit is pushed to a PR.


Current behavior: the buildkite/agent-stack-k8s controller keeps Pending Jobs/Pods in K8S even after a Buildkite job/build is cancelled, flooding the K8S cluster with resource allocations. It then actually starts those jobs and consumes CPU time, leading to overspending $$$.


Expected behavior:

The buildkite/agent-stack-k8s controller should send a Job/Pod "Delete" request to K8S for a cancelled Buildkite Job that is not in the Running state on K8S.

DrJosh9000 commented 4 days ago

Thanks for raising this @artem-zinnatullin, it's a good point.

artem-zinnatullin commented 3 days ago

Would it be possible to modify the logic around StaleCh <-chan added in #389 so that:

  1. Controller reacts to a Buildkite Job becoming stale
  2. Checks if there is a matching scheduled K8S Job in Pending status
  3. Sends "Delete" request to K8S for the matching Pending Job

wdyt @DrJosh9000?

As of right now, this issue seems to be the last missing bit before we can try to swap https://github.com/EmbarkStudios/k8s-buildkite-plugin for the official buildkite/agent-stack-k8s controller in production CI! 😅

DrJosh9000 commented 3 days ago

Thanks @artem-zinnatullin, that could probably be made to work. But I would like to dedicate a solid block of time to think about it - if we tackle this, it's likely to land in v0.17.0, since I'm planning on getting v0.16.0 out the door today.

artem-zinnatullin commented 3 days ago

👍 ❤️