Closed mikekap closed 3 years ago
Interesting... I've not used a lot of kubernetes myself, but I've been meaning to have a play with it! Have you had any thoughts on how you'd see this working? I'd love to throw around some ideas here and see what we come up with :)
The approach I had in mind would be something like a two process model:
docker run
so this would be a stripped down agent that runs a specific step and then exits. This process would include all the logic for running a step - sending back results to buildkite or running any agent plugins needed for the step. After finishing the step, the slave would exit (always successfully, so kube doesn't reschedule the job).
Getting this to work isn't terribly hard. The buildkite agent just needs a mode where it runs a specific job id and exits. The master is a bit more work, but should be pretty easy to do since it just listens for incoming jobs and schedules them - it doesn't even have to keep track of completion status, since Kube does that.
If those two are set up, just running something like https://github.com/openai/kubernetes-ec2-autoscaler will auto-scale instances as jobs come in, which would be pretty awesome :)
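To make the "always exit successfully" idea above concrete, here is a minimal sketch of what the one-shot process could do. It is not how the Buildkite agent is implemented: `run_step` and the `report` callback are hypothetical names, and `report` only stands in for whatever would send the result back to Buildkite.

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_step(command: Sequence[str], report: Callable[[int], None]) -> int:
    """Run one step, report its real exit status, and return 0 regardless."""
    result = subprocess.run(command)
    # Hypothetical callback: stands in for sending the result back to Buildkite.
    report(result.returncode)
    # Always return success, so that Kubernetes sees a completed pod and
    # does not retry the job even when the step itself failed.
    return 0


def print_status(status: int) -> None:
    """Example reporter that just prints the step's exit status."""
    print(f"step exited with status {status}")
```

A caller would pass the value returned by `run_step` to `sys.exit`, so the pod's exit code is 0 while the step's real status travels through `report`. Setting `backoffLimit: 0` and `restartPolicy: Never` on the Kubernetes Job is another way to avoid retries without hiding the exit code.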
The buildkite agent just needs a mode where it runs a specific job id and exits.
We implemented --disconnect-after-job a while back which will disconnect after running a single job, which gets some of the way there.
We're hoping to get some time to focus on a Kubernetes stack, which will include some of this stuff. Thanks for your patience!
Any update on progress for this? Is it on the roadmap at all?
No progress yet I'm afraid, have you seen https://github.com/webflow/kubekite?
It would be really nice if there were an option to combine with --disconnect-after-job providing a way to target a specific job ID. This way when launching a buildkite-agent as a job in kubernetes, you would know specifically what job it is running.
@lox is there an API that can be called from the agent that can either coerce ping to target a specific job? Mostly getting the idea by looking at both of these, which seem like a good entry point for that.
https://github.com/buildkite/agent/blob/ce69197180f2ff5b17a64259f46731d19a98c9ca/agent/agent_worker.go#L49
https://github.com/buildkite/agent/blob/ce69197180f2ff5b17a64259f46731d19a98c9ca/agent/agent_worker.go#L93
Thoughts?
@lox what is the priority of this? While I did get kubekite to work, I had to do some manual work on it, and it does feel like this should be supported officially.
Any update on this?
It would be really nice if there were an option to combine with --disconnect-after-job providing a way to target a specific job ID. This way when launching a buildkite-agent as a job in kubernetes, you would know specifically what job it is running.
Version 3.17.0 of the agent (released December 2019) added an --acquire-job flag that does exactly this.
$ buildkite-agent start --help | grep acquire
--acquire-job value Start this agent and only run the specified job, disconnecting after it's finished [$BUILDKITE_AGENT_ACQUIRE_JOB]
We're not using it in any official Buildkite tools yet. However, anecdotally we know of a few folks using it to create one-shot agents on Fargate and Kubernetes. Generally there's an operator running somewhere that detects new jobs and creates a new agent pod to run the job and then exit.
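As a rough illustration of the operator pattern described above, the sketch below builds a Kubernetes Job manifest for a one-shot agent that acquires a single job. Only the --acquire-job flag comes from this thread. The function name, the image tag `buildkite/agent:3`, the secret name `buildkite-agent-token` with key `token`, and the naming scheme are assumptions for illustration, and the job ID is assumed to be a valid Kubernetes object name once lowercased.

```python
from typing import Any, Dict


def agent_job_manifest(
    job_id: str,
    image: str = "buildkite/agent:3",              # assumed image tag
    token_secret: str = "buildkite-agent-token",   # assumed Secret name
) -> Dict[str, Any]:
    """Build a Kubernetes Job that runs one agent for one Buildkite job."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        # Assumes the lowercased job ID is a valid Kubernetes object name.
        "metadata": {"name": f"buildkite-job-{job_id.lower()}"},
        "spec": {
            # Do not retry: the Buildkite job itself records pass or fail.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "agent",
                            "image": image,
                            "command": ["buildkite-agent"],
                            # Run only the given job, then disconnect and exit.
                            "args": ["start", "--acquire-job", job_id],
                            "env": [
                                {
                                    "name": "BUILDKITE_AGENT_TOKEN",
                                    "valueFrom": {
                                        "secretKeyRef": {
                                            "name": token_secret,
                                            "key": "token",
                                        }
                                    },
                                }
                            ],
                        }
                    ],
                }
            },
        },
    }
```

An operator could serialize this with `json.dumps` and submit it through the Kubernetes API or `kubectl apply -f -`. How the operator discovers new job IDs is left out here, since the thread does not specify it.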
If you're interested in developing the ideas further, there's some occasional discussion in the #kubernetes channel on the community slack. We'd love to build and release something official for Kubernetes, but it's not on our near-term roadmap for now (too much to do, not enough time).
I think I’m going to close this. With the addition of --acquire-job, the agent itself supports this nicely, and anything else would be agent orchestration layered on top, like the Elastic CI Stack for AWS, buildkite/helm etc 😄 🎉
It would be very convenient to run steps as Kubernetes Jobs. This would enable autoscaling (of machines) based on Kubernetes resource utilization, as well as sizing each step's resources for maximal cluster utilization. As it is now, you can size the agent itself, but you have to give it enough resources to run your largest step.
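A small worked example of the sizing argument, using made-up numbers: the step names and memory figures below are hypothetical, and the comparison assumes each step runs concurrently on its own agent.

```python
from typing import Dict

# Hypothetical memory needs per step, in MiB (not from any real pipeline).
STEP_MEMORY_MIB: Dict[str, int] = {
    "lint": 256,
    "unit-tests": 1024,
    "integration-tests": 4096,
}


def reserved_with_fixed_agents(steps: Dict[str, int]) -> int:
    """Memory reserved when every agent is sized for the largest step."""
    return max(steps.values()) * len(steps)


def reserved_with_per_step_jobs(steps: Dict[str, int]) -> int:
    """Memory reserved when each step requests only what it needs."""
    return sum(steps.values())


fixed = reserved_with_fixed_agents(STEP_MEMORY_MIB)
per_step = reserved_with_per_step_jobs(STEP_MEMORY_MIB)
print(f"fixed-size agents: {fixed} MiB, per-step jobs: {per_step} MiB")
```

With these numbers, fixed-size agents reserve 12288 MiB while per-step requests reserve 5376 MiB, which is the gap the scheduler could use to pack more work onto the same nodes.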