Support running steps as Kubernetes Jobs

mikekap commented 7 years ago

It would be very convenient to run steps as Kubernetes Jobs. This would enable autoscaling of machines based on Kubernetes resource utilization, as well as sizing each step's resources for maximal cluster utilization. As it is now, you can size the agent itself, but you have to give it enough resources to run your largest step.

keithpitt commented 7 years ago

Interesting... I've not used a lot of kubernetes myself, but I've been meaning to have a play with it! Have you had any thoughts on how you'd see this working? I'd love to throw around some ideas here and see what we come up with :)

mikekap commented 7 years ago

The approach I had in mind would be something like a two-process model:

A master process that watches the buildkite API and creates Job objects. The master can run inside the kube cluster as a constant deployment, but it would only use a small amount of resources. It should also be able to use some config options from the step to configure the kubernetes job as necessary with resource limits and (maybe) environment variables & secrets. For failure recovery, it could use Kube annotations to get the current scheduled set of Jobs and job ids that are active.
A slave process that gets launched by the master as a Kube Job. A Job is essentially just a docker run so this would be a stripped down agent that runs a specific step and then exits. This process would include all the logic for running a step - sending back results to buildkite or running any agent plugins needed for the step. After finishing the step, the slave would exit (always successfully, so kube doesn't reschedule the job).

Getting this to work isn't terribly hard. The buildkite agent just needs a mode where it runs a specific job id and exits. The master is a bit more work, but should be pretty easy to do since it just listens for incoming jobs and schedules them - it doesn't even have to keep track of completion status, since Kube does that.
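The failure-recovery idea above (using Kube annotations to recover the set of already-scheduled jobs) could be sketched roughly as follows. This is only an illustration of the reconciliation step, not how any existing tool does it: the annotation key and both function names are hypothetical, and the Kubernetes Jobs are modelled as plain dicts rather than fetched through a real client.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical annotation key; any stable key owned by the master would do.
JOB_ID_ANNOTATION = "example.com/buildkite-job-id"


def scheduled_job_ids(kube_jobs: Iterable[Dict]) -> Set[str]:
    """Collect the Buildkite job ids already recorded on existing Kubernetes Jobs."""
    ids = set()
    for job in kube_jobs:
        annotations = job.get("metadata", {}).get("annotations") or {}
        job_id = annotations.get(JOB_ID_ANNOTATION)
        if job_id:
            ids.add(job_id)
    return ids


def jobs_to_create(pending_ids: Iterable[str], kube_jobs: Iterable[Dict]) -> List[str]:
    """Return pending Buildkite job ids that have no Kubernetes Job yet.

    Order is preserved and duplicates are dropped, so a restarted master
    can call this repeatedly without scheduling the same job twice.
    """
    already = scheduled_job_ids(kube_jobs)
    seen = set()
    result = []
    for job_id in pending_ids:
        if job_id not in already and job_id not in seen:
            seen.add(job_id)
            result.append(job_id)
    return result
```

Because the state lives in the annotations, the master itself can stay stateless: after a crash it lists the Jobs in the cluster, recomputes the difference, and carries on.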

If those two are set up, just running something like https://github.com/openai/kubernetes-ec2-autoscaler will auto-scale instances as jobs come in, which would be pretty awesome :)

lox commented 6 years ago

> The buildkite agent just needs a mode where it runs a specific job id and exits.

We implemented --disconnect-after-job a while back, which disconnects the agent after it runs a single job. That gets some of the way there.

We're hoping to get some time to focus on a Kubernetes stack, which will include some of this stuff. Thanks for your patience!

regner commented 5 years ago

Any update on progress here? Is it on the roadmap at all?

lox commented 5 years ago

No progress yet, I'm afraid. Have you seen https://github.com/webflow/kubekite?

nullren commented 5 years ago

It would be really nice if there were an option, combinable with --disconnect-after-job, for targeting a specific job ID. That way, when launching a buildkite-agent as a Job in Kubernetes, you would know exactly which job it is running.

@lox, is there an API that can be called from the agent to coerce ping into targeting a specific job? I mostly got the idea from looking at both of these, which seem like a good entry point for that: https://github.com/buildkite/agent/blob/ce69197180f2ff5b17a64259f46731d19a98c9ca/agent/agent_worker.go#L49 https://github.com/buildkite/agent/blob/ce69197180f2ff5b17a64259f46731d19a98c9ca/agent/agent_worker.go#L93

Thoughts?

Globegitter commented 5 years ago

@lox what is the priority of this? While I did get kubekite to work, I had to do some manual work on it, and it feels like this should be supported officially.

prestonvanloon commented 4 years ago

Any update on this?

yob commented 4 years ago

> It would be really nice if there were an option to combine with --disconnect-after-job providing a way to target a specific job ID. This way when launching a buildkite-agent as a job in kubernetes, you would know specifically what job it is running.

Version 3.17.0 of the agent (released December 2019) added an --acquire-job flag that does exactly this.

$ buildkite-agent start --help | grep acquire
   --acquire-job value                    Start this agent and only run the specified job, disconnecting after it's finished [$BUILDKITE_AGENT_ACQUIRE_JOB]

We're not using it in any official Buildkite tools yet. However, anecdotally, we know of a few folks using it to create one-shot agents on Fargate and Kubernetes. Generally there's an operator running somewhere that detects new jobs and creates a new agent pod for each one; the pod runs the job and then exits.
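As a rough illustration of what such an operator might submit for each job, here is a minimal sketch that builds a Kubernetes Job manifest as a plain dict. It relies on the BUILDKITE_AGENT_ACQUIRE_JOB environment variable shown in the help output above. Everything else is an assumption made for the example: the function name, the image tag, the secret name and key, and the idea that the image's default command starts the agent.

```python
import json
from typing import Dict, Optional


def agent_job_manifest(
    job_id: str,
    image: str = "buildkite/agent:3",             # assumed image tag
    token_secret: str = "buildkite-agent-token",  # hypothetical Secret name
    cpu: Optional[str] = None,
    memory: Optional[str] = None,
) -> Dict:
    """Build a batch/v1 Job that runs one agent for one Buildkite job."""
    requests = {}
    if cpu:
        requests["cpu"] = cpu
    if memory:
        requests["memory"] = memory

    container = {
        "name": "agent",
        "image": image,
        "env": [
            # Documented in `buildkite-agent start --help` (see above).
            {"name": "BUILDKITE_AGENT_ACQUIRE_JOB", "value": job_id},
            {
                "name": "BUILDKITE_AGENT_TOKEN",
                "valueFrom": {
                    # Hypothetical Secret key holding the agent token.
                    "secretKeyRef": {"name": token_secret, "key": "token"}
                },
            },
        ],
    }
    if requests:
        # Per-step sizing: each job asks only for what it needs.
        container["resources"] = {"requests": requests}

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        # Kubernetes object names must be lowercase.
        "metadata": {"name": f"buildkite-{job_id.lower()}"},
        "spec": {
            # Never retry: Buildkite, not Kubernetes, decides on reruns.
            "backoffLimit": 0,
            "template": {
                "spec": {"restartPolicy": "Never", "containers": [container]}
            },
        },
    }


if __name__ == "__main__":
    # Placeholder job id, for illustration only.
    print(json.dumps(agent_job_manifest("example-job-id", cpu="500m"), indent=2))
```

The resulting JSON could be piped to `kubectl apply -f -` or sent through a Kubernetes client. Setting `backoffLimit: 0` together with `restartPolicy: Never` is one way to keep Kubernetes from rescheduling a failed build on its own.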

If you're interested in developing the ideas further, there's some occasional discussion in the #kubernetes channel on the community slack. We'd love to build and release something official for Kubernetes, but it's not on our near-term roadmap for now (too much to do, not enough time).

keithduncan commented 3 years ago

I think I’m going to close this with the addition of --acquire-job. The agent itself supports this nicely, and anything else would be agent orchestration layered on top, like the Elastic CI Stack for AWS, buildkite/helm, etc. 😄 🎉

buildkite / agent

Support running steps as Kubernetes Jobs #420