CliMA / slurm-buildkite

Run buildkite jobs on a slurm cluster

make multi-build steps work with wait #3

Closed jakebolewski closed 4 years ago

jakebolewski commented 4 years ago

Pipelines with multiple steps currently do not work. I think we have to assign one Slurm job per step UUID instead of per job ID.

jakebolewski commented 4 years ago

This is not a problem with multiple builds, just with multiple jobs when one of them has the waiter type.

jakebolewski commented 4 years ago

OK, the wait issue is just a proxy for any mildly complex DAG with sync dependencies.

If we assign a jobid to a non-script task (wait, etc.), the build runner will fail. Meanwhile, another agent can be assigned to a dependent task that sits behind the wait/sync point, and that will also fail: the task is visible but cannot be scheduled until the commands before the wait have finished successfully.

This seems to be a limitation of the way we assign one build runner to one task ID. A workaround would be to assign all jobs before the wait (these are visible in the build object), poll for success before the sync point, then issue all jobs after the sync point.

Given the way the API works, the cron job can remain stateless, at the cost of some added latency.
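A minimal sketch of that stateless pass, in Python for illustration. It assumes each entry in the build's job list is a dict with `id`, `type` (`"script"` or `"waiter"`), and, for script jobs, a `state` such as `"scheduled"` or `"passed"`. Those field names and values are assumptions about the build object, and `schedulable_jobs` is a hypothetical helper, not code from this repo.

```python
def schedulable_jobs(jobs):
    """Return ids of script jobs that can be handed to Slurm on this pass.

    `jobs` is the ordered job list of one build (assumed shape, see above).
    Nothing is remembered between calls: every cron tick re-derives the
    answer from the build object alone.
    """
    ready = []
    upstream_passed = True  # have all script jobs seen so far passed?
    for job in jobs:
        kind = job.get("type")
        if kind == "waiter":
            # Sync point: do not look past it until everything before it passed.
            if not upstream_passed:
                break
            continue
        if kind != "script":
            # Never give a Slurm job to a non-script task.
            continue
        state = job.get("state")
        if state != "passed":
            upstream_passed = False
        if state == "scheduled":
            ready.append(job["id"])
    return ready
```

Each tick only submits what this returns. Jobs behind a wait are picked up on a later tick once everything in front of it has passed, which is where the added latency comes from.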

jakebolewski commented 4 years ago

We could also build the full Slurm job dependency DAG up front, make wait a trivial job that always returns success, and let Slurm figure it out. The issue, IIRC, is that if upstream jobs fail, the dependent jobs have to be manually cleaned out of the Slurm queue.
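A rough sketch of what that could look like, in Python for illustration. It assumes the pipeline is a flat list of shell commands with the string `"wait"` marking sync points. `submit_pipeline` and `sbatch` are hypothetical helpers, not code from this repo. The `sbatch` options used (`--parsable`, `--wrap`, `--dependency=afterok:...`, `--kill-on-invalid-dep=yes`) are documented in the sbatch man page; whether `--kill-on-invalid-dep` fully covers the cleanup concern on a given cluster is untested here.

```python
import subprocess


def sbatch(argv):
    """Submit one job and return its Slurm job id (assumes sbatch is on PATH)."""
    out = subprocess.run(["sbatch", "--parsable", *argv],
                         check=True, capture_output=True, text=True)
    # --parsable prints "jobid" or "jobid;cluster"
    return out.stdout.strip().split(";")[0]


def submit_pipeline(steps, submit=sbatch):
    """Submit every step up front, chaining sync points via Slurm dependencies.

    Returns the Slurm job ids in submission order.
    """
    barrier = None  # Slurm id of the most recent wait job, if any
    group = []      # ids submitted since that wait
    ids = []
    for step in steps:
        if step == "wait":
            # The wait job depends on everything since the last barrier.
            deps = group or ([barrier] if barrier else [])
            command = "true"  # trivial job that always succeeds
        else:
            deps = [barrier] if barrier else []
            command = step
        argv = []
        if deps:
            argv += ["--dependency=afterok:" + ":".join(deps),
                     "--kill-on-invalid-dep=yes"]
        argv.append("--wrap=" + command)
        job_id = submit(argv)
        ids.append(job_id)
        if step == "wait":
            barrier, group = job_id, []
        else:
            group.append(job_id)
    return ids
```

With `afterok`, a failed upstream job leaves its dependents with a dependency that can never be satisfied, which is the cleanup problem described above.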

jakebolewski commented 4 years ago

closed by https://github.com/CliMA/slurm-buildkite/commit/86f69314fa6caf2981efbe0740e4ea4524ccf39e