This is not a problem with multiple builds, just with multiple jobs that include a wait-type step.
Ok, the wait issue is just a proxy for any mildly complex DAG with sync dependencies.
If we assign a job id to a non-script task (wait, etc.), the build runner will fail. Meanwhile another agent can be assigned to a dependent task that depends on that wait sync point, but this will also fail: the task is visible but cannot be scheduled until the commands before the wait finish successfully.
This seems to be a limitation of the way we assign one build runner to one task id. A workaround would be to assign all jobs before the wait (these are visible on the build object), poll for their success at the sync point, then issue all jobs after the sync point.
Given the way the API works, the cron job can remain stateless, at the cost of some added latency.
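Roughly what I have in mind for that workaround (a sketch only; `submit_slurm_job`, `jobs_succeeded`, and the script paths are placeholders, not our actual API — only the `sbatch`/`sacct` invocations are real Slurm):

```python
import subprocess
import time

def submit_slurm_job(script_path):
    """Submit a step script via sbatch and return the Slurm job id (placeholder helper)."""
    out = subprocess.check_output(["sbatch", "--parsable", script_path], text=True)
    return out.strip().split(";")[0]

def jobs_succeeded(job_ids):
    """Check job states with sacct; True only once every job reached COMPLETED."""
    out = subprocess.check_output(
        ["sacct", "-n", "-X", "-o", "State", "-j", ",".join(job_ids)], text=True)
    states = out.split()
    return bool(states) and all(s == "COMPLETED" for s in states)

def run_build(pre_wait_scripts, post_wait_scripts, poll_seconds=30):
    # 1. Issue everything before the wait / sync point.
    pre_ids = [submit_slurm_job(s) for s in pre_wait_scripts]
    # 2. Poll Slurm until the sync point clears (in practice this polling would
    #    be done by the stateless cron job on each tick rather than in a loop).
    while not jobs_succeeded(pre_ids):
        time.sleep(poll_seconds)
    # 3. Only then issue the jobs after the sync point.
    return [submit_slurm_job(s) for s in post_wait_scripts]
```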
We could also build the full Slurm job dependency DAG up front and just have wait be a trivial job that always returns success, letting Slurm figure it out. The issue, IIRC, is that when upstream jobs fail, the dependent jobs need to be manually cleaned out of the Slurm queue.
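If we went that route, the submission side might look roughly like this (step scripts are made up; only the sbatch flags are real Slurm), including the cleanup caveat mentioned above:

```python
import subprocess

def sbatch(args):
    """Run sbatch --parsable and return the new Slurm job id."""
    out = subprocess.check_output(["sbatch", "--parsable", *args], text=True)
    return out.strip().split(";")[0]

# Steps before the sync point run independently.
step_a = sbatch(["step_a.sh"])
step_b = sbatch(["step_b.sh"])

# The wait step is a trivial job that only runs (and trivially succeeds)
# once everything before it completed successfully.
wait_id = sbatch([f"--dependency=afterok:{step_a}:{step_b}", "--wrap", "true"])

# Steps after the sync point hang off the wait job.
step_c = sbatch([f"--dependency=afterok:{wait_id}", "step_c.sh"])

# Caveat: if step_a or step_b fails, wait_id and step_c sit in the queue as
# DependencyNeverSatisfied and have to be cleaned up by hand (scancel), or the
# jobs must be submitted with --kill-on-invalid-dep=yes so Slurm cancels them.
```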
Pipelines with multiple steps currently do not work; I think we have to assign one Slurm job to one step UUID instead of to a job id.