cirruslabs / cirrus-ci-docs

Documentation for Cirrus CI 📚
https://cirrus-ci.org
MIT License
351 stars 109 forks

Add `weight` field so tasks can be prioritized. #525

Open eksperimental opened 4 years ago

eksperimental commented 4 years ago

Description

A `weight` field could be added to tasks so that a priority order can be set for their execution.

Context

While tasks can be run sequentially by using the `depends_on` field, there are cases where we want to run tasks in parallel but start some before others. For example, tasks that have `skip_notification: true` and/or `allowed_failures: true` could be run at the end of the queue, but still concurrently, without having to depend on any other task finishing.
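As a purely hypothetical sketch, a `.cirrus.yml` using the proposed field might look like this (the `weight` field does not exist today; the name and values are illustrative only):

```yaml
lint_task:
  weight: 100      # hypothetical: schedule as early as possible
  script: make lint

docs_task:
  weight: -100     # hypothetical: low priority, scheduled after heavier tasks
  script: make docs
```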

Anything Else

This proposal stems from the following discussion: https://github.com/cirruslabs/cirrus-ci-docs/issues/524

Thank you!

fkorotkov commented 4 years ago

Just to clarify: this feature will put weights on the scheduling priority of tasks. It will therefore affect when tasks start, and it won't be a replacement for `depends_on`, since some tasks with lower weights can start executing while tasks with larger weights are still running.

RDIL commented 4 years ago

Yup.

timwoj commented 2 years ago

I'd love to see this implemented as well. One of our tasks takes about 4 times as long as the others. All of the tasks are independent of each other, so it'd be nice if that one task could always be scheduled earlier than the others, so it can run while the others are being scheduled, delayed, etc.

fkorotkov commented 2 years ago

In the ideal situation all tasks are scheduled and run in parallel, which is why we've never had any prioritization features.

@timwoj do you hit this issue when you reach the concurrency limits? When a user reaches the limit, we do put the scheduling requests in a "queue"-like structure where we could add prioritization/weight. But in the first place, all tasks are processed in parallel and fight for scheduling on a first-come, first-served basis.

If we add prioritization to that queue, we still won't guarantee that such long tasks are always scheduled first. But when scheduling is delayed because concurrency limits were reached, we would be able to guarantee that the heavy tasks are scheduled out of the queue first.

For example, say you have 10 parallel tasks that request 4 cores each. Nine tasks usually finish in 5 minutes and one is very slow and finishes in 20. In the worst-case scenario, 4 fast tasks are scheduled first and the rest are added to the queue, with the slow task at the front.
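A quick greedy simulation of that scenario (a sketch only: the `makespan` helper and the 4 concurrent slots are assumptions for illustration, not Cirrus internals):

```python
import heapq

def makespan(durations, slots):
    """Wall-clock time when tasks start in list order as slots free up."""
    free = [0] * slots            # next-free time of each concurrent slot
    heapq.heapify(free)
    finish_last = 0
    for d in durations:
        start = heapq.heappop(free)
        finish = start + d
        finish_last = max(finish_last, finish)
        heapq.heappush(free, finish)
    return finish_last

tasks = [5] * 9 + [20]            # nine 5-minute tasks, one 20-minute task
print(makespan(tasks, 4))                        # slow task queued last -> 30
print(makespan(sorted(tasks, reverse=True), 4))  # slow task first -> 20
```

With the slow task at the front of the queue the build finishes in 20 minutes instead of 30.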

ckreibich commented 2 years ago

@fkorotkov we always hit the concurrency limit in our setup: we have nearly two dozen tasks, mostly different Linux containers, plus FreeBSD, all running with a minimum of 4 cores. So prioritizing when your queueing mechanism kicks in is exactly what we're hoping for.

(I work with @timwoj; same setup.)

fkorotkov commented 2 years ago

I just realized that we can reuse the `timeout_in` field for prioritization.

Right now the queue is sorted only by `creationTimestamp`, but I think it's reasonable to include `timeout_in` in the sorting criteria, so that tasks with a longer expected timeout are scheduled first.
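That sort could look roughly like this (a sketch; the dataclass and field names only mirror the `creationTimestamp` and `timeout_in` fields mentioned above, not actual Cirrus code):

```python
from dataclasses import dataclass

@dataclass
class QueuedTask:
    creation_timestamp: int  # seconds since epoch
    timeout_in: int          # seconds

queue = [
    QueuedTask(creation_timestamp=100, timeout_in=3600),
    QueuedTask(creation_timestamp=101, timeout_in=7200),
    QueuedTask(creation_timestamp=102, timeout_in=3600),
]

# Longer timeout first; creation time breaks ties, preserving FIFO
# order among tasks with equal timeouts.
queue.sort(key=lambda t: (-t.timeout_in, t.creation_timestamp))
print([t.creation_timestamp for t in queue])  # [101, 100, 102]
```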

bbannier commented 2 years ago

I just realized that we can reuse the `timeout_in` field for prioritization.

I wonder whether it wouldn't still make sense to allow specifying an explicit task priority. If you implicitly derived a priority proportional to, say, `-timeout_in`, it would become impossible to run a task that can finish (and fail!) quickly as the first task. Such a task could, e.g., run basic sanity checks.

I don't know whether it is possible to fail a whole suite when a single task fails, but in my daily work I can sometimes fix something found in a failed task faster than it takes the whole suite to finish; a new push to the PR then cancels the still-outstanding tasks from the old state. If `-timeout_in` were a task's priority I would have to wait longer, which seems ideal neither for my workflow nor for Cirrus from a utilization standpoint.

fkorotkov commented 2 years ago

I'm a bit hesitant about introducing an explicit task priority because of users who pay with compute credits or bring their own compute. Those users don't have queueing at all: everything scales up and down to execute tasks as fast and as efficiently as possible.

Queuing makes sense only for the free tier or for folks on a private plan who are still limited, but at a higher concurrency.

Ideally, Cirrus should use historic data to estimate the expected duration of a task and use that estimate to prioritize the queue. Plus it should take into account the nature of the build: de-prioritize tasks for PRs and favor tasks for the default branch, releases, and tags. We are not there yet, but we'll work on this in the next couple of weeks.

As for the "fail fast" option: do you think it would be useful to have a Starlark module that can automatically cancel a build in case of failures of particular tasks?

bbannier commented 2 years ago

Queuing makes sense only for the free tier or for folks on a private plan who are still limited, but at a higher concurrency.

I'd argue that it would still make sense to expose a knob for the scheduler, which I assume has to exist (even if it is only FIFO).

Ideally, Cirrus should use historic data to estimate the expected duration of a task and use that estimate to prioritize the queue. Plus it should take into account the nature of the build: de-prioritize tasks for PRs and favor tasks for the default branch, releases, and tags. We are not there yet, but we'll work on this in the next couple of weeks.

The use case I mentioned above was less about whether Cirrus needs a fail-fast mode and more an example of where the implicit scheduling algorithm you proposed does the exact opposite of what would be useful in that particular scenario. The more complex way of computing an implicit priority seems to have similar issues.

I am not sure whether giving users a way to specify relative job priorities (inside a single suite run) is prohibitively hard on your side, but from the user side this seems like a relatively standard and well-understood feature of schedulers (e.g., in orchestrators like k8s, tasks can often have a priority).

A model I could imagine: jobs by default get an implicit integer priority=0, and users can set higher priorities for jobs which should be scheduled earlier (e.g., priority=100), or lower priorities (e.g., priority=-100) for jobs which should be scheduled after all higher-priority jobs have been placed. If tasks have the same priority it might still make sense to use, e.g., historic data to compute a sub-priority as a tie-breaker. A scheduler would then try to place tasks in order of decreasing priority.
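A rough sketch of that model (all names and numbers are hypothetical; `expected_duration` stands in for a historic-data sub-priority used as the tie-breaker):

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    priority: int = 0           # explicit, user-set; defaults to 0
    expected_duration: int = 0  # hypothetical tie-breaker, e.g. from history

def scheduling_order(tasks):
    # Higher priority first; among equal priorities, longer expected
    # duration first as the sub-priority.
    return sorted(tasks, key=lambda t: (-t.priority, -t.expected_duration))

tasks = [
    Task("lint", priority=100, expected_duration=60),   # fast sanity check
    Task("slow-build", expected_duration=1200),
    Task("unit-tests", expected_duration=300),
    Task("flaky-suite", priority=-100, expected_duration=600),
]
print([t.name for t in scheduling_order(tasks)])
# ['lint', 'slow-build', 'unit-tests', 'flaky-suite']
```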

As for the "fail fast" option: do you think it would be useful to have a Starlark module that can automatically cancel a build in case of failures of particular tasks?

I haven't used Starlark with Cirrus so I cannot comment on whether that would be a good place for such functionality, but in general a fail-fast mode can be useful. GH Actions enables this mode by default in matrix jobs (which probably helps them reduce resource contention in their free tier): https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions#jobsjob_idstrategyfail-fast.

That being said, fail-fast seems orthogonal to scheduling priority and should live at the matrix or suite level (priority should be a task- or lower-level property). IMO, whether a task is fail-fast should probably not be taken into account when computing priority, because that would again lead to opaque behavior.

ckreibich commented 2 years ago

Just a ping here to see if we can make progress. @fkorotkov, you said:

Queuing makes sense only for the free tier or for folks on a private plan who are still limited, but at a higher concurrency.

Wouldn't it be possible to add explicit queueing and just have it be a no-op, possibly with a warning message, for folks who aren't limited? I doubt they would care, and meanwhile those of us who keep getting throttled would save a bunch of time.

ckreibich commented 1 year ago

Just a half-year ping to confirm that we will still buy you buckets of :beer: if you implement this. Thanks. :smile: