alsuren opened this issue 2 months ago
WARNING: this is very likely to result in the same crate being built many times in parallel if someone runs the cronjob manually in CI or locally.
We have a mechanism to cancel duplicate builds, though it does waste machine time since the previous in-progress job is simply cancelled.
A better way would be to check whether that job has already been dispatched and is in progress in CI.
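A hedged sketch of what that check could look like, assuming a workflow file named `build-package.yml` and matching on the run title (the repo path and file name are assumptions, and the list-runs endpoint doesn't expose `workflow_dispatch` inputs, so a real check would need the crate/version/arch encoded somewhere visible):

```python
# Sketch only: ask the GitHub API whether a "Build package" run is already
# queued or in progress before dispatching another one.
import json
import os
import urllib.request

def build_already_in_flight(crate: str) -> bool:
    base = "https://api.github.com/repos/cargo-bins/cargo-quickinstall"
    runs = []
    for status in ("queued", "in_progress"):
        url = f"{base}/actions/workflows/build-package.yml/runs?status={status}"
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
        )
        with urllib.request.urlopen(req) as resp:
            runs.extend(json.load(resp)["workflow_runs"])
    # The list-runs endpoint does not return workflow_dispatch inputs, so this
    # relies on the crate name appearing in the run's display title.
    return any(crate in (run.get("display_title") or "") for run in runs)
```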
Ah, actually I didn't think of that. It seems we have our concurrency set correctly:
```yaml
concurrency:
  group: build-package-${{ github.event.inputs.crate }}-${{ github.event.inputs.version }}-${{ github.event.inputs.target_arch }}
```
We don't set `cancel-in-progress`, so enqueueing a new run shouldn't cancel any jobs that have already started, only ones that are still pending (not started), according to https://docs.github.com/en/enterprise-cloud@latest/actions/writing-workflows/choosing-what-your-workflow-does/control-the-concurrency-of-workflows-and-jobs#using-concurrency-in-different-scenarios
Seems like we can just keep pushing build requests for the same package and it will attempt to build it at most twice (one run in progress plus at most one pending in the concurrency group). Seems fine.
In theory we could put a "do we already have a package for this?" check at the top of the builder, but it wouldn't save us from repeat build attempts of packages that fail, so it's probably not worth the bother.
[edit: eventually the exclude logic will kick in in the cronjob, so worst case, we do n+1 or maybe n+2 builds, where n is the limit of builds we allow in the exclude logic]
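For concreteness, that check could be as simple as a HEAD request against the would-be release asset. The URL pattern below is an assumption, not the real naming scheme:

```python
# Sketch of a "do we already have a package for this?" check: HEAD the release
# asset URL and treat any HTTP error (e.g. 404) as "not published yet".
import urllib.error
import urllib.request

def package_already_published(crate: str, version: str, target: str) -> bool:
    url = (
        "https://github.com/cargo-bins/cargo-quickinstall/releases/download/"
        f"{crate}-{version}/{crate}-{version}-{target}.tar.gz"
    )
    try:
        with urllib.request.urlopen(urllib.request.Request(url, method="HEAD")):
            return True
    except urllib.error.HTTPError:
        return False
```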
> would there be any value in including cargo-binstall's tree of fallback architectures here?
Honestly, I think we should just build the crate for all supported targets anyway, but I can understand that you'd want to avoid hitting the rate limit.
In that case, it makes sense to skip the fallback architectures.
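For context, the fallback tree referred to above is roughly this shape. This is an illustrative, assumed subset; the real mapping lives in cargo-binstall:

```python
# Illustrative only: an assumed subset of cargo-binstall's fallback targets.
# If there is no build for the requested target, these would be tried in order.
FALLBACK_TARGETS = {
    "x86_64-unknown-linux-gnu": ["x86_64-unknown-linux-musl"],
    "aarch64-unknown-linux-gnu": ["aarch64-unknown-linux-musl"],
    "aarch64-apple-darwin": ["x86_64-apple-darwin"],  # runs under Rosetta 2
}
```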
> can we make this simpler while keeping it fair and resilient to rate limit problems?
We could utilise branching to save our progress.
We start by creating a branch and saving all the crates to build there, with their versions (cached).
When the "Build package" job is done, we go to that branch and edit it to mark that crate off the queue (sketched below).
We already do this for failed crates, so I don't think we would add too much complexity to the code.
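Something like this rough sketch, assuming the queue is a newline-separated `queue.txt` of `crate@version` entries committed to the `trigger/<arch>` branch (the file name and branch layout are assumptions):

```python
# Rough sketch: pop a finished crate off a queue file committed to the
# trigger/<arch> branch, so progress survives across cronjob runs.
import subprocess

def mark_done(arch: str, crate: str, version: str) -> None:
    branch = f"trigger/{arch}"
    subprocess.run(["git", "fetch", "origin", branch], check=True)
    subprocess.run(["git", "checkout", branch], check=True)
    with open("queue.txt") as f:
        remaining = [line for line in f if line.strip() != f"{crate}@{version}"]
    with open("queue.txt", "w") as f:
        f.writelines(remaining)
    subprocess.run(
        ["git", "commit", "-am", f"mark {crate}@{version} as built"], check=True
    )
    subprocess.run(["git", "push", "origin", branch], check=True)
```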
> can we gain anything by doing things on a crate-by-crate basis (so we make the most of any crates.io index fetches we make)?
Yes, definitely. If we can't do that, maybe we could cache the crates.io index fetch?
> We start by creating a branch and saving all the crates to build there, with their versions (cached).
This is kind of similar to the architecture I had originally: push the next crate name onto the list and commit it to the trigger/$arch branch (which is why it is named trigger/*). Pushing the whole queue on does sound like an interesting idea though. It would allow us to get rid of the randomness, I guess.
> Yes, definitely. If we can't do that, maybe we could cache the crates.io index fetch?
I put an lru_cache around the crates.io fetch, but that will only save us work within a single cronjob run. I suspect that caching for longer than that (GitHub Actions?) would end up more expensive than the crates.io fetch, with the added problem that it would be stale after an hour.
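Roughly like this simplified stand-in (this is not the actual cronjob code; it fetches the sparse index directly just to show where the cache sits):

```python
# With functools.lru_cache, each crate's sparse-index file is fetched at most
# once per cronjob run; the cache dies with the process, so nothing goes stale.
import json
import urllib.request
from functools import lru_cache

def _index_path(crate: str) -> str:
    # Path layout used by the sparse index at https://index.crates.io/
    crate = crate.lower()
    if len(crate) <= 2:
        return f"{len(crate)}/{crate}"
    if len(crate) == 3:
        return f"3/{crate[0]}/{crate}"
    return f"{crate[:2]}/{crate[2:4]}/{crate}"

@lru_cache(maxsize=None)
def latest_version(crate: str) -> str:
    url = f"https://index.crates.io/{_index_path(crate)}"
    with urllib.request.urlopen(url) as resp:
        lines = resp.read().decode().splitlines()
    versions = [json.loads(line) for line in lines]  # one JSON object per version
    return next(v["vers"] for v in reversed(versions) if not v["yanked"])
```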
> Pushing the whole queue on does sound like an interesting idea though. It would allow us to get rid of the randomness, I guess.
Yeah, and we can "save" the progress without having to make another GitHub REST API call, by recording the built crates in the branch.
> I suspect that caching for longer than that (GitHub Actions?) would end up more expensive than the crates.io fetch, with the added problem that it would be stale after an hour.
Yeah, just caching it within the cronjob run will be fine.
What are we looking for?

1. The amount of compute time that we waste trying to build any single broken package has an upper bound.
2. We should be able to respond within a small number of hours to a new version of a popular crate (that people are requesting from us).
3. The amount of compute time that we spend on checking and building packages should otherwise be shared equally among the top N popular crates on crates.io.
(Honestly, I would love it if 2) was actually on-demand while you wait, like how https://gobinaries.com works, but rust compile times won't make that super-easy)
I reckon the shuffled list of shuffled lists thing will be reasonably easy to implement, and doesn't require any re-architecting, so I will probably have a stab at that once we have clients deployed and reporting to the new stats server (sorry: I've not been working on the binstall part of this because something came up at home).
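One possible reading of the "shuffled list of shuffled lists" idea, with a made-up tranche size, would be:

```python
# Split the top-N crates into tranches, shuffle inside each tranche and shuffle
# the tranches themselves, then walk the flattened result. The tranche size is
# an arbitrary choice here.
import random

def build_order(top_crates: list[str], tranche_size: int = 100) -> list[str]:
    tranches = [
        top_crates[i : i + tranche_size]
        for i in range(0, len(top_crates), tranche_size)
    ]
    for tranche in tranches:
        random.shuffle(tranche)  # fairness within a tranche
    random.shuffle(tranches)     # fairness between tranches
    return [crate for tranche in tranches for crate in tranche]
```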
Currently we do:
[edit: This is problematic because
Here is a slightly convoluted scheme that might allow us to fairly make progress while also prioritising checking of things that are likely to be useful to people
Questions:
1. would there be any value in including cargo-binstall's tree of fallback architectures here?
2. can we make this simpler while keeping it fair and resilient to rate limit problems?
3. can we gain anything by doing things on a crate-by-crate basis (so we make the most of any crates.io index fetches we make)?