TheKnarf opened 2 months ago
CI builds took almost 30 minutes to run.
Release build took over 30 minutes when building for all targets, less than 15 minutes after disabling everything except for the web build.
We intentionally made it do as much as possible, assuming that most users would open source their code by default (which means unlimited GitHub Actions minutes) or be ready to deal with the consequences of having a private repository, i.e. fiddling with the workflow.
That said, it would be fairly easy to enable or disable the builds based on a `cargo generate` flag. How would you feel about that?
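For reference, a minimal sketch of what such a flag could look like, assuming a boolean placeholder named `native_builds` (a hypothetical name) in `cargo-generate.toml`:

```toml
# cargo-generate.toml — asked once when the user runs `cargo generate`
[placeholders.native_builds]
type = "bool"
prompt = "Enable native (Windows/Linux/macOS) release builds?"
default = false
```

The release workflow template could then wrap the native build jobs in `{% if native_builds %} … {% endif %}` (cargo-generate templates use Liquid), so users who only need the web build never pay for the other targets.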
The CI builds should take like a minute at most when cached, and an uncached release takes about 10 minutes for me. Did something go wrong there? Do private repos get less powerful CI agents? @TimJentzsch `cargo-cache` works on private repos, right?
Under the hood it uses the GitHub Actions cache, I'm not aware of any restrictions for private repositories.
Yeah I'm curious if caching wasn't working, or otherwise why it was taking so long. Looking at my own game's CI, I see this error when trying to save the cache (https://github.com/benfrankel/blobo_party/actions/runs/10139032444):
```
zstd: error 70 : Write error : cannot write block : No space left on device
/usr/bin/tar: cache.tzst: Cannot write: Broken pipe
/usr/bin/tar: Child returned status 70
/usr/bin/tar: Error is not recoverable: exiting now
Warning: Failed to save: "/usr/bin/tar" failed with error: The process '/usr/bin/tar' failed with exit code 2
```
This started happening on the same CI run that spontaneously started recompiling from scratch, making CI take 5x longer (~15 minutes). Which is very strange, because `cargo --version` / `Cargo.toml` / `Cargo.lock` weren't touched as part of that commit. Maybe related: this is with 5 saved caches at 3 GB each.
On the other hand, my releases were all under 10 minutes because I reused the same tag every time and hit the cache.
GitHub runners for private repos are smaller than for public repos, except macOS hosts:
- private: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for--private-repositories
- public: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories
@mockersf thanks! That explains the performance issues then. Nothing we can fix, but it might be good to point it out in the docs.
That leaves the spontaneously broken cache, which sounds like a bug in either GitHub's cache or `cargo-cache`. @TimJentzsch, what do you think? Is there something we can do about this?
Edit: oh, looking at the logs, the machine simply ran out of space! Yeah, I've had that happen to me as well a couple of times. Don't think there's much we can do there other than restarting a build. Although you're right, this shouldn't invalidate the cache. Hmm.
Hmm, that's interesting. GitHub should automatically delete old caches when the cache limit is exceeded, but maybe it has a delay.
Maybe we can start optimizing the size of the cache though.
So I noticed that in the CI run I linked, the `Docs` and `Tests` jobs actually restored their cache from the Release workflow that ran right before as a fallback. Then `Docs` was able to save its cache but `Tests` was not, so every subsequent run of `Tests` used the `Docs` cache instead. It seems like cache thrashing / running out of cache space in my particular case.
Ah shoot. We should probably remove the last cache key fallback. It doesn't make sense for `Check` to fall back to a cache from `Test`, for example, because they need different builds anyway; it just increases the cache size for nothing.
Well, I think the fallback itself is okay and shouldn't increase cache size; the problem is that `Tests` was unable to save its own cache afterwards for some reason.

Unless using a fallback means that when the cache is saved at the end of the job, it will also include any irrelevant stuff it downloaded from the fallback, thus increasing cache sizes every time a fallback occurs? If that's the case... maybe it would be better to have no fallback at all.
Yes, the stuff from the fallbacks will also be saved again, because it's in the same folder. But without fallbacks I think we will have a cache miss too often... If we use the fallback from the same job it should be fine, though. Unrelated jobs can double the cache size in the worst case.
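To illustrate with plain `actions/cache` (which the thread says `cargo-cache` uses under the hood — `cargo-cache`'s own inputs may look different), a same-job-only fallback could be sketched like this:

```yaml
- uses: actions/cache@v4
  with:
    path: |
      ~/.cargo/registry
      target/
    # Exact key: only hits when the lockfile is unchanged.
    key: tests-${{ runner.os }}-${{ hashFiles('**/Cargo.lock') }}
    # Fallback stays within this job's own caches. A broader prefix
    # (or an empty one) would restore another job's target/ and then
    # re-save it, growing the cache a little on every fallback.
    restore-keys: |
      tests-${{ runner.os }}-
```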
Could there be a situation where cache Z is a fallback from cache Y, which is a fallback from cache X, ... causing an ever-ballooning cache size over time? I wonder if there's a way to prune unused stuff from `target/`, but I would be surprised.
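There are community tools aimed at exactly that; a sketch assuming `cargo-sweep` (a third-party crate, not part of the template), run as a step before the cache is saved:

```yaml
- name: Prune stale artifacts from target/
  run: |
    cargo install cargo-sweep
    # Remove files in target/ that haven't been touched in the
    # last 7 days, so they aren't re-saved into the cache forever.
    cargo sweep --time 7
```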
Yes, I think in theory that's possible.
We should adjust the fallback in `cargo-cache` a bit.
Otherwise I guess it's a good idea to clear your caches in CI occasionally.
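Clearing caches can even be automated; a sketch of a scheduled workflow using the GitHub CLI's `gh cache` commands, assuming deleting everything weekly is acceptable:

```yaml
name: Clear caches
on:
  schedule:
    - cron: "0 3 * * 0" # every Sunday at 03:00 UTC
jobs:
  clear:
    runs-on: ubuntu-latest
    permissions:
      actions: write # required to delete caches
    steps:
      - run: gh cache delete --all --repo "$GITHUB_REPOSITORY"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```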
Is it alright if I move this issue to `cargo-cache`?
IMO that makes sense for cache size specifically, but this issue should remain here as well, since we ought to consider the fact that runners for private repos have worse performance and are not free.
Would a solution be to ask you in `cargo generate` which builds you want?
Yes, and a note in the workflows doc that mentions the issue with private runners, so users hopefully see that when they're setting up CI.
I created an issue for this on the `cargo-cache` side: https://github.com/Leafwing-Studios/cargo-cache/issues/22
Here's another issue on the `cargo-cache` side for the root cause / optimizing the cache size down even further: https://github.com/Leafwing-Studios/cargo-cache/issues/24
Remaining tasks:
A single game jam was all it took.
Too late, I realized that I should have disabled the Windows/Linux/macOS builds, since I only really needed the web build for the game jam. I think the template should probably disable those by default and require people to enable them if they need them.
Maybe there are other things one could do to reduce the number of minutes it uses up? Maybe running `cargo fmt` and `cargo test` isn't necessary by default on main. Maybe one could run `cargo test` without triggering it to parse the docs folder? Are there other improvements one could make?
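One concrete option, if part of the test time goes into compiling and running doctests: cargo skips doctests when given `--all-targets` (documented cargo behavior; whether that is what pulls in the docs folder here is an assumption):

```yaml
- name: Run tests (without doctests)
  run: cargo test --workspace --all-targets
```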