Open wchargin opened 4 years ago
@wchargin this is an interesting topic thanks for bringing it up.
In this example, if we had a way to skip storing the cache unless the run was on master, you could use the git commit as part of your key and get the desired behavior without writing a new cache for each run of a pull request.
Do you think that would work for you?
@chrispat: Yeah, that sounds reasonable! At a glance, I don’t see a way to save the cache only if it’s running on master… but perhaps I could hack something together that restores the cache to its initial state at the end of the job for builds that aren’t running on master, just as a proof of concept to see how this strategy works.
If I understand correctly, we’d still be proliferating caches with each commit to master, right? I understand that cache eviction kicks in, but it still seems unfortunate, especially if I have to worry about other caches (e.g., `node_modules`) being evicted prematurely.
For the record, cabal's Nix-style store/cache also falls into this category; see my comment at https://github.com/actions/cache/pull/38#discussion_r343915940
@wchargin given that the version of the sources is part of the Bazel caching algorithm, what key do you think should be used to prevent a huge number of updates? My assumption is that Travis is uploading new caches essentially every build if they are just looking at changes to the cache directory.
Yes, Travis uploads new caches every build. And you’re right that this is a performance problem: Travis re-uploads the entire cache directory from scratch every build, which can take minutes. (Also, the build doesn’t report success until this upload has completed, and this upload can cause an otherwise successful build to time out and fail, which is super frustrating…)
We do want to update the cache on every build, but it should be cheap to perform a partial update of only the files that changed, `rsync`-style. The action cache will be updated on basically every commit, but is small (~500K). The fetch caches will be updated very rarely, and can be large (hundreds of MB). And the build cache for any given target will be updated whenever that target changes, but not if only unrelated targets change, and can be of varying sizes (typically fairly small, but there are lots of them).
I see that `actions/cache` currently tars and gzips everything into a single bundle, but it would be much more effective for caches in the style of Bazel/Nix/Cabal to support incremental updates, perhaps by using a content-addressable store like that of Git itself. What do you think?
For something like bazel I wonder if having a truly remote cache is actually a better option https://github.com/buchgr/bazel-remote. This is not something we are going to get around to implementing anytime soon but it is something we can consider for the future.
The model we have for caching enables the user to control the key and also requires that all caches are immutable by key. While that is not ideal for all scenarios, it does work generally well for a large number of different technology stacks and scenarios. This immutable nature makes incremental updates untenable and likely not possible. Even if we could incrementally update the cache, the download on the next run would still have to be the entire cache, as we have to provision a fresh VM for each job.
I believe I have a similar use case to the issue described here, and ideally would like to see an `update-cache` option added to the action, but I've worked around the issue by leveraging the `restore-keys` option.
A project of mine consists largely of C files, and naturally a significant portion of my CI cycle time is spent in compilation. To speed things up, I've employed ccache, which will opportunistically recycle previously built object files when it detects that the compilation would be the same for the current build. This has a dramatic performance improvement on CI times. In order to do this though, I need some persistence of storage between workflow runs in order to save and restore ccache's cache directory. Of course, as the code base evolves, the cache of object files will change too.
I was pleased to discover actions/cache, as it fits my use case very nicely; but, I was surprised to find that when a cache hit occurs, actions/cache will not attempt to update the cache at all, and there's not an option to request such update.
To work around this, I do the following:
```yaml
- name: Initialize Compiler Cache
  id: cache
  uses: actions/cache@v1
  with:
    path: /tmp/xqemu-ccache
    key: cache-${{ runner.os }}-${{ matrix.configuration }}-${{ github.sha }}
    restore-keys: cache-${{ runner.os }}-${{ matrix.configuration }}-
```
It works like this: when the cache is loaded for a workflow, there will be an initial cache miss, because the cache key contains the current commit SHA. actions/cache will fall back to the most recently added cache via the `restore-keys` prefix-matching policy, then, after the build has completed, create a new cache entry to satisfy the initial cache miss.
This solution seems to work very well for me, and hopefully it will be useful to others with a similar use case. Ideally, though, I think actions/cache should just support updating the cache, perhaps to a new immutable revision, as I have done above.
Having the caches be immutable makes a lot of sense. Immutable caches seem perfectly compatible with incremental updates—in fact, this is a strong point of Git. If your repository has 100 top-level directories each with 100 files, then you have 101 trees and 10000 blobs; if you change just one of those files, then you have 103 trees and 10001 blobs, not 202 trees and 20000 blobs. Does this make sense, or am I missing something?
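For what it's worth, the object arithmetic above can be demonstrated with a toy content-addressed store. This is only a sketch of the dedup idea, not Git's real object format:

```python
import hashlib

# Toy content-addressed store: hash -> object bytes.
store = {}

def put(data: bytes) -> str:
    # Storing the same content twice is free: it maps to the same key.
    key = hashlib.sha1(data).hexdigest()
    store[key] = data
    return key

def snapshot(repo: dict) -> str:
    """Store a repo {dirname: {filename: contents}}; return the root hash."""
    dir_hashes = []
    for d in sorted(repo):
        blob_hashes = [put(repo[d][f].encode()) for f in sorted(repo[d])]
        dir_hashes.append(put(("tree:" + ",".join(blob_hashes)).encode()))
    return put(("root:" + ",".join(dir_hashes)).encode())

# 100 top-level directories with 100 files each.
repo = {f"dir{i}": {f"file{j}": f"contents {i} {j}" for j in range(100)}
        for i in range(100)}
snapshot(repo)
print(len(store))  # 10101 objects: 10000 blobs + 100 directory trees + 1 root

repo["dir0"]["file0"] = "changed"
snapshot(repo)
print(len(store))  # 10104: only one new blob, one new tree, one new root
```

Taking a second snapshot after changing one file adds only three objects; everything unchanged is shared with the first snapshot.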
A truly remote cache is an appealing option, but comes with a lot more operational overhead for the user. Storing files is much easier than running a server.
Downloading the full latest cache on each run may not be perfect, but it’s still an improvement over rebuilding all the artifacts, faster by about 20 minutes in my case.
@wchargin I agree that immutability is acceptable on the condition that we can restore and create a new cache as I have described above (though it can be quite wasteful as you mentioned). My guess is that this particular use case will be desirable for many projects. Perhaps the documentation could simply be updated to demonstrate this type of use case? To me, it wasn't immediately obvious. My suggestion would be to mention using `${{ github.sha }}` in the key.
Right; immutability is space-wasteful if the caches are stored independently (which will happen if you use `${{ github.sha }}`), but not if they’re stored as part of one content-addressed store (which would require changes to the `actions/cache` implementation).
> A truly remote cache is an appealing option, but comes with a lot more operational overhead for the user. Storing files is much easier than running a server.
I was thinking we would run that server on behalf of the user, so the operational overhead should be essentially the same as it would be for the existing cache action. I am not 100% sure that is the best option, but it seems like it might be a really good one for build systems that support it.
> A truly remote cache is an appealing option, but comes with a lot more operational overhead for the user. Storing files is much easier than running a server.
>
> I was thinking we would run that server on behalf of the user so the operational overhead should be essentially the same as it would be for the existing cache action.
Oh, that would be fantastic! Being able to just point Bazel to a remote cache provided by a GitHub-managed action would be a huge value-add for us compared to other CI services.
Similar use case: `~/.cache/sccache` for sccache works like the Bazel cache. For now it's probably easier to point sccache at S3 or GCS to avoid the issues described above, and it would be nice if GitHub ran an sccache store as well.
Hi!
I have the same issue, I think, with the `composer` cache in PHP.
Composer saves every download it makes in its cache, and most of the time this cache is used globally. For example, on a developer machine the cache is growing all the time with new releases.
I've been using `github.sha` in my cache keys, since it allows me to re-save the cache and avoid the case where it hits the cache but new versions of dependencies exist, so it's always downloading them since they're not in the cache.
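A sketch of that pattern for Composer (the cache path here is an assumption; the real location comes from `composer config cache-dir`):

```yaml
- uses: actions/cache@v3
  with:
    path: ~/.cache/composer   # assumed path; check `composer config cache-dir`
    key: composer-${{ runner.os }}-${{ github.sha }}
    restore-keys: composer-${{ runner.os }}-
```

Every run misses on the exact key, restores the newest prefix match, and saves a fresh cache under the current commit's SHA.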
Specifically for Bazel: the cache protocol is pretty simple. I wonder if it would be feasible to write a service that simply proxies to GitHub's own `artifactcache` service and stand it up locally? Not sure how many cache keys GitHub allows.
Adding onto this with a different use case: I use `renv` with R for package management, and I would like the cache to get updated if the build succeeds. Since `renv` will update whatever packages need to be updated based on the `renv.lock` file, I'd like to simply update the cache every time that runs.
This is important because, for the project I'm working on, rebuilding all of the dependencies takes ~50 minutes, so fully invalidating the cache is really costly. It'd be great to be able to "update" the cache (by overwriting it) on successful runs.
Additionally, I just added a new (very minor) thing to be cached in a different location, and in order to start using it I have to rebuild the entire cache and pay the 50-minute penalty. The update-cache workflow here would be very helpful.
I would favor #135 (immutable slots, but the key is computed on the save step, so it can hash the resulting state) as a more efficient solution. The choice of key to designate an exact hit of the cache state is up to the developer: for build systems that don't have `.lock`-style files, it could be e.g. a hash of the `ls -lR` output of the build input directories.
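A rough sketch of that hashing idea with today's action (directory names are placeholders; note this computes the key before the build rather than at save time, which is exactly what #135 would improve):

```yaml
- id: inputs
  shell: bash
  run: echo "hash=$(ls -lR src | sha256sum | cut -d' ' -f1)" >> "$GITHUB_OUTPUT"
- uses: actions/cache@v3
  with:
    path: build/   # placeholder for the build output directory
    key: build-${{ runner.os }}-${{ steps.inputs.outputs.hash }}
    restore-keys: build-${{ runner.os }}-
```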
This issue is stale because it has been open for 365 days with no activity. Leave a comment to avoid closing this issue in 5 days.
Since this task is still open, what's the current best practice for Bazel caching + GitHub Actions? Does someone have a snippet of their GitHub workflow they can share?
Update: Just sharing the CI pipeline YAML with caching that we went with, hopefully it'll help the next person who lands on this task. (A slightly more permissive approach than @nanddalal's above.)
This issue is stale because it has been open for 200 days with no activity. Leave a comment to avoid closing this issue in 5 days.
I still think this is useful to have and not close
I think this will also help reduce the overall cost of compute resources on GitHub actions, as many open source projects can minimize the GitHub actions minutes they use for every run.
So... If you're willing to use an a/b system, you could probably do something like:

```yaml
- uses: actions/cache/restore@v3
  with:
    key: preferred
    restore-keys: fallback
- run: do-work
- if: no-cache
  uses: actions/cache/save@v3
  with:
    key: fallback
- if: no-cache
  uses: actions/cache/save@v3
  with:
    key: preferred
- if: used-preferred-cache
  uses: ./delete-cache
  with:
    key: fallback
- if: used-preferred-cache
  uses: actions/cache/save@v3
  with:
    key: fallback
- if: used-fallback-cache
  uses: actions/cache/save@v3
  with:
    key: preferred
- if: used-preferred-cache
  uses: ./delete-cache
  with:
    key: preferred
- if: used-preferred-cache
  uses: actions/cache/save@v3
  with:
    key: preferred
```
Notes:
- `used-fallback-cache`, `used-preferred-cache`, and `no-cache` aren't technical things, but actions/cache has outputs that you can use to construct the concepts
- depending on what the `with:`'s will be, it might be simpler just to have lots of steps than to try to be incredibly fancy about it
- if you turn `delete-cache` into an action, you could wrap the entire delete+save pattern into an action
- `./delete-cache` can be implemented using the APIs that were made available circa June 27, 2022: https://github.blog/changelog/2022-06-27-list-and-delete-caches-in-your-actions-workflows/
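A hedged sketch of what `./delete-cache` might look like as a composite action, using the cache-deletion REST endpoint from that changelog (token plumbing and error handling are assumptions, not a tested implementation):

```yaml
# ./delete-cache/action.yml (sketch)
name: delete-cache
inputs:
  key:
    required: true
runs:
  using: composite
  steps:
    - shell: bash
      env:
        GH_TOKEN: ${{ github.token }}
      run: |
        # Delete all cache entries matching the given key.
        curl -X DELETE \
          -H "Authorization: Bearer $GH_TOKEN" \
          -H "Accept: application/vnd.github+json" \
          "https://api.github.com/repos/${{ github.repository }}/actions/caches?key=${{ inputs.key }}"
```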
This issue is stale because it has been open for 200 days with no activity. Leave a comment to avoid closing this issue in 5 days.
Bots suck.
I think this can be closed as it's now released in v4
I don't see how v4 changes anything. Either it was already possible (and I think my suggestions and others show that there are ways to do something) or it might still not be possible.
If it's now possible as of v4, it'd be nice if someone put together an actual example of how to do it.
My bad. v4 has a `save-always` option. But this would be more like a `save-overwrite` option?
I mean, I'd probably just use an epoch time value with a fallback of none:
```yaml
key: cache-${{ steps.time.outputs.epoch }}
restore-keys: cache-
```
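Spelled out as full steps (the cache path is a placeholder):

```yaml
- id: time
  shell: bash
  run: echo "epoch=$(date +%s)" >> "$GITHUB_OUTPUT"
- uses: actions/cache@v4
  with:
    path: path/to/cache   # placeholder
    key: cache-${{ steps.time.outputs.epoch }}
    restore-keys: cache-
```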
That'd result in it always writing one. Older caches will get wiped out as they become least recently used. Sure, you pay a bit to store a duplicate of the cache (or you could use `actions/cache/restore` and `actions/cache/save` and only conditionally call `actions/cache/save` if you made any changes...), but, so what?
> but, so what?
That excess space usage causes other caches to get dropped too.
Then use `restore` & `save` separately and use an `if:` to only use `save` when you have changes.
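That split might look like the following (the path, build command, and `v4` tag are placeholders; `cache-hit` is an output of the restore step):

```yaml
- id: restore
  uses: actions/cache/restore@v4
  with:
    path: ~/.ccache        # placeholder cache directory
    key: ccache-${{ github.sha }}
    restore-keys: ccache-
- run: make                # placeholder build step
- if: steps.restore.outputs.cache-hit != 'true'
  uses: actions/cache/save@v4
  with:
    path: ~/.ccache
    key: ccache-${{ github.sha }}
```

A stricter condition could compare the cache contents before and after the build instead of just checking for an exact-key hit.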
If you're being really aggressive, you might be able to portion the cache into lots of pieces and have steps to calculate and retrieve/save them.
There will be a trade-off between how many steps you need to run and how big your cache pieces are.
This issue is stale because it has been open for 200 days with no activity. Leave a comment to avoid closing this issue in 5 days.
This is still an issue worth resolving
I’d like to use `actions/cache` to cache my Bazel build state, which includes dependencies that have been fetched, binaries and generated code that have been built, and results for tests that have run. Bazel is a hermetic build system, so the standard Bazel pattern is to always use a single cache. Bazel will take care of invalidation at a fine-grained level: if you only change one source file, it will only re-build and re-test targets that depend on that source file.
Thus, the pattern that makes sense to me for Bazel projects is to always fetch the cache and always store the cache. We can always fetch the cache by using a constant cache key, but then the cache will never be stored. Bazel doesn’t have a single `package-lock.json`-style file that can be used as a cache key; it’s the combination of all build and source files in the whole repository. We could use the Git tree (or commit) hash as a cache key, but this would lead to storing a mountain of caches, which seems wasteful.
Ideally, the fetched cache would be taken from `origin/master`, but really taking it from any recent commit should be fine, even if that commit was in a broken or failing state.
On my repository, it takes 33 seconds to save the Bazel cache after a successful job, but on a clean cache it takes 2 minutes to fetch remote dependencies and 26 minutes to build all targets. I would be more than happy to pay those 33 seconds every time if it would save half an hour in the rest of the build!
For comparison, on Travis we achieve this by simply pointing to the Bazel cache directory: https://github.com/tensorflow/tensorboard/blob/1d1bd9a237fe23a3f2c31282ab44e7dfbcac717c/.travis.yml#L30-L32
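For reference, the closest approximation of this always-fetch/always-store pattern with the stock action is the `${{ github.sha }}` key plus prefix fallback discussed in this thread (the disk-cache path below is an assumption; configure it to match your `.bazelrc`):

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.cache/bazel   # assumed Bazel disk cache location
    key: bazel-${{ runner.os }}-${{ github.sha }}
    restore-keys: bazel-${{ runner.os }}-
```

This restores from the most recent cache on every run and stores a new one per commit, at the cost of the cache proliferation described above.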