Open wchargin opened 4 years ago
@wchargin this is an interesting topic thanks for bringing it up.
In this example, if we had a way to skip storing the cache unless the run was on master, you could use the git commit as part of your key and get the desired behavior without writing a new cache for each run of a pull request.
Do you think that would work for you?
@chrispat: Yeah, that sounds reasonable! At a glance, I don’t see a way to save the cache only if it’s running on master… but perhaps I could hack something together that restores the cache to its initial state at the end of the job for builds that aren’t running on master, just as a proof of concept to see how this strategy works.
If I understand correctly, we’d still be proliferating caches with each commit to master, right? I understand that cache eviction kicks in, but it still seems unfortunate, especially if I have to worry about other caches (e.g., `node_modules`) being evicted prematurely.
For the record, cabal's Nix-style store/cache also falls into this category; see my comment at https://github.com/actions/cache/pull/38#discussion_r343915940
@wchargin given that the version of the sources is part of the Bazel caching algorithm, what key do you think should be used to prevent a huge number of updates? My assumption is that Travis is uploading new caches essentially every build if they are just looking at changes to the cache directory.
Yes, Travis uploads new caches every build. And you’re right that this is a performance problem: Travis re-uploads the entire cache directory from scratch every build, which can take minutes. (Also, the build doesn’t report success until this upload has completed, and this upload can cause an otherwise successful build to time out and fail, which is super frustrating…)
We do want to update the cache on every build, but it should be cheap to perform a partial update of only the files that changed, `rsync`-style. The action cache will be updated on basically every commit, but is small (~500K). The fetch caches will be updated very rarely, and can be large (hundreds of MB). And the build cache for any given target will be updated whenever that target changes, but not if only unrelated targets change, and can be of varying sizes (typically fairly small, but there are lots of them).
I see that `actions/cache` currently tars and gzips everything into a single bundle, but it would be much more effective for caches in the style of Bazel/Nix/Cabal to support incremental updates, perhaps by using a content-addressable store like that of Git itself. What do you think?
For something like bazel I wonder if having a truly remote cache is actually a better option https://github.com/buchgr/bazel-remote. This is not something we are going to get around to implementing anytime soon but it is something we can consider for the future.
The model we have for caching enables the user to control the key and also requires that all caches are immutable by key. While that is not ideal for all scenarios, it does work generally well for a large number of different technology stacks and scenarios. This immutable nature makes incremental updates untenable and likely not possible. Even if we could incrementally update the cache, the download on the next run would still have to be the entire cache, as we have to provision a fresh VM for each job.
I believe I have a similar use case to the issue described here, and ideally would like to see an `update-cache` option added to the action, but I've worked around the issue by leveraging the `restore-keys` option.
A project of mine consists largely of C files, and naturally a significant portion of my CI cycle time is spent in compilation. To speed things up, I've employed ccache, which will opportunistically recycle previously built object files when it detects that the compilation would be the same for the current build. This has a dramatic performance improvement on CI times. In order to do this though, I need some persistence of storage between workflow runs in order to save and restore ccache's cache directory. Of course, as the code base evolves, the cache of object files will change too.
I was pleased to discover actions/cache, as it fits my use case very nicely; but, I was surprised to find that when a cache hit occurs, actions/cache will not attempt to update the cache at all, and there's not an option to request such update.
To work around this, I do the following:
```yaml
- name: Initialize Compiler Cache
  id: cache
  uses: actions/cache@v1
  with:
    path: /tmp/xqemu-ccache
    key: cache-${{ runner.os }}-${{ matrix.configuration }}-${{ github.sha }}
    restore-keys: cache-${{ runner.os }}-${{ matrix.configuration }}-
```
It works like this: when the cache is loaded for a workflow, there will be an initial cache miss, because the cache key contains the current commit SHA. actions/cache will fall back to the most recently added cache via the `restore-keys` prefix-matching policy, then, after the build has completed, create a new cache entry to satisfy the initial cache miss.
This solution seems to work very well for me, and hopefully it will be useful to others with a similar use case. Ideally, though, I think actions/cache should just support updating the cache, perhaps to a new immutable revision, as I have done above.
Having the caches be immutable makes a lot of sense. Immutable caches seem perfectly compatible with incremental updates—in fact, this is a strong point of Git. If your repository has 100 top-level directories each with 100 files, then you have 101 trees and 10000 blobs; if you change just one of those files, then you have 103 trees and 10001 blobs, not 202 trees and 20000 blobs. Does this make sense, or am I missing something?
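For what it's worth, the object arithmetic above can be demonstrated with a toy content-addressed store. This is only a sketch of the dedup idea, not Git's real object format:

```python
import hashlib

# Toy content-addressed store: hash -> object bytes.
store = {}

def put(data: bytes) -> str:
    # Storing the same content twice is free: it maps to the same key.
    key = hashlib.sha1(data).hexdigest()
    store[key] = data
    return key

def snapshot(repo: dict) -> str:
    """Store a repo {dirname: {filename: contents}}; return the root hash."""
    dir_hashes = []
    for d in sorted(repo):
        blob_hashes = [put(repo[d][f].encode()) for f in sorted(repo[d])]
        dir_hashes.append(put(("tree:" + ",".join(blob_hashes)).encode()))
    return put(("root:" + ",".join(dir_hashes)).encode())

# 100 top-level directories with 100 files each.
repo = {f"dir{i}": {f"file{j}": f"contents {i} {j}" for j in range(100)}
        for i in range(100)}
snapshot(repo)
print(len(store))  # 10101 objects: 10000 blobs + 100 directory trees + 1 root

repo["dir0"]["file0"] = "changed"
snapshot(repo)
print(len(store))  # 10104: only one new blob, one new tree, one new root
```

Taking a second snapshot after changing one file adds only three objects; everything unchanged is shared with the first snapshot.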
A truly remote cache is an appealing option, but comes with a lot more operational overhead for the user. Storing files is much easier than running a server.
Downloading the full latest cache on each run may not be perfect, but it’s still an improvement over rebuilding all the artifacts, faster by about 20 minutes in my case.
@wchargin I agree that immutability is acceptable on the condition that we can restore and create a new cache as I have described above (though it can be quite wasteful as you mentioned). My guess is that this particular use case will be desirable for many projects. Perhaps the documentation could simply be updated to demonstrate this type of use case? To me, it wasn't immediately obvious. My suggestion would be to mention using `${{ github.sha }}` in the key.
Right; immutability is space-wasteful if the caches are stored independently (which will happen if you use `${{ github.sha }}`), but not if they’re stored as part of one content-addressed store (which would require changes to the `actions/cache` implementation).
> A truly remote cache is an appealing option, but comes with a lot more operational overhead for the user. Storing files is much easier than running a server.
I was thinking we would run that server on behalf of the user, so the operational overhead should be essentially the same as it would be for the existing cache action. I am not 100% sure that is the best option, but it seems like it might be a really good one for build systems that support it.
> A truly remote cache is an appealing option, but comes with a lot more operational overhead for the user. Storing files is much easier than running a server.
>
> I was thinking we would run that server on behalf of the user so the operational overhead should be essentially the same as it would be for the existing cache action.
Oh, that would be fantastic! Being able to just point Bazel to a remote cache provided by a GitHub-managed action would be a huge value-add for us compared to other CI services.
Similar use case: `~/.cache/sccache` for sccache works like the Bazel cache. For now it's probably easier to point sccache at S3 or GCS to avoid the issues described above, and it would be nice if GitHub ran an sccache store as well.
Hi!
I have the same issue, I think, with the `composer` cache in PHP.
Composer saves every download it makes in its cache, and most of the time this cache is used globally. For example, on a developer machine the cache is growing all the time with new releases.
I've been using `github.sha` in my cache keys, since it allows me to re-save the cache and avoid the case where it hits the cache but new versions of dependencies exist, so it's always downloading them since they're not in the cache.
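A sketch of that pattern for Composer (the cache path here is an assumption; the real location comes from `composer config cache-dir`):

```yaml
- uses: actions/cache@v3
  with:
    path: ~/.cache/composer   # assumed path; check `composer config cache-dir`
    key: composer-${{ runner.os }}-${{ github.sha }}
    restore-keys: composer-${{ runner.os }}-
```

Every run misses on the exact key, restores the newest prefix match, and saves a fresh cache under the current commit's SHA.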
Specifically for Bazel: the cache protocol is pretty simple. I wonder if it would be feasible to write a service that simply proxies to GitHub's own `artifactcache` service and stand it up locally? Not sure how many cache keys GitHub allows.
Adding onto this with a different use case: I use `renv` with R for package management, and I would like the cache to get updated if the build succeeds. Since `renv` will update whatever packages need to be updated based on the `renv.lock` file, I'd like to simply update the cache every time that runs.
This is important because, for the project I'm working on, rebuilding all of the dependencies takes ~50 minutes, so fully invalidating the cache is really costly. It'd be great to be able to "update" the cache (by overwriting it) on successful runs.
Additionally, I just added a new (very minor) thing to be cached in a different location, and in order to start using it I have to rebuild the entire cache and pay the 50-minute penalty. The update-cache workflow here would be very helpful.
I would favor #135 (immutable slots, but the key is computed on the save step, so it can hash the resulting state) as a more efficient solution. The choice of key to designate an exact hit of the cache state is up to the developer: for build systems that don't have `.lock`-style files, it could be e.g. a hash of the `ls -lR` output of the build input directories.
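A rough sketch of that hashing idea with today's action (directory names are placeholders; note this computes the key before the build rather than at save time, which is exactly what #135 would improve):

```yaml
- id: inputs
  shell: bash
  run: echo "hash=$(ls -lR src | sha256sum | cut -d' ' -f1)" >> "$GITHUB_OUTPUT"
- uses: actions/cache@v3
  with:
    path: build/   # placeholder for the build output directory
    key: build-${{ runner.os }}-${{ steps.inputs.outputs.hash }}
    restore-keys: build-${{ runner.os }}-
```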
This issue is stale because it has been open for 365 days with no activity. Leave a comment to avoid closing this issue in 5 days.
Since this task is still open, what's the current best practice for Bazel caching + GitHub Actions? Does someone have a snippet of their GitHub workflow they can share?
Update: Just sharing the CI pipeline YAML with caching that we went with, hopefully it'll help the next person who lands on this task. (A slightly more permissive approach than @nanddalal's above.)
This issue is stale because it has been open for 200 days with no activity. Leave a comment to avoid closing this issue in 5 days.
I still think this is useful to have and not close
I think this will also help reduce the overall cost of compute resources on GitHub actions, as many open source projects can minimize the GitHub actions minutes they use for every run.
So... If you're willing to use an a/b system, you could probably do something like:

```yaml
- uses: actions/cache/restore@v3
  with:
    key: preferred
    restore-keys: fallback
- run: do-work
- if: no-cache
  uses: actions/cache/save@v3
  with:
    key: fallback
- if: no-cache
  uses: actions/cache/save@v3
  with:
    key: preferred
- if: used-preferred-cache
  uses: ./delete-cache
  with:
    key: fallback
- if: used-preferred-cache
  uses: actions/cache/save@v3
  with:
    key: fallback
- if: used-fallback-cache
  uses: actions/cache/save@v3
  with:
    key: preferred
- if: used-preferred-cache
  uses: ./delete-cache
  with:
    key: preferred
- if: used-preferred-cache
  uses: actions/cache/save@v3
  with:
    key: preferred
```
Notes:
- `used-fallback-cache`, `used-preferred-cache`, and `no-cache` aren't technical things, but actions/cache has outputs that you can use to construct the concepts
- depending on what the `with:`'s will be, it might be simpler just to have lots of steps than to try to be incredibly fancy about it
- if you turn `delete-cache` into an action, you could wrap the entire delete+save pattern into an action
- `./delete-cache` can be implemented using the APIs that were made available circa June 27, 2022: https://github.blog/changelog/2022-06-27-list-and-delete-caches-in-your-actions-workflows/
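A hedged sketch of what `./delete-cache` might look like as a composite action, using the cache-deletion REST endpoint from that changelog (token plumbing and error handling are assumptions, not a tested implementation):

```yaml
# ./delete-cache/action.yml (sketch)
name: delete-cache
inputs:
  key:
    required: true
runs:
  using: composite
  steps:
    - shell: bash
      env:
        GH_TOKEN: ${{ github.token }}
      run: |
        # Delete all cache entries matching the given key.
        curl -X DELETE \
          -H "Authorization: Bearer $GH_TOKEN" \
          -H "Accept: application/vnd.github+json" \
          "https://api.github.com/repos/${{ github.repository }}/actions/caches?key=${{ inputs.key }}"
```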
This issue is stale because it has been open for 200 days with no activity. Leave a comment to avoid closing this issue in 5 days.
Bots suck.
I think this can be closed as it's now released in v4
I don't see how v4 changes anything. Either it was already possible (and I think my suggestions and others show that there are ways to do something) or it might still not be possible.
If it's now possible as of v4, it'd be nice if someone put together an actual example of how to do it.
My bad. v4 has a `save-always` option. But this would be more like a `save-overwrite` option?
I mean, I'd probably just use an epoch time value with a fallback of none:
```yaml
key: cache-${{ steps.time.outputs.epoch }}
restore-keys: cache-
```
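Spelled out as full steps (the cache path is a placeholder):

```yaml
- id: time
  shell: bash
  run: echo "epoch=$(date +%s)" >> "$GITHUB_OUTPUT"
- uses: actions/cache@v4
  with:
    path: path/to/cache   # placeholder
    key: cache-${{ steps.time.outputs.epoch }}
    restore-keys: cache-
```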
That'd result in it always writing one. Older caches will get wiped out as they become least recently used. Sure, you pay a bit to store a duplicate of the cache (or you could use `actions/cache/restore` and `actions/cache/save` and only conditionally call `actions/cache/save` if you made any changes...), but, so what?
> but, so what?
That excess space usage causes other caches to get dropped too.
Then use `restore` & `save` separately and use an `if:` to only use `save` when you have changes.
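That split might look like the following (the path, build command, and `v4` tag are placeholders; `cache-hit` is an output of the restore step):

```yaml
- id: restore
  uses: actions/cache/restore@v4
  with:
    path: ~/.ccache        # placeholder cache directory
    key: ccache-${{ github.sha }}
    restore-keys: ccache-
- run: make                # placeholder build step
- if: steps.restore.outputs.cache-hit != 'true'
  uses: actions/cache/save@v4
  with:
    path: ~/.ccache
    key: ccache-${{ github.sha }}
```

A stricter condition could compare the cache contents before and after the build instead of just checking for an exact-key hit.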
If you're being really aggressive, you might be able to portion the cache into lots of pieces and have steps to calculate and retrieve/save them.
There will be a trade-off between how many steps you need to run and how big your cache pieces are.
This issue is stale because it has been open for 200 days with no activity. Leave a comment to avoid closing this issue in 5 days.
This is still an issue worth resolving
I’d like to use `actions/cache` to cache my Bazel build state, which includes dependencies that have been fetched, binaries and generated code that have been built, and results for tests that have run. Bazel is a hermetic build system, so the standard Bazel pattern is to always use a single cache. Bazel will take care of invalidation at a fine-grained level: if you only change one source file, it will only re-build and re-test targets that depend on that source file.
Thus, the pattern that makes sense to me for Bazel projects is to always fetch the cache and always store the cache. We can always fetch the cache by using a constant cache key, but then the cache will never be stored. Bazel doesn’t have a single `package-lock.json`-style file that can be used as a cache key; it’s the combination of all build and source files in the whole repository. We could use the Git tree (or commit) hash as a cache key, but this would lead to storing a mountain of caches, which seems wasteful.
Ideally, the fetched cache would be taken from `origin/master`, but really taking it from any recent commit should be fine, even if that commit was in a broken or failing state.
On my repository, it takes 33 seconds to save the Bazel cache after a successful job, but on a clean cache it takes 2 minutes to fetch remote dependencies and 26 minutes to build all targets. I would be more than happy to pay those 33 seconds every time if it would save half an hour in the rest of the build!
For comparison, on Travis we achieve this by simply pointing to the Bazel cache directory: https://github.com/tensorflow/tensorboard/blob/1d1bd9a237fe23a3f2c31282ab44e7dfbcac717c/.travis.yml#L30-L32
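For reference, the closest approximation of this always-fetch/always-store pattern with the stock action is the `${{ github.sha }}` key plus prefix fallback discussed in this thread (the disk-cache path below is an assumption; configure it to match your `.bazelrc`):

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.cache/bazel   # assumed Bazel disk cache location
    key: bazel-${{ runner.os }}-${{ github.sha }}
    restore-keys: bazel-${{ runner.os }}-
```

This restores from the most recent cache on every run and stores a new one per commit, at the cost of the cache proliferation described above.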