bazel-contrib / setup-bazel

GitHub Action to configure Bazel
MIT License

Does this action save Bazel's built output artifacts to a cache? #18

Open seh opened 1 month ago

seh commented 1 month ago

For several years I've seen advice for caching Bazel's built artifacts in GitHub Actions by using actions/cache and including ~/.cache/bazel as one of the captured directory paths. With the right cache key and a hierarchical set of "restore keys", we can coax Bazel into reusing much of what it has already built or tested, avoiding re-running actions in subsequent workflow runs when the action inputs haven't changed.
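In concrete terms, that advice usually boils down to a step roughly like this (the key layout here is just illustrative):

- uses: actions/cache@v4
  with:
    path: ~/.cache/bazel
    # Illustrative keys: prefer an exact match, otherwise fall back to
    # the most recent cache sharing the same prefix.
    key: bazel-${{ runner.os }}-${{ hashFiles('**/BUILD.bazel', '**/BUILD') }}
    restore-keys: |
      bazel-${{ runner.os }}-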

Using this setup-bazel action, I see that we can enable a repository cache and a disk cache, and the action takes care of setting Bazel's output base directory, but it doesn't appear to save the built artifacts to a cache. My reading of the disk cache documentation suggests that it includes "build artifacts", but even when I enable setup-bazel's use of the disk cache and I see my GitHub Actions workflow run restore the disk cache successfully, it still appears that Bazel winds up running many actions for which I expected to find the outputs already available in the cache.

Do I need to use actions/cache separately to cache more of these action outputs, or should the disk cache configured by setup-bazel already take care of that?

p0deje commented 1 month ago

No, there should be no need to use actions/cache if you use setup-bazel. Once you enable disk-cache, setup-bazel should save your build outputs in the cache so you don't have to worry about re-building everything. If for some reason it doesn't work as you'd expect, please provide more details; maybe there is a bug or a misconfiguration on your side.
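For reference, enabling it looks like this:

- uses: bazel-contrib/setup-bazel@0.8.1
  with:
    disk-cache: true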

A common problem would be using the same disk cache for different jobs/workflows, so that the last one to run overwrites the disk cache of the others. This can be easily solved via https://github.com/bazel-contrib/setup-bazel#separate-disk-caches-between-workflows.
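As that section describes, you give each workflow its own cache name, something like:

- uses: bazel-contrib/setup-bazel@0.8.1
  with:
    # use the workflow name so each workflow gets its own disk cache
    disk-cache: ${{ github.workflow }}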

seh commented 1 month ago

Thank you for explaining that. Now I think I was misinterpreting the problem.

As I understand the current design, the cache name suffix for the disk cache is the hash of all of my BUILD and BUILD.bazel files. If my BUILD.bazel files settle down but I keep changing, say, my Go files in each new set of Git commits pushed to my pull request's branch, then setup-bazel finds an existing cache for this set of BUILD.bazel files, restores it, and doesn't save a new cache reflecting the artifacts built from the current Go files. If that's true, then I'll keep starting out with the same restored cache even as my source files drift away from the state they were in when we last changed our BUILD.bazel files.

I recognize that the alternate approach of saving a new cache for each distinct Git commit is also expensive: it takes a very long time to save each cache, and you grind through your cache quota quickly, evicting older and still-useful caches that serve different purposes.

Is there an approach between these positions that you have considered? Do you see the current approach as having the liability I've described here, or am I using it incorrectly and suffering unduly?

p0deje commented 1 month ago

I am honestly not sure what the best approach would be here. We could potentially list out the tree of the disk cache and upload individual entries (or folders):

$ tree /Users/p0deje/.cache/bazel-disk
/Users/p0deje/.cache/bazel-disk/cas/
├── 00
│   ├── 000020e27f52efd462f08d678435bd2825371906794c804d4f38c8d0d6db7506
│   ├── 0000dce2b2b50b507467ebef705ac2962cb0612d13ffb2a2bd8c8563ffc4594a
│   ├── 000264bb97b837f0da9f3f2c9e89138ddb5857a3c5cafe2c2c3249805306a98d
│   ├── 00033deeb0323f9f6490f75dadbeeec141114710b029e6e175b1c1e146f2618d
│   ├── 0003aa49d5eccb6c2e50b5017f5ac9e82c2335f7feaa202b54b3835f9bc3ae89
│   ├── 0005b569d8449cdae01c26171d5e6bb6fbdc9cd1af7bb575bae6ad8bd4e385d7
│   ├── 00083d81014fe8f6cab9fab6f126983e8ad0466784eed2204a71b3a0a2d37807
│   ├── 00086d30431bf78669d822ab30f718620afb75c25cea798e241adbb38ec175fb
...

However, it would probably be too network-heavy. It's also not clear how such a cache should be pruned to keep it from growing without bound.

Another approach would be to force-overwrite the cache even when it's been hit. That could prove useful when running on the main branch, since PRs against main could fetch the cache without saving it:

- uses: bazel-contrib/setup-bazel@0.8.1
  with:
    disk-cache: true
    disk-cache-always-save: ${{ github.ref == 'refs/heads/main' }}
seh commented 1 month ago

An approach that I used with a previous project overlapped with these ideas.

On topic branches for PRs, I used actions/cache with the following cache key: my-name-${{ runner.os }}-${{ github.ref }}-${{ github.sha }}

That creates a new cache for every distinct Git commit at the head of the topic branch. For my "restore keys", I used a cascading sequence back down to the cache that might have been saved against the repository's base branch ("main"), roughly:
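# Most recent cache from this same branch first, then any cache this
# branch is allowed to see (including the base branch's).
restore-keys: |
  my-name-${{ runner.os }}-${{ github.ref }}-
  my-name-${{ runner.os }}-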

That is, if the head commit on the topic branch changed, look for the most recent cache from this same branch. Failing that, look for the most recent cache from the base branch.

Now, I had a separate GitHub Actions workflow that ran on pushes to the "main" branch, that is, whenever we'd merge a PR against it. That workflow also used actions/cache and built all the same artifacts as the aforementioned PR workflow. Its cache key was my-name-${{ runner.os }}-${{ github.sha }}. Its lone "restore key" was my-name-${{ runner.os }}-. Since GitHub only allows restoring caches from the same branch or a base branch, creating these caches on the "main" branch prepared them to supply later PRs.
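As an actions/cache step, that looked roughly like this (the action version and path here are illustrative):

- uses: actions/cache@v4
  with:
    path: ~/.cache/bazel
    key: my-name-${{ runner.os }}-${{ github.sha }}
    restore-keys: |
      my-name-${{ runner.os }}-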

Between the PR-focused workflow and this push-to-"main" workflow, these caches worked nicely to supply mostly fresh Bazel action outputs and built artifacts to subsequent workflow runs. They came with a few liabilities, though:

p0deje commented 1 month ago

Thank you for explaining your setup. I believe these are all workarounds for a fundamental limitation of the current cache implementation: it's coarse-grained. The best solution to this problem I can think of is to implement a small HTTP server, started by the action, that can serve as a Bazel remote cache (https://bazel.build/remote/caching#http-caching) and is internally backed by the GitHub Actions cache. This would allow storing fine-grained caches in GHA and also avoid over-downloading. A similar approach is taken by buildkit (https://github.com/moby/buildkit/blob/master/cache/remotecache/gha/gha.go), which translates the Docker build cache into the GHA cache.

Unfortunately, I don't have time to work on this at the moment, but if anyone is up to implementing this, I would be happy to expand on how it might work.

For now, we can work around the fundamental limitation by saving the cache only on the main branch, or by allowing the cache-key and restore-keys to be customized to support the scenarios @seh described.
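For instance, something along these lines (these input names are hypothetical; nothing like this exists yet):

- uses: bazel-contrib/setup-bazel@0.8.1
  with:
    disk-cache: true
    # hypothetical inputs, not currently implemented
    disk-cache-key: bazel-disk-${{ github.ref }}-${{ github.sha }}
    disk-cache-restore-keys: |
      bazel-disk-${{ github.ref }}-
      bazel-disk-refs/heads/main-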

p0deje commented 1 month ago

I had some time to prototype the idea of running an HTTP server, as part of the action, that is compatible with Bazel remote caching. The server does not store anything itself; it simply translates Bazel remote caching REST API calls into @actions/cache API calls, essentially delegating the storing and retrieving of cache files to the GitHub Actions cache.

In simple examples it worked great and created hundreds of cache entries matching Bazel outputs. Upon re-running the build, I could see that the remote cache was being used for builds and tests.

However, when I started testing more complex scenarios with thousands of cache entries, I predictably ran into rate limiting from the GitHub Actions cache API, plus what seemed to be unhandled rate limits coming directly from Microsoft Azure Storage. I could still see caches created and eventually available, but the builds failed with "Missing digest for cache" errors.


Unfortunately, I don't have time to dig further into this and build something robust enough for general use. I also had issues with the remote cache on Windows that I'm not sure about: whether the errors are in Bazel itself or in my implementation.

The proof-of-concept can be seen in #21. If anyone wants to pick it up and continue working, I'll be happy to collaborate.

bentekkie commented 1 week ago

Would a PR to implement a disk-cache-always-save option like the one below be accepted?

- uses: bazel-contrib/setup-bazel@0.8.1
  with:
    disk-cache: true
    disk-cache-always-save: ${{ github.ref == 'refs/heads/main' }}
p0deje commented 1 week ago

@bentekkie Yes, but let's make sure the API is right. We need a way to disable uploading caches from PRs; isn't that what we want?

- uses: bazel-contrib/setup-bazel@0.8.1
  with:
    disk-cache: true
    disk-cache-save: ${{ github.ref == 'refs/heads/main' }}
bentekkie commented 1 week ago

Shouldn't that be a decision left to users? With an option like this, a user can use a condition like the one in the example here to save only on main, but they could also change that condition if they want to restrict based on other variables.

p0deje commented 1 week ago

@bentekkie Yes, I'm just saying that it should be called disk-cache-save rather than disk-cache-always-save. The latter suggests the cache might still be saved even when the condition is false.

bentekkie commented 1 week ago

Ah, I missed that. Sounds good; I'll try to send a PR for this.