Swatinem / rust-cache

A GitHub Action that implements smart caching for rust/cargo projects
GNU Lesser General Public License v3.0
1.25k stars 102 forks source link

Cache own code with integrated cargo-sweep-like solution #43

Open Madoshakalaka opened 2 years ago

Madoshakalaka commented 2 years ago

cargo-sweep is a tool to be used around any build actions to timestamp and remove unvisited cache. It does so by checking the atime (access time) of the files in the target directory.

cargo sweep -s

<Insert any number of cargo build, cargo test etc...>

cargo sweep -f

With this method, incremental build can be left on, and build of own code can be cached for greater speed. Unfortunately, cargo-sweep doesn't publish any binary builds nor github actions.

37 mentions the benefits of not caching one's own code:

  1. it would only cache the first invocation when it generates a new cache key.
  2. with time the code will diverge more, causing longer builds.
  3. incremental needs a lot of disk space, and thus cache space and time to fetch & extract, lowering the benefits.
  1. is no longer an issue with the kind of smart diffing of cargo-sweep.
Madoshakalaka commented 2 years ago

Can we make this an option?

Swatinem commented 2 years ago

The first problem still remains, as you cannot update the cache, so it will still be ineffective. But a better solution for cleaning the cache (based on access times, or whatever) would be nice for sure. Iā€™m not quite up-to-date what the current way of using binary assets in actions is like, would need to look that up.

Madoshakalaka commented 2 years ago

how about using src code as part of the keys too? With the valina actions/cache it would be something like:

      - uses: actions/cache@v2
        with:
          path: |
           # ...
          key: cargo-${{ runner.os }}-${{ hashFiles('my-crate-directory/src/**') }}-${{ hashFiles('Cargo.lock') }}
          restore-keys: |
          # ... 

To make this work for workspaces, rust-cache can allow the users to specify the member crates to be hashed.

nipunn1313 commented 2 years ago

I use a workspace with many (~20) crates in a larger project. I'd certainly get benefit from having some of the lower level dependencies within our own crates cached. We use rust-cache to cache all our external dependencies, but there's still 2-3 minutes worth of build at the end to build our 20 crates.

I could imagine a solution to issue (1) by including say the timestamp of the week in the cache key - such that it's cleared every week (or day). I could also envision a solution where the src is part of the hash key, but it would be important for the restore-key to also be willing to try older ones, since src changes a lot.

It would be nice to be able to opt-in to caching ones own workspace code - for projects willing to make a different tradeoff than the default.

Something like cargo sweep would be a cherry on top, but for us, saving 2-3 minutes in the common case would be huge.

adam-azarchs commented 1 year ago

I think in this case the right answer is probably to have separate cache blobs for external vs. internal dependencies. External dependencies only change when (certain) changes are made to Cargo.toml or Cargo.lock, but internal dependencies change due to changes to source files (which keep in mind can absolutely include files other than .rs files because e.g. include_str! is a thing).

However there's a separate problem, which is that cargo will probably ignore the cache because it's idea of a "hash" of the source files includes the file modification timestamp, and that timestamp will be different every time you check out the code. This is a problem in general, but particularly for such internal dependencies. See https://github.com/rust-lang/cargo/issues/6529.

polarathene commented 7 months ago

For reference (presently only available in nightly): https://blog.rust-lang.org/2023/12/11/cargo-cache-cleaning.html

torokati44 commented 6 months ago

We use rust-cache to cache all our external dependencies, but there's still 2-3 minutes worth of build at the end to build our 20 crates.

+1

Something like cargo sweep would be a cherry on top, but for us, saving 2-3 minutes in the common case would be huge.

+1

The first problem still remains, as you cannot update the cache, so it will still be ineffective.

Always pushing a new cache, and generating/specifying a different (shorter) restore key, the latest cache with a matching prefix will be fetched.

In our (desired) model, workflow runs on master (after PRs got merged, or on a nightly schedule) would be the "source" of the cache - both for external dependencies; and for our own code (multiple large crates in the repo), with incremental compilation enabled - and the PR runs would be "consumers" of the cache, for speedy merge-ability checks.

polarathene commented 6 months ago

but there's still 2-3 minutes worth of build at the end to build our 20 crates.

Does that extra time not reproduce locally? (may be less on a local system that is more powerful than a runner)

Recently I was building Vector locally and exploring caching. Running cargo build again after the first build (without changing anything) still took quite a while. This was with 680+ crates compiled.

I still had 50s (with implicit default lto = false, aka crate local thin LTO?), 44s was due to the implicit LTO.

When building the full project (900+ crates, instead of my reduced feature set build):


You can also use linker-plugin-lto with Clang linker and Mold (actual linker via plugin) to also cache LTO (you need to ensure your clang and rust toolchain versions are compatible):

# .cargo/config.toml
# The `build` section applies for all targets:
[build]
rustflags = [
  # Set a linker like clang (affects link-arg compatibility):
  "-C", "linker=clang",
  # This one is required to enable the LTO caching option:
  "-C", "linker-plugin-lto",
  # This is important for linker-plugin-lto to work (note: `-flto=thin` / `-flto=full` have no effect here?):
  "-C", "link-arg=-flto",
  # By default this value is 0, which is 1 thread per physical core.
  # Use `all` to get 1 thread per logical core to use full CPU like rust normally would do:
  "-C", "link-arg=-flto-jobs=all",
  "-C", "link-arg=-Wl,--thinlto-cache-dir=lto-cache",
  # Actual linker to use:
  "-C", "link-arg=-fuse-ld=mold"
]

Some of the link-arg vary based on configured linker (-C link-arg=-fuse-ld=...), for mold the link-arg options match those documented for lld. Worth keeping in mind if you use something else instead of mold (lld / ld), or don't use clang (uses gcc then AFAIK which also has different options).

This adds a little overhead, but with the cache available, you get faster rebuilds.

Over 70% reduction! šŸŽ‰


Give it a try, perhaps it'll make a difference for you? :)

However there's a separate problem, which is that cargo will probably ignore the cache because it's idea of a "hash" of the source files includes the file modification timestamp, and that timestamp will be different every time you check out the code.

This concern has some workarounds?:

Earthly has a feature to optionally preserve timestamps when copying from host to image, noting it's important for incremental = true cache to work correctly, and the linked concern for git-restore-mtime may not apply to Github Actions, so it may be a valid workaround šŸ˜…


Unfortunately, cargo-sweep doesn't publish any binary builds nor github actions.

cargo binstall can work?:

# Grab and extract a static release of cargo-binstall:
curl -sSfL "https://github.com/cargo-bins/cargo-binstall/releases/latest/download/cargo-binstall-$(uname -m)-unknown-linux-musl.tgz" | tar -xz

# Install cargo-binstall into a proper `cargo binstall` command if useful afterwards, and install cargo-sweep:
# These will be installed to the default $CARGO_HOME (eg: /usr/local/cargo/bin/)
# If that location is part of the PATH env, the installed binaries will still be usable if $CARGO_HOME changes.
./cargo-binstall --quiet --no-confirm cargo-binstall cargo-sweep
rm ./cargo-binstall
NobodyXu commented 6 months ago

cargo binstall can work?:

You could use taiki-e/install-action, which automatically fallback to cargo-binstall for crates if doesn't natively support.

It would install cargo-binstall to $CARGO_HOME/bin so that it can be reused by taiki-e/install-action.

NobodyXu commented 6 months ago

I have totally no idea that lto cache exists, would be great if zig-cc supports it since we use it.

Wonder if rustc/cargo would expose such option.