actions-rs / meta

šŸ¦€ GitHub Actions for Rust - recipes, discussions, questions and ideas
https://github.com/actions-rs

Caching tracking issue #21

Open svartalf opened 4 years ago

svartalf commented 4 years ago

There are two things that could be improved by adding caching:

  1. Cache target/ and Cargo folders (see an example)
  2. Cache the compiled binary from the cargo install {pkgid} command
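
A minimal sketch of item 2 with today's building blocks, assuming actions/cache@v1 and using grcov as a stand-in for {pkgid} (both are illustrative, not part of the original issue):

    - name: cache installed grcov
      uses: actions/cache@v1
      with:
        path: ~/cargo-tools
        key: cargo-install-grcov-${{ runner.os }}
    - name: install grcov unless restored from cache
      # --root keeps the binary out of ~/.cargo/bin, which also holds rustup shims
      run: test -x ~/cargo-tools/bin/grcov || cargo install grcov --root ~/cargo-tools
    - name: add cached tools to PATH
      run: echo "::add-path::$HOME/cargo-tools/bin"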

Right now this affects the following workflows:

This issue can be solved by a proper implementation of the actions-rs/core/Cargo.installCached function (see https://github.com/actions-rs/core/issues/31), but that is blocked by https://github.com/actions/cache/issues/55

For now, this issue acts only as a tracking issue for a possible proper caching implementation.

CAD97 commented 4 years ago

In rust-analyzer, we remove artifacts for workspace-local crates before the cache is stored. Unlike with auto-keyed caching (e.g. on Travis), this isn't required to reduce cache churn (a matching key will not re-store the cache), but it still reduces cache size.

If actions-rs ever provides a "best practice" caching wrapper, it should probably allow scrubbing these files, so only dependencies' artifacts are cached. (Specifically, scrubbing files directly in ./target/{debug,release}/ as well as folders/files directly in ./target/{debug,release}/{.fingerprint,deps,incremental}/ that contain the names of workspace-local libs/bins.)
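
As a rough sketch of that scrubbing (my_crate stands in for a workspace-local crate name; this is illustrative, not rust-analyzer's actual CI):

    - name: scrub workspace-local artifacts before caching
      run: |
        # keep dependency artifacts; remove only the workspace's own outputs
        rm -f ./target/debug/my_crate ./target/debug/libmy_crate.*
        rm -rf ./target/debug/.fingerprint/my_crate-*
        rm -rf ./target/debug/deps/my_crate-* ./target/debug/deps/libmy_crate-*
        rm -rf ./target/debug/incremental/my_crate-*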

svartalf commented 4 years ago

@CAD97 noted, thank you! I do want to create some kind of actions-rs/cache action, which would cache all these things automatically, but as this issue notes, there is no programmatic access to the cache routines right now.

anp commented 4 years ago

By lucky timing, a few hours before I came across this issue someone posted a possible workaround to the upstream issue: https://github.com/actions/cache/issues/55#issuecomment-612575988.

Also worth noting that, in addition to the target/ directory examples, I currently have these cache keys set up in my jobs:

    - name: cache cargo index
      uses: actions/cache@v1
      with:
        path: ~/.cargo/git
        key: cargo-index-${{ hashFiles('**/Cargo.lock') }}
        restore-keys: |
          cargo-index-
    - name: cache cargo registry
      uses: actions/cache@v1
      with:
        path: ~/.cargo/registry
        key: cargo-registry-${{ hashFiles('**/Cargo.lock') }}
        restore-keys: |
          cargo-registry-
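
And a target/ step in the same style might look like this (a sketch; the key scheme is illustrative, not from my actual jobs):

    - name: cache target directory
      uses: actions/cache@v1
      with:
        path: target
        key: cargo-build-${{ runner.os }}-${{ hashFiles('**/Cargo.lock') }}
        restore-keys: |
          cargo-build-${{ runner.os }}-
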
thomaseizinger commented 4 years ago

There is an initial version of @actions/cache available now: https://github.com/actions/toolkit/tree/master/packages/cache

mzabaluev commented 4 years ago

One problem with using a hash of Cargo.lock in the cache key is that for library crates, checking Cargo.lock into version control is discouraged. So the repository checkout may not contain enough data for the cache action to score a hit with exactly the right dependencies for the build, before cargo fetch or cargo update generates Cargo.lock locally. But those commands also fetch the dependencies, so they too would benefit from a hopefully-close snapshot of ~/.cargo/registry already restored from the cache.

I have developed a workflow solution that keeps a Cargo.lock checked in and automatically updated in a special git branch. In every workflow run that uses cargo to build, that Cargo.lock is grafted on top of the commit that triggered the build, and its hash is then used as the last part of the cache key for restoring ~/.cargo/registry. The hope is that a lockfile generated from the latest master tree often suits the build of a somewhat divergent revision with no new dependency crates to fetch. This is probably too much complexity for a single GitHub action to hide, though.
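
The grafting step might look roughly like this (a sketch of the idea as described; the cargo-lock branch name is an assumption, not the actual workflow):

    - name: graft Cargo.lock from the lockfile branch
      # "cargo-lock" is a hypothetical CI-maintained branch holding the lockfile
      run: |
        git fetch origin cargo-lock
        git checkout origin/cargo-lock -- Cargo.lock
    - name: restore cargo registry
      uses: actions/cache@v1
      with:
        path: ~/.cargo/registry
        key: cargo-registry-${{ hashFiles('Cargo.lock') }}
        restore-keys: |
          cargo-registry-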

CAD97 commented 4 years ago

> for library crates, checking Cargo.lock into version control is discouraged.

This is true, but I both do and don't understand why it's the case. If you're caching the build anyway, you're (probably) going to be caching a Cargo.lock anyway. Committing a Cargo.lock is good for libraries, imho, for exactly the same reasons it is good for binaries.

> So the repository checkout may not contain enough data for the cache action to score a hit with exactly the right dependencies for the build, before cargo fetch or cargo update generates Cargo.lock locally. But those commands also fetch the dependencies, so they too would benefit from a hopefully-close snapshot of ~/.cargo/registry already restored from the cache.

You probably want cargo generate-lockfile then. Also, general knowledge that I've absorbed is that .cargo/registry isn't worth caching, because 1) it costs a "similar" amount to download that file from the cache or to download it via cargo, and 2) cargo will update an outdated registry anyway for most commands unless you use --offline.


So I think there are two useful "simple enough" behaviors here:

And the idealized version of the second case, which I don't think is possible (now or in the near future):

Personally, I think checking in lockfiles is the better plan, even for libraries, because it makes CI builds more deterministic and less likely to break because of the outside world. I never want to be in a situation where CI passed on the head, but a PR fails through no fault of its own, just because a library updated in a minorly breaking way. Rather, I think the better model is to put Cargo.lock into source control and have dependabot (or a similar bot) PR, and bors r+, any lockfile updates as libraries release new versions.

mzabaluev commented 4 years ago

> You probably want cargo generate-lockfile then.

That command needs at least ~/.cargo/registry/index to be up to date, so it too would benefit from a cache restore. However, that could well be a separate smaller cache, because the primary key of the index state should be the head commit in the rust-lang/crates.io-index GitHub repo, not the Cargo.lock of any particular project.
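
A sketch of keying an index cache that way (the ls-remote step is my illustration, not from the comment; ::set-output was the contemporary way to pass step outputs):

    - name: get crates.io-index head commit
      id: index
      run: echo "::set-output name=rev::$(git ls-remote https://github.com/rust-lang/crates.io-index HEAD | cut -f1)"
    - name: cache registry index
      uses: actions/cache@v1
      with:
        path: ~/.cargo/registry/index
        key: cargo-index-${{ steps.index.outputs.rev }}
        restore-keys: |
          cargo-index-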

> Also, general knowledge that I've absorbed is that .cargo/registry isn't worth caching, because 1) it costs a "similar" amount to download that file from the cache or to download it via cargo,

A full restoration, including the repository index, takes a long time and is getting longer as the index grows. Another motivation for caching is to avoid unnecessarily taxing the crates.io infrastructure.

> 2) cargo will update an outdated registry anyway for most commands unless you use --offline.

With a close-enough state of the registry restored from the cache, there is not much to synchronize. A job in the workflow can perform cargo fetch or cargo update and store the perfect cache state for other jobs to pick up; those jobs can then run cargo with --frozen --offline.

> • Don't check in Cargo.lock, and always run CI against the latest versions of dependencies. Run cargo generate-lockfile first, then use the generated lockfile's hash as part of the cache key, restoring from the head's cache, but only persisting the new or reused partial compilation artifacts, so there are no unused artifacts bloating the cache.

One current problem with this is the lack of an easy way to prune the outdated dependencies. There is a cargo-cache command where this could be implemented: https://github.com/matthiaskrgr/cargo-cache/issues/76.

> Personally, I think checking in lockfiles is the better plan, even for libraries, because it makes CI builds more deterministic and less likely to break because of the outside world.

It may be good for reproducible CI builds, but for regular developer working copies of library code, a version-controlled Cargo.lock just gets in the way. There is no single golden state of the build for a library; everyone may want to build and test it locally against their own choice of dependency versions, and keep that preference locked in Cargo.lock while they are working on it.

Checking the baseline lockfile into a CI-managed branch away from the branch(es) that get checked out by developers could be a way to get the best of both worlds.

mzabaluev commented 4 years ago

For reference, here's a workflow using two stages of caching. First, a job runs cargo generate-lockfile and updates the crate cache keyed by the hash of Cargo.lock. Then a matrix of test jobs uses the cached dependency sources on all three supported OSes. The cargo registry filesystem, git checkouts, and crate sources archived on Linux seem to work fine when restored on macOS or Windows.
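
In outline, the two stages look roughly like this (a sketch; job names and details are illustrative, and the linked workflow is the authoritative version):

    jobs:
      update-deps:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          # generating the lockfile clones the index (~5 s, see below)
          - run: cargo generate-lockfile
          - uses: actions/cache@v1
            with:
              path: ~/.cargo/registry
              key: cargo-registry-${{ hashFiles('Cargo.lock') }}
          # populate the cache with every dependency the lockfile names
          - run: cargo fetch
      test:
        needs: update-deps
        strategy:
          matrix:
            os: [ubuntu-latest, macos-latest, windows-latest]
        runs-on: ${{ matrix.os }}
        steps:
          - uses: actions/checkout@v2
          # regenerate the lockfile so its hash can address the cache
          - run: cargo generate-lockfile
          - uses: actions/cache@v1
            with:
              path: ~/.cargo/registry
              key: cargo-registry-${{ hashFiles('Cargo.lock') }}
          - run: cargo test --frozen --offline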

Some follow-up on my earlier comments:

> A full restoration, including the repository index, takes a long time and is getting longer as the index grows.

cargo generate-lockfile making a full clone of the crates.io index repository takes about 5 s in this workflow, which is comparable to the time it takes to store it in the cache. Restoration time is about 2 s.

> Another motivation for caching is to avoid unnecessarily taxing the crates.io infrastructure.

The crates.io index is hosted on GitHub, so access to it from GitHub CI should be fast and cheap (free?) and may use any backplane optimization magic invented by GitHub. The CDN hosting the crates is currently on AWS, though.

mzabaluev commented 4 years ago

More observations on windows-latest:

anp commented 4 years ago

Looks like the npm package for making use of the cache actions from other actions has released: https://github.blog/changelog/2020-05-27-github-actions-v2-cache-actions/.

Swatinem commented 3 years ago

šŸ‘‹ Reaching out here, since I have created https://github.com/Swatinem/rust-cache and we have been successfully test-driving it for a few Sentry projects. https://github.com/Swatinem/rust-cache/issues/3 was asking about maybe moving it into actions-rs.

So far it's very opinionated with regard to how I think caching should ideally work, but I'm open to ideas.

CAD97 commented 3 years ago

For tools specifically, https://github.com/actions/toolkit/tree/main/packages/tool-cache exists now and seems useful.