actions / cache

Cache dependencies and build outputs in GitHub Actions
MIT License
4.5k stars 1.2k forks source link

Can't store using hashFiles if files don't exist yet #106

Closed rye closed 2 years ago

rye commented 4 years ago

For example, in a Rust project, the target directory is almost never committed as it includes build artifacts. I'd like to cache it between builds, but using

name: Cache target dir
uses: actions/cache@v1
with:
  path: target
  key: ${{ runner.os }}-cargo-target-${{ hashFiles('target') }} # also tried variants of `target/**`...

triggers a failure that prevents the build from continuing:

##[error]The template is not valid. hashFiles('/home/runner/work/.../.../target') failed. Search pattern '/home/runner/work/.../.../target' doesn't match any file under '/home/runner/work/.../...'

which does make sense! After checking out, there is not yet a target directory. In fact, that's what I aim to create (maybe) when I restore from the cache. So storing would work with this template, but restoring would not, since the key can't be computed.

If I supply restore-keys, though, I still get the same error. It seems to me like something (maybe not in this codebase) is just dying early because the template isn't fully computable, instead of trying to fall back on the restore-keys first.

rye commented 4 years ago

So my desired behavior looks like this:

In short, this is what a potential

I hope that helps clarify what I'm looking for.

dhadka commented 4 years ago

Hi @rye, in the Rust-Cargo example configuration, it caches target using a hash of the Cargo.lock file(s):

- name: Cache cargo build
  uses: actions/cache@v1
  with:
    path: target
    key: ${{ runner.os }}-cargo-build-target-${{ hashFiles('**/Cargo.lock') }}

Would this handle your use case? This would cache target and get updated whenever the project's dependencies change. If not, could you please clarify why? Thanks!

rye commented 4 years ago

Hi @dhadka, thanks for getting back to me on this.

First, I'll note that Cargo.lock is not the only determiner of the contents of the target directory. Different nightly Rust compiler versions could result in different build intermediates, too, if something changes internally—I'd rather let cargo determine if it needs to change anything from a cached target dir than rely on Cargo.lock to describe the state of everything.

But my main reason for wanting this change is that I use other tools that are outside of my dependency manifest, for example cargo-tarpaulin, which takes 10 minutes to build for each from-scratch installation. To compile, I add env CARGO_TARGET_DIR=target to the invocation of cargo install, which makes it share the target directory. Compilation, it turns out, is the most expensive part of using such tools, which is the main reason why I want to cache them. I certainly could add cargo-tarpaulin to my dev-dependencies, but given that it's a standalone tool I'd rather keep it outside of my manifest.

I hope that helps explain why I feel a change is needed! Thanks again.

chrispat commented 4 years ago

I think the core of this discussion really boils down to helping the user understand they have specified a pattern in hashFiles that does not match anything or not. In the current model if I typeo something and no files are found then the user will know that and can try to debug. If we move to the model where the files don't need to exist then the user will have no idea until they get a few runs in and realize they are not getting updated caches.

rye commented 4 years ago

I do agree that the current error form makes things stricter, but I don't think the strictness is helpful—a hashFiles error (due to files not existing) seems to be causing a parsing error, i.e. restore_keys is not even used at all; the cache restore just completely fails. My use case would be satisfied if restore_keys was used in such cases.

Of course, I could leave off the hashFiles in my key, but then I would basically have a persistent cache that never got updated if things changed.

chrispat commented 4 years ago

@rye perhaps we could update hashFiles with an additional parameter that you could set that would have it not fail if it didn't find any files. I suppose the other option is to simply warn rather than fail, however, that would mean in your case that you would perpetually have a warning on your jobs that you could likely never get rid of.

rye commented 4 years ago

@chrispat yeah, that first idea sounds like a good solution to me, if that works best.

I am still wondering if it is possible to just handle the case where hashed files don't exist by falling back on restore-keys (and possibly proceeding to fail if those keys aren't specified or can't be restored)? I'd be happy to make those changes, but I don't know where I could change that behavior.

What this would mean for the user is:

uses: actions/cache@v1
with:
  path: target
  key: cargo-deps-${{ hashFiles('target') }}
  restore-keys:
    - cargo-deps-

In the event that hashFiles cannot find its parameter, it tries to find something in restore-keys that does exist. Here it finds the most recent cache, cargo-deps-abcdef, and restores it, proceeding to continue through the build. If the path does not change from its restored value, cargo-deps-abcdef, (i.e. at the end of the build, if the key is recomputed as cargo-deps- abcdef) then nothing has changed and we do not need to store again. However, if something in the specified path changed, then a new cache is produced, cargo-deps-012345, which would then be restored later.

In short, I would still expect a failure of hashFiles to cause the cache restoration to try to fall back on whatever is specified in restore_keys—if that change is made, no additional parameters would be needed, and the existing API would remain exactly the same. Indeed, I don't think anything relating to caching should stop a build from continuing by default, but that is configurable through a parameter.

douglascayers commented 4 years ago

+1 for @rye's idea.

Regarding @chrispat comment,

I think the core of this discussion really boils down to helping the user understand they have specified a pattern in hashFiles that does not match anything or not.

I vote for giving the developer the benefit of the doubt that they are intelligent and chose the correct file path, even if that file path might not yet exist in their repo but would eventually show up in their build process.

Regarding the argument that hashFiles should fail if it doesn't find at least one file in the project I think could be argued against by saying it doesn't guard against finding more files than the developer anticipated. If the hashFiles function allows for unbounded number of files to be found, zero matches seems reasonable to me too.

If the overall build workflow isn't working as the user anticipated then the developer would go debug it anyways. A message in the log at runtime by the cache action to say it found a hit or not should suffice for the developer to figure out what's going on to realize they have a typo or the file(s) they expected to be there never showed up, etc.

Thanks

samtstern commented 4 years ago

Just want to +1 this feature, my CI process downloads a .jar file that I only want to download once, so I have this:

    env:
      FIREBASE_EMULATORS_PATH: ${{ github.workspace }}/emulator-cache

// ...

    - name: 'Cache Emulators'
      uses: actions/cache@v1
      with:
        path: ${{ env.FIREBASE_EMULATORS_PATH }}
        key: ${{ runner.os }}-firebase-emulators-${{ hashFiles('emulator-cache/**') }}
      continue-on-error: true

This doesn't work because there's nothing in the emulator-cache directory when I start so hashFiles() fails. I could probably work around this by creating a dummy file there in a previous run step but I would really prefer not to.

chrispat commented 4 years ago

We have been looking at this problem from a few different angles and while it would be easy to make hashFiles just return an empty string if it does not find any files that won't fully solve the probelm. The action stores the key that is generated when it is run and uses that to save the cache at the end so in this case the key would be generated with no files and even if those files existed at the end of the run they would not be considered as part of the key.

I think the only way to make this work well would be to have independent restore and save actions which would lead to even more configuration.

rye commented 4 years ago

The action stores the key that is generated when it is run and uses that to save the cache at the end [...]

Yeah, I think that's definitely a big part of the problem. The other part is that hashFiles will fail the build (inelegantly) if it can't find the files. Something about both of these parts of the runner's behavior would need to change to extend the existing syntax so that the key can be recomputed for comparison at the end of the build, which wouldn't necessarily require breaking backwards compatibility or adding any configuration.

Indeed, having independent save and restore actions would be an alternative to making hashFiles handle more situations; for what it's worth, at least one other major provider that I'm aware of (CircleCI) has save_cache and restore_cache steps which are used to interact with their caching system, which I do think is nice because it lets you control when your storage is performed.

CryZe commented 4 years ago

The way this action currently works seems really suboptimal. It only works if your Cargo.lock basically never changes. But having a Cargo.lock is advised against for libraries as you want to test your library against a wide range of semver compatible versions as this is how your library is going to be used in the end. So most Rust projects don't have a Cargo.lock for that reason.

You could generate one on the fly and use that as the hash, but the chances that nothing changed are very slim. So in the end you likely don't ever get a cache hit and end up recompiling everything from scratch.

So what we ended up doing is scrapping the whole hashing mechanism and just using the number of the week in the year as the cache key, as that keeps a reasonably recent cache around and overwrites it with a new one every week (actually every second week). This is so far the best workaround I've seen for Rust projects (or similar) where you don't really have a lockfile.

mzabaluev commented 4 years ago

135 is a proposal aiming to solve this.

mzabaluev commented 4 years ago

@CryZe with proximate caching, one problem is that the cargo registry gets bloated over time as old crate archives get dragged from old cached states to newer ones.

This example demonstrates how to bootstrap Cargo.lock from a smaller cache of the registry index that does usually get restored from secondary matches because the remote changes so often. Depending on how often the dependencies get refreshed, this lockfile may be good for meeting a match in the main cache for some time.

mzabaluev commented 4 years ago

I have proposed new examples to examples.md in #325 that show how to bootstrap Cargo.lock in a library project. This would not solve the wider issue that is discussed here, though.

mrmckeb commented 3 years ago

I've hit this issue today, and perhaps have a clearer use-case for the same feature.

With a very large project, ESLint can perform slowly. The ESLint team have added the ability to create a cache file, which brings a massive performance improvement.

Many projects (including those built with Create React App) store that file in node_modules/.cache), so it's never committed - and that makes sense, as it's a cache. For our CI though, that means it will always run without a cache, as we can't cache the file after it's created.

I don't know that this means this action would need separate save/restore actions, perhaps the action just needs to be able to recalculate the key (optionally) before saving?

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 200 days with no activity. Leave a comment to avoid closing this issue in 5 days.

github-actions[bot] commented 2 years ago

This issue was closed because it has been inactive for 5 days since being marked as stale.

menny commented 2 years ago

A few days ago I was hit with this issue. The issue only happens in the post handler of the action. The files were still there (I did a ls and find to verify), so I assumed that it is due to some kind of permissions issue - I'm running my checks inside a docker container and it uses root as user. Anyhow, I chmod the entire repo at the last step of all the checks, and it fixed my issue: https://github.com/AnySoftKeyboard/AnySoftKeyboard/pull/3434/commits/b3d74ecbb406952a2052caa3931b54c2932785c5

I assume that somehow the owner of the files was changed, and hashFiles could not traverse the folder structure and failed.

dsgriffin commented 1 year ago

Just want to +1 this feature, my CI process downloads a .jar file that I only want to download once, so I have this:

    env:
      FIREBASE_EMULATORS_PATH: ${{ github.workspace }}/emulator-cache

// ...

    - name: 'Cache Emulators'
      uses: actions/cache@v1
      with:
        path: ${{ env.FIREBASE_EMULATORS_PATH }}
        key: ${{ runner.os }}-firebase-emulators-${{ hashFiles('emulator-cache/**') }}
      continue-on-error: true

This doesn't work because there's nothing in the emulator-cache directory when I start so hashFiles() fails. I could probably work around this by creating a dummy file there in a previous run step but I would really prefer not to.

any solution for this in the end? still having this issue when trying to cache Firebase Emulators now