actions / cache

Cache dependencies and build outputs in GitHub Actions
MIT License

Provide a mechanism to have a per file cache eviction/retention #788

Open devminded opened 2 years ago

devminded commented 2 years ago

This is related to issue https://github.com/actions/setup-java/issues/269

The problem is that caches fill up over time as dependencies, runtimes, and tooling are upgraded. Old files are never evicted and the cache grows. The current solution is to recalculate the cache-key at every build (base it on the week number or such) and throw it all away, but that works against the purpose of a cache to begin with.

I suggest that when saving the caches we should be able to evict files older than a configurable number of days. That way old dependencies will be removed over time and we can have the best of both worlds.

PS. I'm not sure how the cache-hit logic works in this scenario.

Something like this:

- name: Configure Gradle JDK cache
  uses: actions/cache@v3
  with:
    path: ~/.gradle/jdks
    key: gradle-jdks-${{ runner.os }}
    # Evict files older than 30 days from the cache and repackage.
    eviction:
      include:    # required, we don't want to remove just any file and corrupt the cache
        - "**/*.zip"
        - "**/*.jar"
        - "**/*.tar"
      days: 30

bishal-pdMSFT commented 2 years ago

@devminded can you please help me understand the use case better?

The problem is that caches fill up over time as dependencies, runtimes, and tooling are upgraded.

If I'm not mistaken, the dependencies get updated on every build, which means the old files go away, so the cache should only contain the latest files.

Is this problem more with runtimes and tooling, where multiple versions may exist side by side? If so, the impact may be much smaller, since such version changes would not happen too frequently. Am I reading this wrong?

devminded commented 2 years ago

Not sure if I have misunderstood something in how the cache-mechanism works.

As far as I understand, a cache hit simply means that we found a cache with a matching key, which is then restored. How we calculate this key on each build determines whether we restore the cache or not.

The issue is that for example maven/gradle saves all the dependencies, toolchains, wrappers, etc. in a directory. Gradle for example has a default 30 day eviction from some of these directories, but (AFAIK) it's based on "last accessed time" which seems to break when using GitHub caches, so after a while every new cache-file becomes larger.

Some things can be managed by being picky about how we generate the cache key (like hashing the gradle-wrapper file), but that has two issues:

  1. Changing the cache key will cause a cache miss and the entire cache will be thrown away even if the change is tiny. And the next build will have to download the internet again.
  2. Some things do not have a simple "file to hash" to generate a key from; they are part of a larger build file that changes often.

What I feel is missing is some kind of middle ground where we can evict content based on some rule (so it's excluded when packing the cache).

Perhaps I'm missing something obvious here.

bishal-pdMSFT commented 2 years ago

@devminded looks like your ask is to be able to update a cache. Something similar to #342 ?

Essentially you want to:

There are two parts to it which are not possible today:

  1. There is no provision to update a cache. Caches are immutable: you can only create a new cache, and hence the key is supposed to be more dynamic. This ask seems similar to #342. Can't you use a more static restore-key to be able to reuse an older cache?
  2. Even if a cache could be updated, it is stored as a single tar archive, so it is not possible to "purge" some directories from it. Can you work around this by only caching the directories which don't grow unboundedly? You may want to use a different cache key with a timestamp-based component for the unbounded ones, so that they get purged periodically.
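
To make the key and restore-key interplay discussed above concrete, here is a rough toy model in shell. It only models the two steps mentioned in this thread (exact match on `key`, then prefix match on `restore-keys`) and ignores everything else the real service considers, such as branch scope. The function name `lookup_cache` and the key names are hypothetical, and keys are assumed to contain no whitespace.

```shell
# Toy model of cache lookup (an illustration, not the action's real code).
# $1 = primary key
# $2 = space-separated restore-key prefixes, in priority order
# $3 = space-separated existing cache keys, newest first
lookup_cache() {
  key=$1
  restore_keys=$2
  existing=$3

  # Step 1: an exact match on the primary key is a cache hit.
  for candidate in $existing; do
    if [ "$candidate" = "$key" ]; then
      printf 'hit %s\n' "$candidate"
      return 0
    fi
  done

  # Step 2: try each restore key as a prefix; the newest matching entry wins.
  for prefix in $restore_keys; do
    for candidate in $existing; do
      case $candidate in
        "$prefix"*)
          printf 'partial %s\n' "$candidate"
          return 0
          ;;
      esac
    done
  done

  printf 'miss\n'
  return 1
}

# A new weekly key misses exactly, but the static prefix restores last week's cache.
lookup_cache "gradle-jdks-Linux-2024-W12" "gradle-jdks-Linux-" \
  "gradle-jdks-Linux-2024-W11 gradle-jdks-Linux-2024-W10"
```

In this model a dynamic key combined with a static restore-key prefix restores the most recent older cache, and because the exact key missed, a new cache is saved under the new key at the end of the job. That restored content includes the stale files, which is why this alone does not bound cache growth.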

devminded commented 2 years ago

I understand that it goes into one large tar that gets packed at the end of the build. The problem is just that the source for the tar is a bunch of directories that our build tools fill with stuff but are unable to clean: their cleanup is based on timestamps, and the cache pack-unpack mechanism seems to do something with the timestamps.

I guess I will do what I wrote in my original post and base the cache key on the week number or something.

With that said, I would like to propose the following: the actions for setup-java, setup-node, etc. have a cache property that sets up the cache and keys for gradle, maven, node, etc. Can we add a new field "append-cache-key" to those actions where we can add extra info (like a week number) that gets appended to the generated cache keys? That way we still have some additional options for the keys.
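
As an illustration of the week-number idea, a minimal shell sketch that builds a key rotating once per ISO week. The helper name `weekly_cache_key` and the `gradle-jdks-Linux` prefix are made up for this example; the "append-cache-key" field proposed above does not exist, so this only covers a plain `actions/cache` key.

```shell
# Hypothetical helper: build a cache key that changes once per ISO week.
weekly_cache_key() {
  prefix=$1
  # %G = ISO week-numbering year, %V = ISO week number (01..53); -u pins UTC
  # so the key does not depend on the runner's time zone.
  printf '%s-%s\n' "$prefix" "$(date -u +%G-W%V)"
}

weekly_cache_key "gradle-jdks-Linux"
```

In a workflow, a step could append a line such as `week-key=<value>` to the file named by `$GITHUB_OUTPUT` and a later cache step could reference that step output in its `key:`. The trade-off is the one already described in this thread: the first build of each week starts from an empty cache.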

codylerum commented 1 year ago

This is a pretty common issue with maven caches. If you have a dependency of foo-1.0.0.jar and then upgrade to foo-1.0.1.jar the original foo-1.0.0.jar will stay in the cache forever.

I have a step in my builds that removes those dirs from .m2 at the end of the build if their last accessed time is older than that of a dir, /var/oss-test, that I created at the start of the build.

- name: Remove Unused Cache
  run: |
    # Delete artifact dirs whose .pom was not accessed since /var/oss-test was created.
    sudo find ~/.m2 ! -neweraa /var/oss-test -iname '*.pom' | while read -r pom; do
      rm -Rf "$(dirname "$pom")"
    done

Something built in to delete the dir for the maven dep if not accessed in X days would be nice and would reduce the cache size for a lot of people significantly.
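
For anyone who wants to try the marker approach locally, here is the same idea as a small shell function. It is a sketch, not a tested drop-in: the name `prune_unaccessed` is hypothetical, it assumes a `find` that supports `-neweraa` (GNU and BSD find do), and it only works where the filesystem actually records access times (for example, not mounted with `noatime`).

```shell
# Remove every directory under $1 that contains a .pom file whose access time
# is not newer than the access time of the marker file $2.
# Create the marker at the start of the build, run this at the end.
prune_unaccessed() {
  repo_dir=$1
  marker=$2
  # -exec ... {} + passes the matching paths as arguments to a small inline
  # script, so paths with spaces are handled safely.
  find "$repo_dir" -type f -iname '*.pom' ! -neweraa "$marker" \
    -exec sh -c 'for pom; do rm -rf -- "$(dirname -- "$pom")"; done' sh {} +
}

# Usage (paths are examples):
#   touch /tmp/build-start            # at the start of the job
#   prune_unaccessed ~/.m2 /tmp/build-start   # at the end of the job
```

Using a marker created during the same job sidesteps the question of whether access times survive a cache restore, because only accesses made after the marker matter.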

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 200 days with no activity. Leave a comment to avoid closing this issue in 5 days.

williamdes commented 1 year ago

🏓

jcadavez commented 1 year ago

I'd like this feature too. There are some caches that I'm OK with letting expire after the default 7 days.

But there are larger caches that I'd like to keep for only about 1-2 days, and there's no GitHub Actions input to specify that.

github-actions[bot] commented 8 months ago

This issue is stale because it has been open for 200 days with no activity. Leave a comment to avoid closing this issue in 5 days.

williamdes commented 8 months ago

You shall not close

aaronadamsCA commented 7 months ago

@bishal-pdMSFT, I think the ask here is simply to delete stale files during cache restore (or save), based on configurable name patterns and maximum age.

@devminded, you may be able to do a version of this yourself with an additional workflow step at the end of your job:

- name: Delete cached files not modified in the last 30 days
  run: find . -type f -mtime +29 \( -name "*.jar" -o -name "*.tar" -o -name "*.zip" \) -delete
  working-directory: ~/.gradle/jdks

Ideally this would use the last access time, but this cache action doesn't appear to preserve atime on restore. Evicting files based on last modified time is probably wrong for most use cases, but also probably fine, as long as you can heal the cache by re-generating or re-downloading missing files.

If this action could preserve atime, that would be great; if it could automatically enforce a file retention policy for me based on atime, that would be even better.
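
The atime observation can be checked locally with a plain tar round trip. This is only a stand-in: it assumes GNU `touch`, `stat`, and `tar` with default options, and it does not reproduce the exact tar flags or compression the cache action uses. The function name `atime_roundtrip` and the file name are made up for the experiment.

```shell
# Pack and unpack one file with tar, then report which timestamps survived.
atime_roundtrip() {
  work=$(mktemp -d)
  mkdir "$work/src" "$work/dst"
  echo data > "$work/src/dep.jar"

  # Simulate a file modified 40 days ago and last read 10 days ago.
  touch -m -d '40 days ago' "$work/src/dep.jar"
  touch -a -d '10 days ago' "$work/src/dep.jar"
  before_mtime=$(stat -c %Y "$work/src/dep.jar")
  before_atime=$(stat -c %X "$work/src/dep.jar")

  # Round trip through an archive, as a cache save and restore would.
  tar -cf "$work/cache.tar" -C "$work/src" dep.jar
  tar -xf "$work/cache.tar" -C "$work/dst"
  after_mtime=$(stat -c %Y "$work/dst/dep.jar")
  after_atime=$(stat -c %X "$work/dst/dep.jar")

  echo "mtime preserved: $([ "$before_mtime" = "$after_mtime" ] && echo yes || echo no)"
  echo "atime preserved: $([ "$before_atime" = "$after_atime" ] && echo yes || echo no)"
  rm -rf "$work"
}

atime_roundtrip
```

If the restored file reports a fresh atime, any tool that evicts on "last accessed" sees every restored file as recently used, which would match the ever-growing caches described in this thread.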