GenericMappingTools / pygmt

A Python interface for the Generic Mapping Tools.
https://www.pygmt.org
BSD 3-Clause "New" or "Revised" License

The GitHub Actions caches don't work for PRs #523

Closed: seisman closed this issue 4 years ago

seisman commented 4 years ago

Description of the problem

See one of the recent PRs for an example.

The caches are correctly downloaded, but the "cache" step still runs:

[screenshot: CI log showing the cache step still running]

weiji14 commented 4 years ago

The download is saving it to '/home/runner/.gmt/server/earth/earth_relief/earth_relief_01d_g.grd.2'? Why the .2?

weiji14 commented 4 years ago

Ah, it's because we're using wget. Maybe use wget -c so we don't repeat the download.
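
Something like this, perhaps (a minimal sketch; the URL and target directory are illustrative, not our actual step):

    - name: Download remote data, resuming partial files
      run: |
        # -c continues a partial download in place instead of
        # writing a duplicate like earth_relief_01d_g.grd.2
        wget -c ${GMT_DATA_SERVER}/earth/earth_relief/earth_relief_01d_g.grd \
          -P ~/.gmt/server/earth/earth_relief/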

seisman commented 4 years ago

wget -c should work, but the problem is that the step should not be triggered at all.

seisman commented 4 years ago

This is from the master branch. PRs should also work like this one.

[screenshot: CI log from the master branch]

weiji14 commented 4 years ago

If we use wget -c, the step will run once when the PR is opened (to pull in the master cache) and a PR cache will be created. Subsequent runs in the PR will use that PR cache. See also my note at https://github.com/GenericMappingTools/pygmt/pull/475#discussion_r447276141.

If we don't want to trigger the download step, then we would need to use a 'global' cache, which can be done by removing the -${{ github.ref }}- part and dropping the restore-keys: line here:

https://github.com/GenericMappingTools/pygmt/blob/f401f8527cfee4dc8f48d78ad93d23b678ecfbc8/.github/workflows/ci_tests.yaml#L94-L95
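
Roughly the difference (a hypothetical sketch, not the actual contents of ci_tests.yaml; the key names are made up):

    # Current idea: per-branch key plus a fallback, so PRs restore
    # master's cache but still run the download step once.
    - name: Cache GMT remote data
      id: cache
      uses: actions/cache@v1
      with:
        path: ~/.gmt
        key: cache-gmt-${{ github.ref }}
        restore-keys: cache-gmt-

    # 'Global' alternative: one key shared by all branches; an exact
    # hit sets cache-hit, so a download step guarded by
    # `if: steps.cache.outputs.cache-hit != 'true'` is skipped entirely.
    - name: Cache GMT remote data
      id: cache
      uses: actions/cache@v1
      with:
        path: ~/.gmt
        key: cache-gmt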

seisman commented 4 years ago

So each PR first downloads the master cache, then creates its own PR cache. That means the PR cache can differ from the master cache, since the PR cache is always newer.

This may cause problems: a PR may work with the (newer) PR cache but not with the (older) master cache, so it may break when merged into the master branch.

weiji14 commented 4 years ago

That means the PR cache can differ from the master cache, since the PR cache is always newer.

Yes. We would then realize the cache needs updating on master and refresh the key accordingly. This actually means that PRs are unaffected (i.e. tests on PRs will always have a fresh cache). You can see it as a feature or a bug.

One thing we could look at is adding a date to the cache key, as outlined at https://github.com/actions/cache#creating-a-cache-key. Maybe use a cache that is refreshed monthly, but also have an environment key to do a manual bump if needed.
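
Something along these lines (a sketch following the actions/cache README pattern; the variable and key names here are assumptions):

    env:
      # bump this number to invalidate the cache manually
      DATA_CACHE_NUMBER: 0
    steps:
      - name: Get the current month
        id: date
        run: echo "::set-output name=month::$(date +%Y-%m)"
      - name: Cache GMT remote data
        uses: actions/cache@v1
        with:
          path: ~/.gmt
          # the key rolls over monthly, or when DATA_CACHE_NUMBER is bumped
          key: gmt-data-${{ env.DATA_CACHE_NUMBER }}-${{ steps.date.outputs.month }}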

seisman commented 4 years ago

You can see it as a feature or a bug.

I can take it as a feature. I think we can first try whether wget -c works.

seisman commented 4 years ago

wget -c doesn't work as we expected.

    Beginning with Wget 1.7, if you use -c on a file which is of equal size as the one on the
    server, Wget will refuse to download the file and print an explanatory message.  The same
    happens when the file is smaller on the server than locally (presumably because it was
    changed on the server since your last download attempt)---because "continuing" is not
    meaningful, no download occurs.

    On the other side of the coin, while using -c, any file that's bigger on the server than
    locally will be considered an incomplete download and only "(length(remote) -
    length(local))" bytes will be downloaded and tacked onto the end of the local file.  This
    behavior can be desirable in certain cases---for instance, you can use wget -c to download
    just the new portion that's been appended to a data collection or log file.

    However, if the file is bigger on the server because it's been changed, as opposed to just
    appended to, you'll end up with a garbled file.  Wget has no way of verifying that the
    local file is really a valid prefix of the remote file.  You need to be especially careful
    of this when using -c in conjunction with -r, since every file will be considered as an
    "incomplete download" candidate.

weiji14 commented 4 years ago

Sigh, cache invalidation is hard. I'd like to use rsync, but it's Linux/macOS only. Is there a way to download only if the hash has changed, or do we need to go for a Python solution like Requests or Pooch? Also, what's the situation with the Europe mirror working but not Oceania?

seisman commented 4 years ago

Currently, I'm using wget ${GMT_DATA_SERVER}/cache/${data} -O ~/.gmt/cache/${data} to always overwrite the existing file, so the PR cache won't store files like earth_relief_01d_g.grd.2.
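
As a workflow step it looks roughly like this (the file list here is only illustrative):

    - name: Download remote data, overwriting stale copies
      run: |
        for data in ridge.txt tut_quakes.ngdc; do
          wget ${GMT_DATA_SERVER}/cache/${data} -O ~/.gmt/cache/${data}
        done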

Is there a way to download only if the hash has changed

Yes, it's possible, but it needs some complex coding.

do we need to go for a python solution like Requests or Pooch?

That's what I'm going to try. It's still unclear whether the slow connection is due to the GitHub Actions hosts or the GMT server configuration.

what's the situation with the Europe mirror working but not Oceania?

Perhaps not. The Europe mirror seemed to work once, then always failed. Maybe I misread the CI log.

weiji14 commented 4 years ago

Is there a way to download only if the hash has changed

Yes, it's possible, but it needs some complex coding.

That's what I'm afraid of. Actually, I wonder if this https://github.com/marketplace/actions/action-rsync is cross-platform (read: works on Windows).

seisman commented 4 years ago

It needs an SSH key, and it may only work for a whole directory or a single file.

weiji14 commented 4 years ago

How about using wget --no-clobber, which would 'skip downloads that would download to existing files'? We'll lose the 'feature' we thought wget -c would provide, but I think that's fine.
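
That would be a one-flag change (sketch; paths illustrative):

    - name: Download remote data, skipping files that already exist
      run: |
        # --no-clobber can't be combined with -O, so use -P to set
        # the target directory instead
        wget --no-clobber ${GMT_DATA_SERVER}/cache/ridge.txt -P ~/.gmt/cache/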

seisman commented 4 years ago

The connection to the GMT data server is slow, but the download itself is very fast. The cache step could be greatly improved if only a single wget command were used.
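
For example, passing all the URLs to one invocation pays the connection cost only once, since wget keeps the connection to the host alive (file names illustrative):

    - name: Download all remote files with a single wget call
      run: |
        wget -P ~/.gmt/cache/ \
          ${GMT_DATA_SERVER}/cache/ridge.txt \
          ${GMT_DATA_SERVER}/cache/tut_quakes.ngdc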

weiji14 commented 4 years ago

I don't want to over-engineer this, but how about separating the cache mechanism into another CI job, e.g. using upload-artifact and download-artifact? The idea (sketched below the list) would be to:

  1. Have a 'Cache CI' job pull the data from the GMT data server using macOS and upload it as an artifact.
  2. Let our Windows/Linux/macOS 'Tests CI' jobs download the artifact.
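
A rough sketch of the two jobs (job names, artifact name, and file list are all made up):

    jobs:
      cache_data:
        runs-on: macos-latest
        steps:
          - name: Download remote data from the GMT server
            run: wget ${GMT_DATA_SERVER}/cache/ridge.txt -P ~/.gmt/cache/
          - name: Upload the data as an artifact
            uses: actions/upload-artifact@v2
            with:
              name: gmt-cache
              path: ~/.gmt

      test:
        needs: cache_data
        strategy:
          matrix:
            os: [ubuntu-latest, macos-latest, windows-latest]
        runs-on: ${{ matrix.os }}
        steps:
          - name: Download the cached data artifact
            uses: actions/download-artifact@v2
            with:
              name: gmt-cache
              path: ~/.gmt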
seisman commented 4 years ago

The artifact sounds like a good solution!

weiji14 commented 4 years ago

It's very fast too: 2 seconds to download the 10.2 MB of files! I've made a PR at #530.

[screenshot: CI log showing the artifact download]