seisman closed this issue 4 years ago.
The download is saving it to `/home/runner/.gmt/server/earth/earth_relief/earth_relief_01d_g.grd.2`? Why the `.2`?
Ah, that's because we're using `wget`. Maybe use `wget -c` so we don't repeat the download.
`wget -c` should work, but the problem is that the step should not be triggered at all. This example is from the master branch; PRs should also work like this one.
If we use `wget -c`, the step will run once when the PR is opened (to pull in the master cache) and a PR cache will be created. Subsequent runs in the PR will use that PR cache. See also my note at https://github.com/GenericMappingTools/pygmt/pull/475#discussion_r447276141.
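For context, the per-branch cache step being discussed looks roughly like this (just a sketch; the key names and paths are illustrative, not the actual workflow file):

```yaml
# Illustrative only: the master cache is picked up via restore-keys, and a
# new cache is then saved under the PR's own ref at the end of the job.
- name: Cache GMT remote data
  uses: actions/cache@v2
  with:
    path: ~/.gmt
    key: gmt-cache-${{ github.ref }}-data
    restore-keys: gmt-cache-
```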
If we don't want to trigger the download step, then we would need to use a 'global' cache, which can be done by removing the `-${{ github.ref }}-` part of the key and dropping the `restore-keys:` line; a rough sketch of the result follows.
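Roughly, the 'global' variant would look like this (again only a sketch, with a made-up key name):

```yaml
# Illustrative only: no ${{ github.ref }} in the key and no restore-keys:,
# so master and every PR restore and save the exact same cache entry, and
# the download step is skipped whenever that entry already exists.
- name: Cache GMT remote data
  uses: actions/cache@v2
  with:
    path: ~/.gmt
    key: gmt-cache-data
```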
So each PR first downloads the master cache, then creates its own PR cache. This means the PR cache can differ from the master cache, since the PR cache is always newer. It may cause problems: a PR may work with the (newer) PR cache but not with the (older) master cache, so it may break once merged into the master branch.
> It means that the PR cache can be different from the master cache as the PR cache is always newer.
Yes. We would then realize the cache needs updating on master and refresh the key accordingly. This actually means that PRs are unaffected (i.e. tests on PRs will always have a fresh cache). You can see it as a feature or a bug.
One thing we could look at is to add a date to the cache-key as outlined at https://github.com/actions/cache#creating-a-cache-key. Maybe do a cache that is refreshed 'monthly', but also have an environment key to do a manual bump if needed.
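A sketch of what that might look like, following the pattern in the actions/cache README (the `DATA_CACHE_BUMP` variable name is made up):

```yaml
# Illustrative only: the cache key changes once a month, and bumping
# DATA_CACHE_BUMP forces a refresh at any other time.
env:
  DATA_CACHE_BUMP: "1"
steps:
  - name: Get current month
    id: date
    run: echo "::set-output name=month::$(date +%Y-%m)"
  - name: Cache GMT remote data
    uses: actions/cache@v2
    with:
      path: ~/.gmt
      key: gmt-cache-${{ env.DATA_CACHE_BUMP }}-${{ steps.date.outputs.month }}
```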
> You can see it as a feature or a bug.
I can take it as a feature. I think we can first try whether `wget -c` works.
`wget -c` doesn't work as we expected. From the wget manual:
> Beginning with Wget 1.7, if you use -c on a file which is of equal size as the one on the server, Wget will refuse to download the file and print an explanatory message. The same happens when the file is smaller on the server than locally (presumably because it was changed on the server since your last download attempt)---because "continuing" is not meaningful, no download occurs.
>
> On the other side of the coin, while using -c, any file that's bigger on the server than locally will be considered an incomplete download and only "(length(remote) - length(local))" bytes will be downloaded and tacked onto the end of the local file. This behavior can be desirable in certain cases---for instance, you can use wget -c to download just the new portion that's been appended to a data collection or log file.
>
> However, if the file is bigger on the server because it's been changed, as opposed to just appended to, you'll end up with a garbled file. Wget has no way of verifying that the local file is really a valid prefix of the remote file. You need to be especially careful of this when using -c in conjunction with -r, since every file will be considered as an "incomplete download" candidate.
Currently, I'm using `wget ${GMT_DATA_SERVER}/cache/${data} -O ~/.gmt/cache/${data}` to always overwrite the existing file, so the PR cache won't store files like `earth_relief_01d_g.grd.2`.
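In workflow terms the download step looks roughly like this (file names are examples only, and `GMT_DATA_SERVER` is assumed to be set in the job environment):

```yaml
# Illustrative only: -O always overwrites, so repeated runs never leave
# numbered duplicates like *.grd.2 behind.
- name: Download remote data
  run: |
    mkdir -p ~/.gmt/cache
    for data in ridge.txt tut_bathy.nc; do
      wget "${GMT_DATA_SERVER}/cache/${data}" -O ~/.gmt/cache/${data}
    done
```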
> Is there a way to download only if the hash has changed?

Yes, it's possible, but it needs some complex coding.
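For illustration, a hash check would need something along these lines; the per-file checksum URL here is purely an assumption (the GMT server may not expose one), which is part of why it gets complicated:

```yaml
# Illustrative only: compare a (hypothetical) remote checksum with the local
# file's checksum and download only when they differ.
- name: Download only if the remote hash changed
  run: |
    mkdir -p ~/.gmt/cache
    remote_hash=$(wget -qO- "${GMT_DATA_SERVER}/cache/ridge.txt.sha256" | cut -d' ' -f1)
    local_hash=""
    if [ -f ~/.gmt/cache/ridge.txt ]; then
      local_hash=$(sha256sum ~/.gmt/cache/ridge.txt | cut -d' ' -f1)
    fi
    if [ "${remote_hash}" != "${local_hash}" ]; then
      wget "${GMT_DATA_SERVER}/cache/ridge.txt" -O ~/.gmt/cache/ridge.txt
    fi
```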
> ...do we need to go for a Python solution like Requests or Pooch?

That's what I'm going to try. It's still unclear whether the slow connection is due to the GitHub Actions hosts or the GMT server configuration.
> What's the situation with the Europe mirror working but not Oceania?

Perhaps not. The Europe mirror seemed to work once, then always fails. Maybe I misread the CI log.
> Is there a way to download only if the hash has changed?
>
> Yes, it's possible, but it needs some complex coding.
That's what I'm afraid of. Actually, I wonder whether https://github.com/marketplace/actions/action-rsync is cross-platform (read: works on Windows).
It needs an SSH key, and it may only work for a whole directory or a single file.
How about using `wget --no-clobber`, which would 'skip downloads that would download to existing files'? We'll lose the 'feature' we thought `wget -c` would provide, but I think that's fine.
Connecting to the GMT data server is slow, but the download itself is very fast. The cache step could be greatly improved if only a single `wget` command were used.
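A sketch of what that could look like (file names are examples only): a single invocation can reuse the connection for all URLs, and `--no-clobber` still skips files that already exist locally.

```yaml
# Illustrative only: fetch everything with one wget call instead of one call
# (and one slow connection setup) per file.
- name: Download remote data in one go
  run: |
    mkdir -p ~/.gmt/cache
    wget --no-clobber --directory-prefix ~/.gmt/cache \
      "${GMT_DATA_SERVER}/cache/ridge.txt" \
      "${GMT_DATA_SERVER}/cache/tut_bathy.nc"
```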
I don't want to over-engineer this, but how about separating the cache mechanism into another CI task, e.g. using upload-artifact and download-artifact? The idea would be to:
The artifact sounds like a good solution!
It's very fast too, 2 seconds to download the 10.2 MB files! I've made a PR at #530.
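For reference, a very rough sketch of the artifact idea (job, step, and artifact names are made up, and this is not necessarily how #530 implements it):

```yaml
# Illustrative only: one job downloads the data and publishes it as an
# artifact; test jobs then pull that artifact instead of hitting the GMT
# data server themselves. GMT_DATA_SERVER is assumed to be defined elsewhere.
jobs:
  cache_data:
    runs-on: ubuntu-latest
    steps:
      - name: Download remote data
        run: |
          mkdir -p ~/.gmt/cache
          wget "${GMT_DATA_SERVER}/cache/ridge.txt" -O ~/.gmt/cache/ridge.txt
      - name: Upload data as an artifact
        uses: actions/upload-artifact@v2
        with:
          name: gmt-cache
          path: ~/.gmt

  test:
    needs: cache_data
    runs-on: ubuntu-latest
    steps:
      - name: Download the data artifact
        uses: actions/download-artifact@v2
        with:
          name: gmt-cache
          path: ~/.gmt
      # ... set up Python / GMT and run the test suite here
```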
Description of the problem
See one of the recent PRs for an example.
The caches are correctly downloaded, but the "cache" step still runs: