So you're proposing to have /server/earth/relief/earth_relief_xxy_p|g.grd instead of /server/earth_relief/earth_relief_xxy_p|g.grd.
The former path is one level deeper. Does it make the code more complicated when GMT tries to look for the data files locally (more subdirectories to search)?
No complication. We create the expected path and the path is either valid or not (a simple access check).
Yes, we do need to make more subdirs on the user side on demand, but that can be done by a single smart function (see the sketch below).
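To illustrate the idea, here is a minimal shell sketch; the real code would be a C function inside GMT, and the ~/.gmt/server cache location and helper name are just placeholders:

    # Hypothetical helper: make sure the local cache mirrors the server tree.
    # mkdir -p creates all missing parent directories in one call, so one
    # extra level on the server costs essentially nothing on the user side.
    ensure_local_dir () {
        mkdir -p "$HOME/.gmt/server/$1"
    }

    ensure_local_dir "earth/relief"   # before caching an earth_relief grid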
The proposed scheme allows for expansion of the gmt clear command as well. For example, deleting all the data from Mars could be gmt clear data mars.
This sounds good to me. I can foresee that there will be misunderstandings about the pedigrees of data sources, so it is good that you are including the comments/notifications about attribution. W
Hi @PaulWessel, I like this organization better than what we had before (and the naming scheme). The citations could also be included in the grid metadata, if they aren't already.
One thing to be careful about when changing this is that older GMT versions will then break when remote files are requested. So a user of 6.0 trying to do gmt grdimage @earth_relief_15s after the migration would get a download error. If the storage requirements aren't too large, we might need to consider having versioned directories on the server.
Alternatively, we can have the data locations encoded in a file that GMT gets from the server. That way, older versions would still be able to find the data since the locations wouldn't be hardcoded.
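For illustration only, such a file might look like this (the file name, format, and paths are hypothetical, not a settled design):

    # gmt_data_locations.txt (hypothetical): fetched from the server once,
    # it maps dataset names to their current server paths.
    earth_relief_01d_g      /earth/relief/earth_relief_01d_g.grd
    earth_relief_15s_p      /earth/relief/earth_relief_15s_p.grd
    earth_daytime_30s_p     /earth/daytime/earth_daytime_30s_p.grd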
Yes, we are solving this with symbolic links on the server. It will be 100% backwards compatible.
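Something like this, with illustrative paths (assuming the grids move to the new tree and the old top-level name becomes a symlink):

    # On the server: move the grids to their new home, then leave the old
    # top-level directory behind as a symlink so that requests from older
    # GMT versions (e.g., 6.0 asking for /earth_relief/...) still resolve.
    mkdir -p /server/earth
    mv /server/earth_relief /server/earth/relief
    ln -s /server/earth/relief /server/earth_relief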
Hi, Paul and Leo,
The reference to version numbers reminds me that some of these data sets (e.g., Sandwell and Smith) are now into version XX where XX > 20, and new versions pop up from time to time.
Should we be supporting the archiving of old versions as an option for the GMT user? (In the case of Smith/Sandwell / Sandwell/Smith I think the intent in my shop at NOAA is to archive all the old versions.)
I can imagine the situation where a user wants to re-create a plot EXACTLY as it looked 3 years before, including with an old version (e.g. she is trying to illustrate that an island used to be in the wrong place and now it is not).
W
Hi Walter, this is something we discussed several times already. I'm all for encouraging reproducibility, but I'm not keen to get into the dependency-resolution game (and I gather neither is Paul). The idea was to serve the latest data, which means that scripts can break if the data change significantly.
The solution there is to be clear in the documentation that you should not rely on these datasets if you require reproducibility. We can recommend that people download them initially and then copy them from the GMT folder for archival.
Yes, experts who know what they are doing and need version XX of some product will need to get that from wherever they usually do. We will only serve the very latest releases of the datasets we provide. Anything else would be a big time-sink.
OK, agreed this is best.
I will soon start work on #46 which is previewing a bit of the things in the NASA proposal (still pending!). BTW, UNAVCO will be an official data server as well.
@PaulWessel what we could do is archive the data in Zenodo whenever we change it. That's not much trouble. The website can link to historic datasets and the grids can include the Zenodo DOI. It could be 1 upload per set of grids (all resolutions for example) to make it easier. Then we can point people to Zenodo if they need. That way they can get our GMT-ready version of the grids.
Zenodo has an API so the upload could even be automated with a script.
So no issues with archiving potentially 100s of GB? So if Dietmar changes his age grid we make another archive of everything, including SRTM tiles, etc? I am not opposed, just drowning in GMT changes right now...
> So no issues with archiving potentially 100s of GB?

They archive things from the ATLAS experiment so they can handle some puny grids 🙂 Their policy says 50 GB per dataset but no limit per account. So as long as we publish individual grids or datasets, we should be fine. https://help.zenodo.org/

> So if Dietmar changes his age grid we make another archive of everything, including SRTM tiles, etc?

No no, only the different resolution grids of the data that changed. So we would have an "Earth relief v1.0" release and then "Earth relief v2.0" and so on.

> I am not opposed, just drowning in GMT changes right now...

This doesn't have to be done for 6.1. It will be a while until the datasets change and it won't happen often. We can try an upload later on to test it out. If it's too much trouble we can always abandon it.
I'll have a look at their API when I have some time (https://developers.zenodo.org/#quickstart-upload). I might be able to make a script that uploads individual grids and writes their DOI to the netCDF metadata. That way, we can include this as the last step in the build process for new grids.
It does sound like a nice workflow.
Seems kind of straightforward, actually. The script can reserve the DOI and write it to the grid. We can even check the MD5 of the upload against the local file to make sure it worked. Will have to think about security a bit since it requires an access token, which would have to be kept encrypted somehow.
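A minimal sketch of that script, assuming Zenodo's deposition API as documented in their quickstart; the grid name, the token handling, and the choice of grdedit's remark field for the DOI are all illustrative:

    #!/usr/bin/env bash
    TOKEN="..."                    # Zenodo access token: must be kept secret
    GRID="earth_relief_01d_g.grd"

    # 1. Create an empty deposition; Zenodo pre-reserves a DOI for it.
    DEP=$(curl -s -X POST -H "Content-Type: application/json" -d '{}' \
          "https://zenodo.org/api/deposit/depositions?access_token=$TOKEN")
    DOI=$(echo "$DEP" | jq -r '.metadata.prereserve_doi.doi')
    BUCKET=$(echo "$DEP" | jq -r '.links.bucket')

    # 2. Write the reserved DOI into the grid's netCDF metadata.
    gmt grdedit "$GRID" -D+r"doi:$DOI"

    # 3. Upload the grid to the deposition's file bucket (an HTTP PUT).
    RESP=$(curl -s --upload-file "$GRID" "$BUCKET/$GRID?access_token=$TOKEN")

    # 4. Zenodo reports an MD5 checksum; compare it against the local file.
    REMOTE=$(echo "$RESP" | jq -r '.checksum' | sed 's/^md5://')
    LOCAL=$(md5sum "$GRID" | cut -d' ' -f1)   # "md5 -q" on macOS
    [ "$REMOTE" = "$LOCAL" ] && echo "upload verified" || echo "MD5 mismatch!"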
Since we are getting close to the release of 6.1 we need to finalize decisions on names. To set the stage, let me paint the picture of the plans ahead. These will be accelerated should our NASA proposal be funded, but are likely to happen regardless. The idea is to make GMT the simplest tool for making maps from remote data. Given that we already serve earth_relief in various resolutions, and from 6.1 also in both pixel and gridline registration, you know the basics. Here are two things from the NASA proposal that will affect our work:
gmt grdimage earth_relief -pdf map
and GMT will select the appropriate grid resolution that will render a map of the requested dimensions (implicitly 15 cm here) at the stated resolution (or higher). The stated resolution would be a new GMT default: GMT_IMAGE_RESOLUTION [300]. The reason for this is that the common man cannot be trusted to pick the right grid resolution when making a map. All of this is seamless and under the hood, and new data are automatically downloaded to the user after we refresh the server.

Given those plans, I would prefer to have a common naming scheme [and this also affects the organization of the server directories (#37) in one new way]. I imagine the layout and names would be something like this:
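Purely as a sketch, using only names already discussed in this thread (nothing here is final):

    /server
        /earth
            /relief      earth_relief_xxy_g.grd, earth_relief_xxy_p.grd, ...
            /daytime     earth_daytime_xxy_p.grd, ...
            /nighttime   earth_nighttime_xxy_p.grd, ...
            /age         earth_age_xxy_g.grd, ...
        /mars
            /relief      mars_relief_xxy_g.grd, ...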
From this layout, I hope you understand why I do not support having the BlueMarble and BlackMarble be named that way in GMT. Those names mean something to those who are aware of them, but if you are not, it is not obvious what those data represent. I argue that earth_daytime and earth_nighttime (or similar) would be clearer and would fit into the naming hierarchy of the above plan. Beyond the names, this layout means I believe we should move the earth_relief files to that new subdirectory. Since 6.1 is not out yet, now is the time to settle on a permanent directory structure that can easily accommodate new data types and planetary bodies.
A final reminder: when the user first accesses a remote file, we put up a notice with the reference and credits (such as for BlueMarble etc.), e.g.,
grdinfo [NOTICE]: Earth Relief at 1x1 arc degrees from Gaussian Cartesian filtering (111 km fullwidth) of SRTM15+V2.1 [Tozer et al., 2019].
I hope you will approve of this plan. I would like feedback from @GenericMappingTools/core on this.