GenericMappingTools / gmtserver-admin

Cache data and script for managing the GMT data server
GNU Lesser General Public License v3.0

The naming of the remote files #44

Closed PaulWessel closed 4 years ago

PaulWessel commented 4 years ago

Since we are getting close to the release of 6.1 we need to finalize decisions on names. To set the stage, let me paint the picture of the plans ahead. These will be accelerated should our NASA proposal be funded, but are likely to happen regardless. The idea is to make GMT the simplest tool for making maps with remote data. Given that we already serve earth_relief in various resolutions, and from 6.1 also in both pixel and gridline registration, you know the basics. Here are three things from the NASA proposal that will affect our work:

  1. For plotting maps (not computing), it will be allowed to not specify the resolution. I.e., I would just say gmt grdimage earth_relief -pdf map and GMT will select the appropriate grid resolution that will render a map of the requested dimensions (implicitly 15 cm here) at a stated resolution (or higher). The stated resolution would be a new GMT default: GMT_IMAGE_RESOLUTION [300]. The reason for this is that the common man cannot be trusted to pick the right grid resolution when making a map. All of this is seamless and under the hood, and new data are automatically downloaded to the user after we refresh the server.
  2. We plan to add relief, gravity, and imagery for other planetary bodies (Mars, Moon, Venus, Mercury, etc). For Earth we have made a deal with EarthByte to distribute earth_age_xxy_g|p.grd and we will work with Sandwell to provide earth_gravity_xxy_g|p.grd as well.
  3. Because the initial download of the 15s earth_relief file takes a long time, we plan to split these files into tiles (similar to SRTM but larger) so that users only need to download the tiles they need; this will dramatically speed up response times and avoid 3.1 GB initial downloads.
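
As a rough illustration of point 1, the under-the-hood selection could work along these lines. This is only a sketch: the list of increments and the selection rule are my assumptions, not GMT's actual implementation.

```python
# Available earth_relief increments, in arc seconds, coarsest first
# (01d down to 15s); an assumption for this sketch.
AVAILABLE = [3600, 1800, 900, 600, 360, 180, 120, 60, 30, 15]

def pick_resolution(map_width_cm, lon_span_deg, dpi=300):
    """Return the coarsest grid increment (arc sec) that still yields
    at least `dpi` dots per inch across a map of the given width that
    spans `lon_span_deg` degrees of longitude."""
    pixels_needed = map_width_cm / 2.54 * dpi            # dots across the map
    nodes_per_degree_needed = pixels_needed / lon_span_deg
    for inc in AVAILABLE:                                # try coarsest first
        if 3600 / inc >= nodes_per_degree_needed:
            return inc
    return AVAILABLE[-1]                                 # fall back to finest
```

For a global 15 cm map at 300 dpi this would pick the 10m grid (increment 600), while a 10-degree-wide region at the same size would need the 15s grid.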

Given those plans, I would prefer to have a common naming scheme [and this also affects the organization of the server directories (#37) in one new way]. I imagine the layout and names would be something like this:

server
   earth
      relief
         earth_relief_xxy_g|p.grd
         ........
      gravity
         earth_gravity_xxy_g|p.grd
         ........
      mask
         earth_mask_xxy_g|p.grd [land = 1, water = 0]
      images
         earth_daytime_xxy_p.tif
         ........
         earth_nighttime_xxy_p.tif
   moon
      relief
         moon_relief_xxy_g|p.grd
   ........
   mars
      relief
         mars_relief_xxy_g|p.grd
      images
         .....

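The layout above suggests a simple mapping from a remote filename to its server path. The parsing rule below (body_dataset_increment_registration) is my inference from the names shown, not GMT source code; imagery names like earth_daytime would need a special case to land under images/.

```python
import os

def server_path(filename, root="/server"):
    """Map e.g. 'earth_relief_01m_g.grd' to
    '/server/earth/relief/earth_relief_01m_g.grd' (sketch only)."""
    stem = filename.rsplit(".", 1)[0]        # drop the .grd/.tif extension
    body, dataset = stem.split("_")[:2]      # first two tokens: body, dataset
    return os.path.join(root, body, dataset, filename)
```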
From this layout, I hope you understand why I do not support having the BlueMarble and BlackMarble be named that way in GMT. Those names mean something to those who are aware of them, but if you are not, it is not obvious what those data are. I argue that earth_daytime and earth_nighttime (or similar) would be clearer and would fit into the naming hierarchy of the above plan. Beyond the names, this layout also means we should move the earth_relief files into the new subdirectory. Since 6.1 is not out yet, now is the time to settle on a permanent directory structure that can easily accommodate new data types and planetary bodies.

A final reminder: when the user first accesses a remote file, we put up a notice with the reference and credits (such as for BlueMarble etc.), e.g.,

grdinfo [NOTICE]: Earth Relief at 1x1 arc degrees from Gaussian Cartesian filtering (111 km fullwidth) of SRTM15+V2.1 [Tozer et al., 2019].

I hope you will approve of this plan. I would like feedback from @GenericMappingTools/core on this.

seisman commented 4 years ago

So you're proposing to have /server/earth/relief/earth_relief_xxy_p|g.grd instead of /server/earth_relief/earth_relief_xxy_p|g.grd.

The former path is one level deeper. Does it make the code more complicated when gmt tries to look for the data files locally (more subdirectories to search)?

PaulWessel commented 4 years ago

No complication. We create the expected path and the path is either valid or not (checked via access).

PaulWessel commented 4 years ago

Yes, we do need to make more subdirs on the user side on demand, but that can be done by a single smart function.
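
One possible shape for that "single smart function" (directory names are illustrative only, not GMT's actual cache layout):

```python
import os

def ensure_local_dir(user_dir, body, dataset):
    """Create e.g. <user_dir>/server/earth/relief/ on demand and
    return the path; a no-op if it already exists."""
    path = os.path.join(user_dir, "server", body, dataset)
    os.makedirs(path, exist_ok=True)
    return path
```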

PaulWessel commented 4 years ago

The proposed scheme allows for expansion of the gmt clear command as well. Maybe to delete all the data from Mars would be gmt clear data mars.

WalterHFSmith commented 4 years ago

This sounds good to me. I can foresee that there will be misunderstandings about pedigrees for data sources, so it is good that you are adding the comments/notifications about attribution. w

leouieda commented 4 years ago

Hi @PaulWessel, I like this organization better than what we had before (and the naming scheme). The citations could also be included in the grid metadata, if they aren't already.

One thing to be careful about when changing this is that older GMT versions will break when remote files are requested. So a user of 6.0 trying to do gmt grdimage @earth_relief_15s after the migration would get a download error. If the storage requirements aren't too large, we might need to consider having versioned directories on the server.

Alternatively, we can have the data locations encoded in a file that GMT gets from the server. That way, older versions will still be able to find the data since the locations wouldn't be hardcoded.
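
Such a server-side lookup file might look something like the fragment below; the columns and names are purely illustrative, not an actual GMT file format.

```
# dataset         server directory   available increments
earth_relief      /earth/relief/     01d 30m 15m 10m 05m 01m 30s 15s
earth_daytime     /earth/images/     ...
```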

PaulWessel commented 4 years ago

Yes, we are solving this with symbolic links on the server. It will be 100% backwards compatible.
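
A minimal sketch of how such links could work (example paths, not the actual server commands): the old flat directory name becomes a link into the new tree, so pre-6.1 clients requesting server/earth_relief/... keep working.

```shell
mkdir -p server/earth/relief
touch server/earth/relief/earth_relief_15s_p.grd   # stand-in for the real grid
ln -s earth/relief server/earth_relief             # old path -> new location
ls server/earth_relief/earth_relief_15s_p.grd      # resolves via the link
```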

WalterHFSmith commented 4 years ago

Hi, Paul and Leo,

Reference to version number reminds me that some of these data sets (e.g. Sandwell and Smith) are now into version XX where XX > 20, and new versions pop up from time to time.

Should we be supporting the archiving of old versions as an option for the GMT user? (In the case of Smith/Sandwell / Sandwell/Smith I think the intent in my shop at NOAA is to archive all the old versions.)

I can imagine the situation where a user wants to re-create a plot EXACTLY as it looked 3 years before, including with an old version (e.g. she is trying to illustrate that an island used to be in the wrong place and now it is not).

W

leouieda commented 4 years ago

Hi Walter, this is something we have discussed several times already. I'm all for encouraging reproducibility, but I'm not keen to get into the dependency-resolution game (and I gather neither is Paul). The idea was to serve the latest data, which means that scripts can break if the data change significantly.

leouieda commented 4 years ago

The solution there is to be clear in the documentation that you should not use these datasets if you require reproducibility. We can recommend that people download them once and then copy them out of the GMT folder for archival.

PaulWessel commented 4 years ago

Yes, experts who know what they are doing and need version XX of some product will need to get that from wherever they usually do. We will only serve the very latest releases of the datasets we provide. Anything else would be a big time-sink.

WalterHFSmith commented 4 years ago

OK, agreed this is best.

PaulWessel commented 4 years ago

I will soon start work on #46 which is previewing a bit of the things in the NASA proposal (still pending!). BTW, UNAVCO will be an official data server as well.

leouieda commented 4 years ago

@PaulWessel what we could do is archive the data in Zenodo whenever we change it. That's not much trouble. The website can link to historic datasets and the grids can include the Zenodo DOI. It could be 1 upload per set of grids (all resolutions for example) to make it easier. Then we can point people to Zenodo if they need. That way they can get our GMT-ready version of the grids.

leouieda commented 4 years ago

Zenodo has an API so the upload could even be automated with a script.

PaulWessel commented 4 years ago

So no issues with archiving potentially 100s of GB? So if Dietmar changes his age grid we make another archive of everything, including SRTM tiles, etc? I am not opposed, just drowning in GMT changes right now...

leouieda commented 4 years ago

> So no issues with archiving potentially 100s of GB?

They archive things from the ATLAS experiment, so they can handle some puny grids :slightly_smiling_face: Their policy says 50 GB per dataset but no limit per account. So as long as we publish individual grids or datasets, we should be fine. https://help.zenodo.org/

> So if Dietmar changes his age grid we make another archive of everything, including SRTM tiles, etc?

No no, only the different resolution grids of the data that changed. So we would have an "Earth relief v1.0" release and then "Earth relief v2.0" and so on.

> I am not opposed, just drowning in GMT changes right now...

This doesn't have to be done for 6.1. It will be a while until datasets change and it won't happen often. We can try an upload later on to test it out. If it's too much trouble we can always abandon it.

leouieda commented 4 years ago

I'll have a look at their API when I have some time (https://developers.zenodo.org/#quickstart-upload). Might be able to make a script that uploads individual grids and writes their DOI to the netCDF metadata. That way, we can include this as the last step in the build process for new grids.
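
The local half of that workflow could be sketched as below. The actual upload (a POST to Zenodo's deposition API with an access token) and the netCDF attribute write are omitted; the metadata field names follow Zenodo's deposition schema but should be checked against their API docs before use.

```python
import hashlib
import json

def md5sum(path, chunk=1 << 20):
    """MD5 of a local file, read in chunks; compare this against the
    checksum Zenodo reports after upload to verify the transfer."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def deposition_metadata(title, version, creators):
    """JSON metadata payload for a new Zenodo deposition (sketch)."""
    return json.dumps({"metadata": {
        "title": title,
        "version": version,
        "upload_type": "dataset",
        "creators": [{"name": c} for c in creators],
    }})
```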

PaulWessel commented 4 years ago

It does sound like a nice workflow.

leouieda commented 4 years ago

Seems kind of straightforward, actually. The script can reserve the DOI and write it to the grid. We can even check the MD5 of the upload against the local file to make sure it worked. We will have to think about security a bit since it requires an access token, which would have to be kept encrypted somehow.