Organization of server data

PaulWessel commented 4 years ago

Currently, we only have earth_relief_xxy files in the gmt/data directory (everything else is under gmt/data/cache). However, we are about to add both blue and black marbles and the global crustal ages, and it is likely there will be more data sets in the future that should not be considered for cache (since they will have multiple resolutions, etc.). To peek ahead, it is likely we will split large global items into tiles, similar to SRTM. Whether we do that right now or not, it seems we should think about organization. How about this:

gmt/data/cache: Odds and ends used for tests, examples, tutorials, etc.
gmt/data/server: Data served by us.  In here there will be subdirectories:
    earth_relief
    earth_ages
    earth_marble [maybe one each for black and blue, or some clever scheme]
    ...

Inside these directories are the actual files: earth_relief_xxy plus srtm1 and srtm3 will be in the earth_relief folder, etc.

Perhaps the gmtserver needs to produce or maintain a listing of what is in server so that gmt can discover that we have added more data. We would at least need to know if a dataset is tiled or not to know what to do. I think the decisions that happen in gmt_remote.c depending on earth_relief resolution (get file or get tiles) need to be abstracted away and be based on a setup file we refresh, just like we refresh the hashes.
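A minimal sketch of how a refreshed listing could drive the "get file or get tiles" decision. The listing format (one line per data set with a tiled/single flag), the entries, and the function names are all assumptions made for illustration; no such file exists on the server yet.

```python
# Hypothetical listing: one line per data set, "<name> <tiled|single>".
# Format and entries are assumptions, not an existing gmtserver file.
LISTING = """\
# dataset          layout
earth_relief_01m   single
earth_relief_01s   tiled
earth_age_01m      single
"""

def parse_listing(text):
    """Return {dataset_name: is_tiled} from the listing text."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, layout = line.split()
        info[name] = (layout == "tiled")
    return info

def fetch_plan(name, info):
    """Decide what to do for a requested remote data set."""
    if name not in info:
        return "unknown"
    return "get tiles" if info[name] else "get file"

info = parse_listing(LISTING)
print(fetch_plan("earth_relief_01s", info))
print(fetch_plan("earth_age_01m", info))
```

With something like this, adding a data set would only require refreshing the listing, not changing the source code.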

PaulWessel commented 4 years ago

One issue up front: If we make any structural changes to the gmtserver directories, we break access for everybody else. I think curl will follow symbolic links? If so, then we could add links such as earth_relief_01m.grd that point to, say, earth_relief/earth_relief_01mg.grd (notice the g for "gridline-registered"). I guess we can do an experiment on that.

PaulWessel commented 4 years ago

Experiment worked. A symbolic link in the right place can point to another file and be followed. So that is how we could introduce earth_relief/earth_relief_01mp|g.grd files with links to the gridline version from the directory above.

PaulWessel commented 4 years ago

To get ready for the 6.1.0 release, we should do this:

1. Upload the new sets of earth_relief_xxy_g|p.grd files supporting both pixel and gridline registrations and place them in a subdirectory called earth_relief.
2. Replace the current earth_relief_xxy.grd files with symbolic links to the corresponding files in the earth_relief directory (gridline versions, except for 15s, which is the original pixel).
3. Add the marble images (they now report -Rg instead of thousands so they are returned as geo). These should go into subdirectories too. What do you think is the best approach?
    3a. Directory earth_image, rename all files to earth_day_xxy and earth_night_xxy
    3b. Directories Blue_Marble and Black_Marble, with the Black|Blue_Marble_xxy names
    3c. Directories earth_day and earth_night, with corresponding xxy names
    3d. ????
4. Add new directory earth_age and add the earth_age_xxy_g|p.grd files as soon as EarthByte wraps up their 1x1 min grid. We have no examples of using these, so it is OK if these files take longer to be added and then we update docs in 6.1.x.

Please give feedback on this now, @joa-quim and @seisman. I am trying to avoid changes down the road, which will (hopefully) see:

Earth gravity data [Sandwell, others]
Other planets (Mars, Moon, ...)
Mechanism for plotting modules to not specify _xxy and have the module auto-compute needed resolution for the given map dimensions and GMT_IMAGE_RESOLUTION = 300 (or whatever).
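For the last item, a rough sketch of how a module might pick a resolution from the map width and the dots-per-inch setting. The list of available resolutions and the selection rule (the coarsest grid that still gives at least one node per output pixel) are assumptions for illustration, not an agreed design.

```python
# Grid spacings in arc seconds for some resolution codes (the "xxy" part).
# Which resolutions a data set offers is an assumption here.
AVAILABLE = {"01d": 3600, "30m": 1800, "15m": 900, "06m": 360,
             "02m": 120, "01m": 60, "30s": 30, "15s": 15}

def pick_resolution(region_width_deg, map_width_inch, dpi=300):
    """Pick the coarsest spacing that still gives one grid node per output pixel."""
    pixels = map_width_inch * dpi
    needed = region_width_deg * 3600.0 / pixels  # arc seconds per output pixel
    fine_enough = [(s, code) for code, s in AVAILABLE.items() if s <= needed]
    if not fine_enough:
        # Nothing is fine enough: fall back to the finest grid we have
        return min(AVAILABLE, key=AVAILABLE.get)
    return max(fine_enough)[1]

# Global map (360 degrees) plotted 6 inches wide at 300 dpi:
# 1800 pixels, so one pixel spans 720 arc seconds (12 arc minutes).
print(pick_resolution(360, 6))
```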

The @ algorithm will need to learn what is available via gmt_hash_server.txt. BTW, my hash_server file is 0 bytes on May 20, so something failed. What does yours say?

PaulWessel commented 4 years ago

I should add one more thing: The SRTM15+v2.1 is a floating point grid with more precision than the integers. This time I wrote it out in the format =ns+sa, which auto-scales the data to fit the full -32767/+32767 range, and hence the precision in the values is about 0.25-0.3 meters instead of 1 meter. This makes the files a bit larger since there are more different bits. As an example, the full 15s file grows from 2.6 GB to 3.1 GB, and the 1 minute file goes from 215 MB to 257 MB, both about a 20% increase. Do you want to dumb down to the nearest meter and retain the smaller file sizes? I.e., trading 3-4 times the precision for 1.2 times longer download time. I would certainly prefer the higher quality.
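The precision figure can be checked with a little arithmetic. This sketch shows the general idea of packing floats into 16-bit integers with a scale and offset; it is not taken from the GMT source, and the relief range used is an assumed round number rather than the actual grid limits.

```python
def int16_packing(zmin, zmax):
    """Scale and offset that map [zmin, zmax] onto -32767..+32767."""
    scale = (zmax - zmin) / (2 * 32767)  # data units per integer step
    offset = 0.5 * (zmax + zmin)         # data value stored as integer 0
    return scale, offset

def pack(z, scale, offset):
    """Float value to nearest 16-bit integer code."""
    return round((z - offset) / scale)

def unpack(n, scale, offset):
    """Integer code back to a float value."""
    return n * scale + offset

# Assumed global relief range in meters (illustrative only)
scale, offset = int16_packing(-10900.0, 8800.0)
print(f"step = {scale:.3f} m")  # roughly 0.3 m, versus 1 m for plain integers
```

The worst-case rounding error after a round trip is half a step, so about 0.15 m under these assumptions.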

PaulWessel commented 4 years ago

How do we handle this scenario:

User runs with the old name @earth_relief_30s. We download earth_relief_30s_g.grd because of the link.
User runs it again. We don't want to download again. So gmt_remote needs to be aware of the deprecated names and look for both that name and the new name.
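A sketch of the "look for both names" check, assuming deprecated names are earth_relief names without a registration suffix and that they map to the _g file. The function name and the default extension are hypothetical.

```python
import os

def find_local_copy(requested, local_dir, registration="g"):
    """Return the path of an already-downloaded file for this request, else None.

    Checks the name as given and, for a deprecated earth_relief name without
    a registration suffix, the new-style name (assumed to be <stem>_g.grd).
    """
    stem, ext = os.path.splitext(requested.lstrip("@"))
    ext = ext or ".grd"
    candidates = [stem + ext]
    if stem.startswith("earth_relief_") and not stem.endswith(("_g", "_p")):
        candidates.append(f"{stem}_{registration}{ext}")
    for name in candidates:
        path = os.path.join(local_dir, name)
        if os.path.exists(path):
            return path
    return None
```

If this returns a path, no download is needed; only when it returns None would the file be fetched.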

joa-quim commented 4 years ago

Symbolic links. So the user will say @earth_relief_xxy.grd and the link will give him @earth_relief_xxyp.grd, right? But what will happen next time he issues the same command? Will it not download the same file again because earth_relief_xxy.grd doesn't exist on his system?
3b, because those are known names whilst earth_day|night_ are unknowns.
For the same reason as 3, the files should keep their known names, *age.xxx.nc

Ok, I see that you are addressing my first point too

PaulWessel commented 4 years ago

Are you saying drop earth_ from the ages files? FYI, there are actually two files

age.2020.1.GTS2012.1m.nc age.2020.1.GK2007.1m.nc

for two different time-scales. The people who care about these grids are the same people who care about the different time-scales... So we may have to do

age_GTS2012_xxy_p|g and age_GK2007_xxy_p|g, but we could accept age_xxy_g|p to select GK2007 (I think they prefer that, and that is the only one they used in the past).

joa-quim commented 4 years ago

I would certainly prefer the higher quality.

Me too

joa-quim commented 4 years ago

Yes, drop the earth_ from name. And if they are both 1m only we don't need the _xxy

PaulWessel commented 4 years ago

You are forgetting 30m, 15m, all the way down to 2m. So xxy is there to stay.

PaulWessel commented 4 years ago

I wonder if this is a better solution:

Leave the current data directory on the server as is. Ubuntu users will still try to access that in 2023.
Create a new dir with another name than data, e.g., server (to match what we create in the user's directory)
Place all the new subdirs and data we discussed in the new server directory.

If we don't, then we will need to carry much complexity in the gmt_remote.c file to handle these cases:

Remember, any file not listed in the gmt_hash_table is assumed to have been removed and will be DELETED on the user side. Since earth_relief_xxy.grd won't be listed, it will be deleted, then re-downloaded, etc. We don't want that.
When users finally upgrade to 6.1 they move to the new system, and they will get the new data files.
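The deletion problem in the first case can be illustrated in a few lines. This is a simplified sketch of the rule as described here, not the actual gmt_remote.c logic; the function name is hypothetical.

```python
def files_to_delete(local_files, hash_table_names):
    """Local files missing from the hash table are treated as removed from the server."""
    listed = set(hash_table_names)
    return sorted(f for f in local_files if f not in listed)

# A new-style table that only lists the _g|p names ...
table = ["earth_relief_01d_g.grd", "earth_relief_01d_p.grd", "AFR.nc"]
# ... would flag an old-name file on the user side for deletion.
print(files_to_delete(["earth_relief_01d.grd", "AFR.nc"], table))
```

The old-name file would then be downloaded again on the next request and deleted again on the next refresh, which is the loop to avoid.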

This way the old system will work fine - they won't get the new files until they upgrade (a good argument to do so), and we don't have to deal with legacy file names and complicated checks for old and new file names.

Also remember: We will need to maintain two separate hash tables, one for pre-6.1 and one for the new files. GMT 5-6.0 will download the old one with the old files, 6.1 will download the new one with the new files and directory structure.

PaulWessel commented 4 years ago

Yes, drop the earth_ from name.

I will ask EarthByte what they prefer - they are their files.

joa-quim commented 4 years ago

6.1 will download the new one with the new files and directory structure.

But they can still call them @earth_relief_xxy.grd and get the old names, right?

PaulWessel commented 4 years ago

I guess regardless of scheme on the server, we still need to allow for an alias that matches earth_relief_xxy to earth_relief_xxy_g. Seems like we need these features:

6.1: If the user requests @earth_relief_xxy then we get @earth_relief_xxy_g and give a message that the name is deprecated and we are giving them the gridline-registered version. We assume the user wants the latest and greatest data. It won't be possible to get the old files from 6.1. Why would anyone want that?
5.x-6.0: If the user requests @earth_relief_xxy then we get @earth_relief_xxy from the old directory.
5.x-6.0: If the user requests @earth_relief_xxy_g then they get an error since the code has no idea.
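The three cases could be sketched as a single resolver. This is only an illustration of the behavior listed above: the function name, the version tuple, and the error type are assumptions, and in pre-6.1 versions the error would in practice come from the existing name checks rather than from new code.

```python
import re
import warnings

def resolve_request(name, version):
    """Map a requested @earth_relief name to the file to fetch, or raise.

    version is a (major, minor) tuple, e.g. (6, 1).
    """
    name = name.lstrip("@")
    has_registration = re.search(r"_[gp]$", name) is not None
    if version >= (6, 1):
        if not has_registration:
            # Deprecated name: warn and hand out the gridline-registered grid
            warnings.warn(f"{name} is deprecated; using gridline-registered {name}_g")
            return name + "_g"
        return name
    if has_registration:
        # 5.x-6.0 has no knowledge of the _g|p names
        raise ValueError(f"GMT {version[0]}.{version[1]} does not know {name}")
    return name  # old name, served from the old directory
```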

PaulWessel commented 4 years ago

Will need to test all this. Having a separate server dir means we can test the stuff in a new branch without breaking anything yet.

seisman commented 4 years ago

Leave the current data directory on the server as is. Ubuntu users will still try to access that in 2023.

Create a new dir with another name than data, e.g., server (to match what we create in the user's directory)

Place all the new subdirs and data we discussed in the new server directory.

I like the idea. In this case, the local file structure is the same as the remote one. Users can even use rsync to manually mirror the dataset.

We will need to maintain two separate hash tables, one for pre-6.1 and one for the new files. GMT 5-6.0 will download the old one with the old files, 6.1 will download the new one with the new files and directory structure.

Can we list all files (both the 6.0 and 6.1 data files) in the same gmt_hash_server.txt file? The file would have content like:

173
# list of old files
earth_relief_01d.grd    08871f1e1aa7feb0bb43a259130f74fcea1c54bfe4f6b9988b781b1e362198d4    108278
earth_relief_01m.grd    aa11e643221faef792639c5800fd9ccaa59c7c4e8cac73a17170edb3f4c19086    225267444
AFR.nc  ee581d480ab40b8c196dc1c5a951a05cc577c9b735865036b28ce223d827513f    129281
age.3.20.nc 8c6094015cedfc81bb4cf82e780ffcf709211c13f7b40fefe46a921611ca25af    442404
age_gridline.nc c9cc0f9424eb176cfde037aaf77f98c2713c22bf3afc0a225db04cd11a172b0a    1171167

# list of new files
server/earth_relief/earth_relief_01d_g.grd  08871f1e1aa7feb0bb43a259130f74fcea1c54bfe4f6b9988b781b1e362198d4    108278
server/earth_relief/earth_relief_01d_p.grd  aa11e643221faef792639c5800fd9ccaa59c7c4e8cac73a17170edb3f4c19086    225267444
cache/AFR.nc    ee581d480ab40b8c196dc1c5a951a05cc577c9b735865036b28ce223d827513f    129281
cache/age.3.20.nc   8c6094015cedfc81bb4cf82e780ffcf709211c13f7b40fefe46a921611ca25af    442404
cache/age_gridline.nc   c9cc0f9424eb176cfde037aaf77f98c2713c22bf3afc0a225db04cd11a172b0a    1171167

Does it make gmt_remote.c and backward compatibility easier?
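A small sketch of reading such a combined table. It assumes that each entry is "path sha256 size", that the leading line is an entry count, and that new-style entries can be recognized by a directory prefix; the function name is hypothetical.

```python
def parse_hash_table(text):
    """Split '<path> <sha256> <size>' entries into old (flat) and new (with subdir)."""
    old, new = {}, {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0].startswith("#"):
            continue  # skips blank lines, comments, and the leading count line
        path, sha256, size = fields
        target = new if "/" in path else old
        target[path] = (sha256, int(size))
    return old, new

sample = """\
173
# list of old files
AFR.nc  ee581d480ab40b8c196dc1c5a951a05cc577c9b735865036b28ce223d827513f    129281

# list of new files
cache/AFR.nc    ee581d480ab40b8c196dc1c5a951a05cc577c9b735865036b28ce223d827513f    129281
"""
old, new = parse_hash_table(sample)
print(sorted(old), sorted(new))
```

An older reader that treats every line as a flat file name would see the "server/..." and "cache/..." entries too, so whether this stays backward compatible would need the testing mentioned in the reply.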

PaulWessel commented 4 years ago

Will have to try and see. There are special pattern checks in gmt_remote in 6.0.0 that will prevent you from downloading the _g|p.grd files for sure.

PaulWessel commented 4 years ago

Currently, we include gmt_datasets.h to rule out invalid remote file names. However, this does not allow us to add more data without requiring a GMT source code update. The options are:

Parse the gmt_hash_table for server data sets and make a list from that.
Don't do the check and let @badnamegrid just crash and burn with a curl error

Seems to me the gmt_hashtable is the most useful approach since we know it will be up-to-date (and thus change once we deliver age*). I don't think we even need to change its format (so it will still work for <= 6.0) since the subdirs have the same prefix as the filenames (e.g., earth_relief_04m_p.grd will be in directory earth_relief).
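Since the subdirectory is just the filename prefix, it can be derived from the name alone. A sketch, assuming names follow dataset_xxy_g|p.grd and that the new files live under server/ as in the example table above; the pattern and function name are illustrative.

```python
import re

# Assumed naming pattern: <dataset>_<xxy>_<g|p>.grd, e.g. earth_relief_04m_p.grd
NAME_RE = re.compile(
    r"^(?P<dataset>[a-z]+(?:_[a-z]+)*)_(?P<res>\d\d[dms])_(?P<reg>[gp])\.grd$"
)

def server_path(filename):
    """Return 'server/<dataset>/<filename>', or None if the name does not fit."""
    m = NAME_RE.match(filename)
    if m is None:
        return None
    return f"server/{m['dataset']}/{filename}"

print(server_path("earth_relief_04m_p.grd"))
```

Names without a registration suffix, such as earth_relief_01m.grd, do not match and would be handled as old-style entries.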

If we do as @seisman says and put both current and new stuff in the same hash table then old behavior should continue just fine, while new usage requiring 6.1 will work its way. The only exceptions we need to handle are:

GMT < 6.0 uses earth_relief_xxy. If we replace that data with a link to the new files, then I think curl downloads what the link points to and gives it the link name; I have verified that this is what happens with command-line curl. So old users will get updated data in the guise of the old names. They will not be able to access the pixel version, marbles, or age grids since these are in subdirs that GMT < 6.0 does not know about. It is an incentive to upgrade.
GMT > 6.0 uses earth_relief_xxy. Here, we can intervene via the code and give a warning, but unless we also change the name, the user would download the new file with the old name.

So I think the only decision is the last one: Do we change the name to the new names when an old name is requested or do we allow users to continue with those names?

PaulWessel commented 4 years ago

A final point: Since the distinction between g and p is lost on the uninformed users, quietly allowing earth_relief_xxy is not so bad, as it is simpler and already in practice. It raises the question of whether we should add links age_xxy.grd that point to age/age_GK2007_xxy_g.grd.

OK, one more for @joa-quim: One argument for calling the files earth_age is that when the work we proposed for NASA (whether funded or not) happens, users wishing to make a map can choose not to give _xxy at all. I think it is just too unspecific to say

gmt grdimage @age -B -pdf map

and would strongly prefer

gmt grdimage @earth_age -B -pdf map

I know there is no mars_age but we will do earth_gravity, moon_gravity, etc. This is why I suggested and still suggest we don't use BlueMarble etc. names that are not very specific unless you know what that means. earth_day|night is much more generic, and when new data comes out that is not called BlueMarble, we don't have to agonize over a bad naming choice.

PaulWessel commented 4 years ago

I have tested the symbolic links on the 01d grid: Removed the old grid, added a symbolic link by the same name pointing to server/earth_relief/earth_relief_01d_g.grd. Then on the local machine removed the downloaded file and tried gmt grdinfo @earth_relief_01d.grd. Worked like a charm and it is written with the old name. So I think I can remove all the old files and set those symbolic links. Any objections, @joa-quim or @seisman? I know it is Friday so you may not be glued to your monitor but I am, so doing this within the hour.

GenericMappingTools / gmtserver-admin

Organization of server data #37