Naming of RGI shapefiles and SQL syntax

rhugonnet commented 4 years ago

RGI version: RGI v6

Describe the issue: RGI shapefile names start with a number (e.g., 05_rgi60_GreenlandPeriphery.shp), which is not supported in the SQL dialect when selecting specific features or using specific conditions (whether through bash, Python or other). This issue might not be obvious to users unfamiliar with SQL, who then fail to manipulate the shapefile.

Screenshots: Trying to extract the Dendritgletscher glacier outline from the Greenland Periphery shapefile.

Using a shapefile with layer renamed (ogr2ogr out.shp in.shp):

Polygon is extracted successfully:

Suggested solution: In future RGI versions, provide a naming that does not start with a number (e.g., rgi70_05_GreenlandPeriphery.shp).

bruceraup commented 4 years ago

Can also do this for successful extraction:

ogr2ogr -sql "SELECT * FROM \"05_rgi60_GreenlandPeriphery\" WHERE Name='Dendritgletscher'" out.shp 05_rgi60_GreenlandPeriphery.shp

But naming future versions with rgi70... is still a good idea.

bruceraup commented 4 years ago

Or even:

ogr2ogr -where "Name='Dendritgletscher'" out.shp 05_rgi60_GreenlandPeriphery.shp
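For scripts that call ogr2ogr from Python, the two commands above can be sketched as argument lists. This is only an illustration: the helper names `extract_with_sql` and `extract_with_where` are hypothetical, and the inputs are assumed to contain no quote characters. The backslashes in the shell command are shell escaping only; in an argument list the double quotes around the layer name are written directly.

```python
import shlex


def extract_with_sql(src: str, dst: str, layer: str, name: str) -> list[str]:
    """ogr2ogr call using -sql; the layer name is double-quoted inside the SQL."""
    sql = f"SELECT * FROM \"{layer}\" WHERE Name='{name}'"
    return ["ogr2ogr", "-sql", sql, dst, src]


def extract_with_where(src: str, dst: str, name: str) -> list[str]:
    """ogr2ogr call using -where; no layer name appears in the filter."""
    return ["ogr2ogr", "-where", f"Name='{name}'", dst, src]


src = "05_rgi60_GreenlandPeriphery.shp"
cmd = extract_with_sql(src, "out.shp", "05_rgi60_GreenlandPeriphery", "Dendritgletscher")

# shlex.join renders each list as a shell command line, adding shell quoting.
print(shlex.join(cmd))
print(shlex.join(extract_with_where(src, "out.shp", "Dendritgletscher")))
```

Either list could be passed to `subprocess.run` on a machine where ogr2ogr is installed.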

rhugonnet commented 4 years ago

> Can also do this for successful extraction:
>
> ogr2ogr -sql "SELECT * FROM \"05_rgi60_GreenlandPeriphery\" WHERE Name='Dendritgletscher'" out.shp 05_rgi60_GreenlandPeriphery.shp
>
> But naming future versions with rgi70... is still a good idea.

Good to know, thanks a lot Bruce! I remember searching for a while at the time, and couldn't find this way of doing it elsewhere.

Yes, changing is probably for the best. For example, when varying shapefile names are involved (e.g. mass processing through Python/OGR bindings) in a script that requires SQL statements (for rasterization of specific features, for instance), it is a bit tricky to handle this naming as an exception inside a loop.
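One way to avoid a special case in such a loop is to quote every layer name, whether or not it starts with a digit. The sketch below assumes the SQL dialect accepts double-quoted identifiers (as in Bruce's command) and that the layer name of a shapefile is its file name without the extension; `select_statement` is a hypothetical helper.

```python
from pathlib import Path


def select_statement(shapefile: str, where: str) -> str:
    """Build a SELECT for any shapefile, always quoting the layer name."""
    # For a shapefile, the layer name is the file name without its extension.
    layer = Path(shapefile).stem
    if '"' in layer:
        raise ValueError(f"unsupported character in layer name: {layer!r}")
    return f'SELECT * FROM "{layer}" WHERE {where}'


# Hypothetical batch: the same code path handles both naming styles.
for shp in ["05_rgi60_GreenlandPeriphery.shp", "rgi70_05_GreenlandPeriphery.shp"]:
    print(select_statement(shp, "Name='Dendritgletscher'"))
```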

ezwelty commented 2 years ago

I agree with the suggestion of choosing names that don't need to be quoted in SQL. However, why insert version numbers into filenames? Thinking ahead, this guarantees breaking changes to any code written to process an earlier version using hardcoded file names.

A few more nitpicks regarding file names that introduce friction for programmatic access:

- mixed use of PascalCase and snake_case
- zero-padded first-order region rgi_code
- combined use of rgi_code and a name that does not match the first-order region name

Constructing a filename dynamically should be as simple as knowing the region code. If the integer region rgi_code is not human-readable enough (e.g. region_5.shp), how about using a code that is both human and machine readable (replacing or in addition to the current rgi_code)? Something like greenland_periphery? So we have:

rgi60
- regions
  - region.shp (e.g. {id: 5, code: 'greenland_periphery', name: 'Greenland Periphery'})
  - subregion.shp
- glaciers_by_region
  - greenland_periphery.shp (i.e. <region.code>.shp)
  - ...
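With such a layout, building a file path from a region id could look like the following sketch. The `Region` class, the `REGIONS` table and `glacier_file` are hypothetical names; in practice the records would be read from region.shp rather than typed in.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Region:
    id: int
    code: str
    name: str


# One illustrative record, taken from the example above.
REGIONS = {5: Region(id=5, code="greenland_periphery", name="Greenland Periphery")}


def glacier_file(root: str, region_id: int) -> Path:
    """Path to a region's glacier outlines: <root>/glaciers_by_region/<code>.shp."""
    region = REGIONS[region_id]
    return Path(root) / "glaciers_by_region" / f"{region.code}.shp"


print(glacier_file("rgi60", 5).as_posix())
```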

p.s. Shouldn't region 5 be called 'Greenland', since it contains subregions 'Greenland Periphery' and 'Greenland Ice Sheet'?

fmaussion commented 2 years ago

I fully agree! I'll get back to this issue when we are ready to work on RGI7 beta (alpha is taking all my energy at the moment).

> p.s. Shouldn't region 5 be called 'Greenland', since it contains subregions 'Greenland Periphery' and 'Greenland Ice Sheet'?

Yeah this is unfortunate, because the 'Greenland Ice Sheet' subregion does not contain any glacier. Moving forward, we have discussed in the committee that RGI5 should have better subregions, e.g. based on Rastner et al fig 1. We postponed this decision to after RGI7.1

fmaussion commented 2 years ago

@ezwelty I am back to this issue now that it has become more urgent (we are in the process of moving the RGI to a new server, with the opportunity to change the file names). I'll send you the working document here when ready, but you make two arguments that I'd like to clarify:

> zero-padded first-order region rgi_code

Assuming we keep the region numbers in the name (I think we should), zero padding is necessary to keep the files sorted properly (in the folder display, and in code if we manipulate strings).

In fact, my suggestion will be to handle all region ids as strings, in all occurrences (currently they are sometimes strings, sometimes not).

The positives:

- it makes the zero-padding meaningful
- it allows merging O1 and O2 region ids as 01-02, 01-03, etc.
- it is consistent across file names and data
- numbers convey an order (e.g. when you do a plot), but in fact the ids are arbitrary. IDs as strings feel more intuitive.
- zero padding is prettier in columns, text tables, etc.

The negatives:

- storing region numbers in csv files is not so practical since many tools will convert them to ints
- probably this takes a bit more space in shapefiles
- if you don't like zero padding then this is not for you
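The sorting and integer-conversion points can be illustrated with a short sketch. Ids 1 to 19 are used here purely for illustration, and the variable names are hypothetical.

```python
region_ids = list(range(1, 20))

# Sorting ids as strings: without padding, '10' sorts before '2'.
unpadded = sorted(str(i) for i in region_ids)
padded = sorted(f"{i:02d}" for i in region_ids)

print(unpadded[:4])  # ['1', '10', '11', '12']
print(padded[:4])    # ['01', '02', '03', '04']

# A tool that parses the column as integers drops the padding ...
as_int = int("05")
# ... which then has to be restored explicitly when rebuilding the id.
restored = str(as_int).zfill(2)

# O1 and O2 ids combine naturally when both are strings.
o1, o2 = "01", "02"
o2_id = f"{o1}-{o2}"
print(restored, o2_id)
```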

> combined use of rgi_code and a name that does not match the first-order region name

Yeah, I've always hated that as well. Currently I'm leaning towards using only the region ids (01, 02, ...). I like your human-readable solution, but I still think the files should be ordered in a folder somehow, and alphabetical order won't do after people have become really used to the region numbers (in many, many tables across many, many papers).

What are your thoughts on these two points?

ezwelty commented 2 years ago

@fmaussion I don't have a strong preference between region id '05' or 5, as long as it is consistent in the data and filenames (e.g. regions.shp: {id: '05', ...} -> 05.shp), which sounds like what you are suggesting.

I'm less convinced that using only region ids in the filenames is the best idea, since it would look something like this:

/regions
- regions.shp
/glaciers_by_region
- 01.shp
- 02.shp
- ...

Not particularly human-friendly, since it requires separately looking up region id by region name. If you consider the order of the region ids meaningful, and the ids themselves canon, then a human/machine compromise could be to include both the id and the code in the filename:

<id>_<code>.shp (where {id: '05', code: 'greenland_periphery', name: 'Greenland Periphery'})
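A rough sketch of how such a name could be built and parsed back into its parts. The pattern assumes a two-digit id and a lowercase snake_case code, which is an assumption of this example; `build` and `parse` are hypothetical helpers.

```python
import re

# <id>_<code>.shp, with a two-digit id and a lowercase snake_case code.
FILENAME = re.compile(r"(?P<id>\d{2})_(?P<code>[a-z][a-z0-9_]*)\.shp")


def build(region_id: str, code: str) -> str:
    """Compose the file name from a string id and a region code."""
    return f"{region_id}_{code}.shp"


def parse(filename: str) -> tuple[str, str]:
    """Split a file name back into (id, code), or raise ValueError."""
    match = FILENAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"not an <id>_<code>.shp name: {filename!r}")
    return match["id"], match["code"]


name = build("05", "greenland_periphery")
print(name, parse(name))
```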

fmaussion commented 2 years ago

@ezwelty understood - basically we will ship with a metadata file telling machines how to read the files.

Regarding another point you made: removing the version number from the filename is unfortunately not possible, because the region files can also be downloaded independently (I wanted to avoid this but the steering committee is against that). I think their name needs to be self-explanatory.

How to add the version number to the file names is a bit of a headache though, more on this later.

ezwelty commented 2 years ago

> basically we will ship with a metadata file telling machines how to read the files.

I imagine the mapping of region ids to codes and pretty-formatted name would be in whatever data file contains the region geometries (regions.shp, regions.geojson, etc..).

> the region files can also be downloaded independently

Couldn't that be downloaded as e.g. rgi70-regions.zip that unpacks to rgi70-regions/regions.shp, rgi70-regions/subregions.shp (i.e. version in top directory only)? Or maybe I misunderstood.

fmaussion commented 2 years ago

Sorry I meant e.g. https://www.glims.org/RGI/rgi60_files/11_rgi60_CentralEurope.zip which is only one region of the RGI, while https://www.glims.org/RGI/rgi60_files/00_rgi60.zip is the complete RGI (a zip of zipfiles).

Your suggestion actually makes me wonder (re: meme below). I'm a big proponent of the zip file and the shapes in it having strictly the same name (python and qgis can read from zip files - qgis doesn't care about the name being the same, but in python you need to infer the name of the .shp file from parsing the zip or "guess it", which is easy if all files have the same name). I take from your suggestion that you do not care about that?
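The two options mentioned here, parsing the zip or guessing the name, might be sketched as follows with the standard library. `find_shapefiles` and `guess_shapefile` are hypothetical helpers, and the archive is a dummy built in memory for illustration.

```python
import io
import zipfile


def find_shapefiles(archive) -> list[str]:
    """Parse the zip: return the names of all .shp members."""
    with zipfile.ZipFile(archive) as zf:
        return [n for n in zf.namelist() if n.lower().endswith(".shp")]


def guess_shapefile(zip_name: str) -> str:
    """Guess the member name, assuming zip and shapefile share the same stem."""
    stem = zip_name.rsplit("/", 1)[-1].removesuffix(".zip")
    return f"{stem}.shp"


# Build a small in-memory archive to illustrate (member contents are empty).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    for ext in ("shp", "shx", "dbf", "prj"):
        zf.writestr(f"11_rgi60_CentralEurope.{ext}", b"")

print(find_shapefiles(buffer))
print(guess_shapefile("11_rgi60_CentralEurope.zip"))
```

The guess only works when the naming is strictly consistent, while parsing works for any archive but needs the file to be opened first.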

rhugonnet commented 2 years ago

Hi both. I agree with pretty much all, a couple comments though:

- Having the version number in the files may not be a big issue for the RGI, because versions are not released so often (hopefully RGIv8 will not be needed for a bit :wink:). Making the positioning/nomenclature of the version number consistent across future RGI versions, starting from RGIv7, would be quite useful though.
- On the subfiles 01.shp, 02.shp, see my original comment and Bruce's answers at the top: SQL does not like files starting with numbers and this requires a bit of tweaking. A lot of people still use GDAL/OGR in command line, and that would limit accessibility.

fmaussion commented 2 years ago

yeah I'm working on this consistency across versions as we speak, and it's only partly funny ;-)

For the versioning system, the current standpoint is that version 7 will be the last version of RGI at year 2000. Future iterations will be 7.1, 7.2, etc. RGI v8 will then be targeting another reference year.

ezwelty commented 2 years ago

> SQL does not like files starting with numbers and this requires a bit of tweaking.

Ah right.

> Sorry I meant e.g. https://www.glims.org/RGI/rgi60_files/11_rgi60_CentralEurope.zip

Hmm, well if every file needs to stand on its own, then I guess I'd replace my nested file structure suggestion with an equivalent flat file structure. Something like rgi<version>-<subject>-<partition>, where - separates each component, and each component can only contain the characters a-z, 0-9, and _ (very conservative, for cross-platform support).

- rgi<version>: All the RGI.
- rgi<version>-regions: First-order regions.
- rgi<version>-subregions: Second-order regions.
- rgi<version>-glaciers-<region_id>-<region_code> (or rgi<version>-glaciers_by_region-<region_id>-<region_code>): Glaciers in first-order region id (aka region code).
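As a rough sketch, the scheme above could be enforced with a small builder that validates each component against the conservative character set. `build_name` is a hypothetical helper, and the version string "60" is only an example value.

```python
import re

# Each component may only contain a-z, 0-9 and "_"; "-" separates components.
COMPONENT = re.compile(r"[a-z0-9_]+")


def build_name(version: str, *parts: str) -> str:
    """Join 'rgi<version>' and any further components with '-'."""
    components = [f"rgi{version}", *parts]
    for component in components:
        if not COMPONENT.fullmatch(component):
            raise ValueError(f"invalid component: {component!r}")
    return "-".join(components)


print(build_name("60"))
print(build_name("60", "regions"))
print(build_name("60", "glaciers", "05", "greenland_periphery"))
```

Because "-" never appears inside a component, a name can be split back into its parts with `name.split("-")`.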

To avoid ambiguous version tags in the future (e.g. is 112 = 1.12 or 11.2?), it might be worth instead using rgi<major>-<minor> or even rgi<target_year>-<release>.
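The ambiguity is easy to show with a tiny sketch that enumerates every (major, minor) reading of a tag made of concatenated digits. Both function names are hypothetical.

```python
def possible_versions(tag: str) -> list[tuple[int, int]]:
    """All (major, minor) readings of a tag of concatenated digits."""
    return [(int(tag[:i]), int(tag[i:])) for i in range(1, len(tag))]


def version_tag(major: int, minor: int, sep: str = "-") -> str:
    """Tag with an explicit separator, which has a single reading."""
    return f"rgi{major}{sep}{minor}"


print(possible_versions("112"))  # [(1, 12), (11, 2)]
print(version_tag(7, 1))         # rgi7-1
```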

As for format, I don't care for shapefiles (proprietary, multiple files, attribute name limit, no NULL), so if it were up to me, shapefiles would all be replaced by OGC GeoPackage (or GeoJSON for lightweight applications). But I get that the shapefile remains the de-facto standard...

fmaussion commented 2 years ago

> To avoid ambiguous version tags in the future (e.g. is 112 = 1.12 or 11.2?), it might be worth instead using rgi<major>-<minor>

Why not use a dot? rgi<major>.<minor> This is what GitHub offers as download, so I thought Windows would be fine with it?

> or even rgi<target_year>-<release>.

CalVer for the win! But I think this ship has sailed.

ezwelty commented 2 years ago

I'll cede you the dot ;) These days, . should be a very safe addition to the very conservative character set I listed above. As long as it isn't the first or last character in a file/folder name, and only used in file names with a later dot denoting the file extension.

fmaussion commented 2 years ago

@ezwelty @rhugonnet if you feel like it, here is my current write up: https://docs.google.com/document/d/1pOJURM_jmkX2L2fnUj0dLVqSuwpcqDrsMzc9N3METkU/edit?usp=sharing

This will be discussed at length tonight, but won't be set in stone right away, so your input is welcome (open for comments)

fmaussion commented 2 years ago

OK so we will discuss this (at length) over the coming weeks (first meeting today 6pm), but if you have a minute please have a look at the document, section 3.1. Both of you seem in favor of a "target year" versioning system - I've added your arguments to the document, let me know if something's missing.

fmaussion commented 1 year ago

Fixed in RGI 7.0! Files start with RGI2000 ...

GLIMS-RGI / rgi_issue_tracker

Naming of RGI shapefiles and SQL syntax #9