Open rhugonnet opened 4 years ago
Can also do this for successful extraction:
ogr2ogr -sql "SELECT * FROM \"05_rgi60_GreenlandPeriphery\" WHERE Name='Dendritgletscher'" out.shp 05_rgi60_GreenlandPeriphery.shp
But naming future versions with rgi70... is still a good idea.
Or even:
ogr2ogr -where "Name='Dendritgletscher'" out.shp 05_rgi60_GreenlandPeriphery.shp
Good to know, thanks a lot Bruce! I remember searching for a while at the time, and couldn't find this way of doing it elsewhere.
Yes, changing is probably for the best. For example, when varying shapefile names are involved (e.g. mass processing through Python/OGR bindings) in a script that requires SQL statements (for rasterization of specific features, for instance), it is a bit tricky to do an exception loop because of this naming.
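For scripts that loop over varying shapefile names, one way to avoid special-casing is to always double-quote the layer name, as in the ogr2ogr -sql command above. A minimal sketch, assuming the layer name equals the file name without its extension and that standard SQL quote-doubling is accepted by the SQL dialect in use; the helper name build_select is made up for illustration:

```python
from pathlib import Path


def build_select(shapefile: str, glacier_name: str) -> str:
    """Build a SELECT statement that tolerates layer names starting with a digit."""
    # Assumption: the layer name is the file name without its extension.
    layer = Path(shapefile).stem
    # Double quotes delimit the identifier; embedded double quotes are doubled.
    quoted_layer = '"' + layer.replace('"', '""') + '"'
    # Single quotes delimit the string literal; embedded single quotes are doubled.
    literal = "'" + glacier_name.replace("'", "''") + "'"
    return f"SELECT * FROM {quoted_layer} WHERE Name = {literal}"


sql = build_select("05_rgi60_GreenlandPeriphery.shp", "Dendritgletscher")
print(sql)
```

The resulting string could then be passed to ogr2ogr -sql or to the Python/OGR bindings, whatever the file happens to be called.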
I agree with the suggestion of choosing names that don't need to be quoted in SQL. However, why insert version numbers into filenames? Thinking ahead, this guarantees breaking changes to any code written to process an earlier version using hardcoded file names.
A few more nitpicks regarding file names that introduce friction for programmatic access:
zero-padded first-order region rgi_code
combined use of rgi_code and a name that does not match the first-order region name
Constructing a filename dynamically should be as simple as knowing the region code. If the integer region rgi_code is not human-readable enough (e.g. region_5.shp), how about using a code that is both human and machine readable (replacing or in addition to the current rgi_code)? Something like greenland_periphery? So we have:
rgi60
  regions
    region.shp (e.g. {id: 5, code: 'greenland_periphery', name: 'Greenland Periphery'})
    subregion.shp
  glaciers_by_region
    greenland_periphery.shp (i.e. <region.code>.shp)
p.s. Shouldn't region 5 be called 'Greenland', since it contains subregions 'Greenland Periphery' and 'Greenland Ice Sheet'?
I fully agree! I'll get back to this issue when we are ready to work on RGI7 beta (alpha is taking all my energy at the moment).
p.s. Shouldn't region 5 be called 'Greenland', since it contains subregions 'Greenland Periphery' and 'Greenland Ice Sheet'?
Yeah, this is unfortunate, because the 'Greenland Ice Sheet' subregion does not contain any glaciers. Moving forward, we have discussed in the committee that RGI5 should have better subregions, e.g. based on Rastner et al., fig. 1. We postponed this decision to after RGI7.1.
@ezwelty I am back to this issue now that it has become more urgent (we are in the process of moving the RGI to a new server, with the opportunity to change the file names). I'll send you the working document here when ready, but you make two arguments that I'd like to clarify:
zero-padded first-order region rgi_code
Assuming we keep the region numbers in the name (I think we should), zero padding is necessary to keep the files sorted properly (in the folder display, and in code if we manipulate strings).
In fact, my suggestion will be to handle all region ids as strings, in all occurrences (currently they are sometimes strings, sometimes not).
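To illustrate the sorting argument, here is a small sketch comparing string sorting with and without zero padding. The ids 1 to 19 are used purely for illustration:

```python
# Region ids as integers, used here only to generate the two string variants.
region_ids = list(range(1, 20))

# Strings sort lexicographically, so unpadded ids interleave ('1', '10', '11', ...).
unpadded = sorted(str(i) for i in region_ids)

# Zero-padded strings sort in the same order as the underlying numbers.
padded = sorted(f"{i:02d}" for i in region_ids)

print(unpadded[:4])
print(padded[:4])
```

With padding, the string order and the numeric order coincide, which is what a folder listing or a sorted() call on file names would rely on.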
The positives:
01-02, 01-03 etc.
The negatives:
combined use of rgi_code and a name that does not match the first-order region name
yeah, I've always hated that as well. Currently I'm leaning towards using only the region ids (01, 02, ...). I like your human-readable solution, but I still think the files should be ordered in a folder somehow, and alphabetical order won't work now that people are really used to the region numbers (in many, many tables across many, many papers).
What are your thoughts on these two points?
@fmaussion I don't have a strong preference between region id '05' or 5, as long as it is consistent in the data and filenames (e.g. regions.shp: {id: '05', ...} -> 05.shp), which sounds like what you are suggesting.
I'm less convinced that using only region ids in the filenames is the best idea, since it would look something like this:
/regions
  regions.shp
/glaciers_by_region
  01.shp
  02.shp
Not particularly human-friendly, since it requires separately looking up region id by region name. If you consider the order of the region ids meaningful, and the ids themselves canon, then a human/machine compromise could be to include both the id and the code in the filename:
<id>_<code>.shp (where {id: '05', code: 'greenland_periphery', name: 'Greenland Periphery'})
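As a rough sketch of what this compromise could look like in code, assuming a region record with the three fields shown above (the Region class and glacier_filename helper are hypothetical names, not part of any RGI tooling):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    id: str    # zero-padded string id, e.g. '05'
    code: str  # machine-readable code, e.g. 'greenland_periphery'
    name: str  # human-readable name, e.g. 'Greenland Periphery'


def glacier_filename(region: Region, ext: str = "shp") -> str:
    """Compose <id>_<code>.<ext> from a region record."""
    return f"{region.id}_{region.code}.{ext}"


region = Region(id="05", code="greenland_periphery", name="Greenland Periphery")
print(glacier_filename(region))
```

The file name is then fully determined by the region record, so no separate lookup between ids and names is needed.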
@ezwelty understood - basically we will ship with a metadata file telling machines how to read the files.
Regarding another point you made: removing the version number from the filename is unfortunately not possible, because the region files can also be downloaded independently (I wanted to avoid this but the steering committee is against that). I think their name needs to be self-explanatory.
How to add the version number to the file names is a bit of a headache though, more on this later.
basically we will ship with a metadata file telling machines how to read the files.
I imagine the mapping of region ids to codes and pretty-formatted name would be in whatever data file contains the region geometries (regions.shp, regions.geojson, etc.).
the region files can also be downloaded independently
Couldn't that be downloaded as e.g. rgi70-regions.zip that unpacks to rgi70-regions/regions.shp, rgi70-regions/subregions.shp (i.e. version in top directory only)? Or maybe I misunderstood.
Sorry I meant e.g. https://www.glims.org/RGI/rgi60_files/11_rgi60_CentralEurope.zip which is only one region of the RGI, while https://www.glims.org/RGI/rgi60_files/00_rgi60.zip is the complete RGI (a zip of zipfiles).
Your suggestion actually makes me wonder (re: meme below). I'm a big proponent of having the zip file and the shapes in it having strictly the same name (Python and QGIS can read from zip files; QGIS doesn't care about the name being the same, but in Python you need to infer the name of the .shp file by parsing the zip or "guess it", which is easy if all files have the same name). I take from your suggestion that you do not care about that?
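For reference, "parsing the zip" can be done with the standard library alone. A minimal sketch, in which find_shapefiles is a made-up helper and the archive is built in memory with empty placeholder members:

```python
import io
import zipfile


def find_shapefiles(zip_source) -> list[str]:
    """Return the names of all .shp members in a zip archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return [n for n in archive.namelist() if n.lower().endswith(".shp")]


# Build a small in-memory archive with empty placeholder members for demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for ext in ("shp", "shx", "dbf", "prj"):
        archive.writestr(f"11_rgi60_CentralEurope.{ext}", b"")

members = find_shapefiles(buffer)
print(members)
```

If the zip and the shapefile share the same name, this lookup can be skipped and the member name derived directly from the archive name.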
Hi both. I agree with pretty much all, a couple comments though:
For 01.shp, 02.shp, see my original comment and Bruce's answers at the top: SQL does not like files starting with numbers and this requires a bit of tweaking. A lot of people still use GDAL/OGR on the command line, and that would limit accessibility.
yeah I'm working on this consistency across versions as we speak, and it's only partly funny ;-)
For the versioning system, the current standpoint is that version 7 will be the last version of RGI at year 2000. Future iterations will be 7.1, 7.2, etc. RGI v8 will then be targeting another reference year.
SQL does not like files starting with numbers and this requires a bit of tweaking.
Ah right.
Sorry I meant e.g. https://www.glims.org/RGI/rgi60_files/11_rgi60_CentralEurope.zip
Hmm, well if every file needs to stand on its own, then I guess I replace my nested file structure suggestion with an equivalent flat file structure. Something like rgi<version>-<subject>-<partition>, where - separates each component, and each component can only contain the characters a-z, 0-9, and _ (very conservative, for cross-platform support).
rgi<version>: All the RGI.
rgi<version>-regions: First-order regions.
rgi<version>-subregions: Second-order regions.
rgi<version>-glaciers-<region_id>-<region_code> (or rgi<version>-glaciers_by_region-<region_id>-<region_code>): Glaciers in first-order region id (aka region code).
To avoid ambiguous version tags in the future (e.g. is 112 = 1.12 or 11.2?), it might be worth instead using rgi<major>-<minor> or even rgi<target_year>-<release>.
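The flat naming scheme with its conservative character set could be enforced with a few lines of code. A minimal sketch, in which flat_name is a hypothetical helper and the version tag rgi60 is only an example value:

```python
import re

# Each component may only contain a-z, 0-9 and underscore.
COMPONENT = re.compile(r"[a-z0-9_]+")


def flat_name(*components: str) -> str:
    """Join validated components with '-' to form a flat file name stem."""
    for component in components:
        if not COMPONENT.fullmatch(component):
            raise ValueError(f"invalid component: {component!r}")
    return "-".join(components)


print(flat_name("rgi60", "regions"))
print(flat_name("rgi60", "glaciers", "05", "greenland_periphery"))
```

Because - is excluded from the component character set, a name built this way can be split back into its components unambiguously with str.split("-").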
As for format, I don't care for shapefiles (proprietary, multiple files, attribute name limit, no NULL), so if it were up to me, shapefiles would all be replaced by OGC GeoPackage (or GeoJSON for lightweight applications). But I get that the shapefile remains the de-facto standard...
To avoid ambiguous version tags in the future (e.g. is 112 = 1.12 or 11.2?), it might be worth instead using rgi<major>-<minor>
Why not use a dot? rgi<major>.<minor>
This is what GitHub offers as download, so I thought Windows would be fine with it?
or even rgi<target_year>-<release>.
CalVer for the win! But I think this ship has sailed.
I'll cede you the dot ;) These days, . should be a very safe addition to the very conservative character set I listed above, as long as it isn't the first or last character in a file/folder name, and is only used in file names with a later dot denoting the file extension.
@ezwelty @rhugonnet if you feel like it, here is my current write up: https://docs.google.com/document/d/1pOJURM_jmkX2L2fnUj0dLVqSuwpcqDrsMzc9N3METkU/edit?usp=sharing
This will be discussed at length tonight, but won't be set in stone right away, so your input is welcome (open for comments).
OK so we will discuss this (at length) over the coming weeks (first meeting today 6pm), but if you have a minute please have a look at the document, section 3.1. Both of you seem in favor of a "target year" versioning system. I've added your arguments to the document; let me know if something's missing.
Fixed in RGI 7.0! Files start with RGI2000 ...
RGI version RGI v6
Describe the issue RGI shapefiles start with a number (e.g., 05_rgi60_GreenlandPeriphery.shp), which is not supported in the SQL dialect for selecting specific features or using specific conditions (whether through bash, Python or other). This issue might not be obvious to all users (those unfamiliar with SQL), who may thus fail to manipulate the shapefile.
Screenshots Trying to extract Dendritgletscher glacier outline from the Greenland Periphery shapefile.
Using a shapefile with layer renamed (ogr2ogr out.shp in.shp):
Polygon is extracted successfully:
Suggested solution In future RGI versions, provide a naming that does not start with a number (e.g., rgi70_05_GreenlandPeriphery.shp)