Closed odscrachel closed 1 year ago
DCAT has a spatial/geographic coverage property, which we should check our modelling against.
Originally we considered DCAT standard for framing the fields; the idea of the location attribute was precisely this:
However this should be secondary to the countries specification.
They suggest using a GeoNames URL instead of a simple location name. That might be a good idea; the alternative would be to use subnational unit codes, but they tend to change over time.
They use a specific class for location attributes:
Should we do the same? Or just a centroid point coordinate field?
The structure proposed by @odscrachel makes sense. I think users are better served with ISO code for countries as suggested and name of the region or location for subnational level, rather than a code or url.
Seems sensible to recommend using names as provided in geonames or similar as a reference for region naming, but I think OK to recommend rather than prescribe this.
Question: can we include a string of multiple named regions at subnational level?
`countries` as an array of ISO-3166-1 alpha-2 codes looks good.
The problems with modelling `subnational_coverage` as a list of names without enforcing a particular codelist are that the names can't be validated, and that consuming applications and users can't reliably use names to look up translations and other information such as population counts.
As for `countries`, ideally, we would choose an authoritative codelist for `subnational_coverage` and leave looking up code labels to consuming applications. For example, if we used ISO 3166-2, a dataset covering Kampala and Wakiso would look like this:
{
"id": "1",
"countries": [
"UG"
],
"subnational_coverage": [
"UG-102",
"UG-113"
]
}
This is a good approach because:
id | countries | subnational_coverage |
---|---|---|
1 | UG | UG-102,UG-113 |
However, the challenges to this approach are:
A more flexible approach would be to allow publishers to use their own classifications, to disclose the gazetteer from which the names are drawn, and to include the code's labels, e.g. a publisher using ISO 3166-2 would look like this:
{
"id": "1",
"countries": [
"UG"
],
"subnational_coverage": [
{
"id": "UG-102",
"scheme": "ISO-3166-2",
"description": "Kampala"
},
{
"id": "UG-113",
"scheme": "ISO-3166-2",
"description": "Wakiso"
}
]
}
Another publisher could use Geonames:
{
"id": "1",
"countries": [
"UG"
],
"subnational_coverage": [
{
"description": "Kampala",
"id": "232422",
"scheme": "GEONAMES",
"uri": "https://sws.geonames.org/232422"
},
{
"description": "Wakiso",
"id": "448224",
"scheme": "GEONAMES",
"uri": "https://sws.geonames.org/448224"
}
]
}
We could allow publishers to 'fall back' to only providing names:
{
"id": "1",
"countries": [
"UG"
],
"subnational_coverage": [
{
"description": "Kampala"
},
{
"description": "Wakiso"
}
]
}
However, there are some challenges with this approach, too:
Without `scheme`, the data can't be validated
id | countries |
---|---|
1 | UG |

id | subnational_coverage/0/description |
---|---|
1 | Kampala |
1 | Wakiso |
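
The tabular serialization shown above can be sketched in Python. This is only an illustration: `flatten_dataset` is a hypothetical helper, not part of any existing tooling, and the column names simply mirror the ones in the tables above.

```python
def flatten_dataset(dataset):
    """Split a dataset into a main-table row and child-table rows.

    Hypothetical helper for illustration only; real flattening tools
    may name columns and tables differently.
    """
    main_row = {
        "id": dataset["id"],
        # An array of simple strings fits in one cell, joined by commas.
        "countries": ",".join(dataset.get("countries", [])),
    }
    # An array of objects needs its own table, keyed by the dataset id.
    child_rows = [
        {
            "id": dataset["id"],
            "subnational_coverage/0/description": entry.get("description", ""),
        }
        for entry in dataset.get("subnational_coverage", [])
    ]
    return main_row, child_rows


dataset = {
    "id": "1",
    "countries": ["UG"],
    "subnational_coverage": [
        {"description": "Kampala"},
        {"description": "Wakiso"},
    ],
}
main_row, child_rows = flatten_dataset(dataset)
```

With an array of codes, everything fits in `main_row`; with an array of objects, publishers working in spreadsheets have to maintain the second table as well.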
If we choose to target DCAT conformance, then, in addition to `countries` and `subnational_coverage`, I think we would want to introduce a `spatial` object:
Field | Description |
---|---|
spatial | The geographical area covered by the dataset. |
spatial.gazetteer | An entry from a geographical index or directory representing the spatial area. |
spatial.bbox | A geographic bounding box delimiting the spatial area. |
spatial.geometry | A set of coordinates denoting the vertices of the spatial area. |
spatial.centroid | The coordinates of the centre of the spatial area. |
{
"spatial": {
"gazetteer": {
"id": "232422",
"scheme": "GEONAMES",
"description": "Kampala",
"uri": "https://sws.geonames.org/232422"
},
"bbox": [
-10.0,
-10.0,
10.0,
10.0
],
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
-10.0,
-10.0
],
[
10.0,
-10.0
],
[
10.0,
10.0
],
[
-10.0,
10.0
],
[
-10.0,
-10.0
]
]
]
},
"centroid": [
0,
0
]
}
}
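
For publishers who already have a geometry, the bounding box could be derived rather than entered by hand. The following is a minimal sketch, assuming `bbox` uses the `[west, south, east, north]` order defined for GeoJSON in RFC 7946; `ring_bbox` is a hypothetical helper name and the sketch only handles a single Polygon.

```python
def ring_bbox(geometry):
    """Return [min_x, min_y, max_x, max_y] for a GeoJSON Polygon.

    Only the exterior ring (the first ring) is considered, and the
    polygon is assumed not to cross the antimeridian.
    """
    if geometry["type"] != "Polygon":
        raise ValueError("this sketch only handles Polygon geometries")
    exterior = geometry["coordinates"][0]
    xs = [position[0] for position in exterior]
    ys = [position[1] for position in exterior]
    return [min(xs), min(ys), max(xs), max(ys)]


geometry = {
    "type": "Polygon",
    "coordinates": [
        [[-10.0, -10.0], [10.0, -10.0], [10.0, 10.0], [-10.0, -10.0]]
    ],
}
bbox = ring_bbox(geometry)
```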
The flexible approach using names or GeoNames entries would probably be best from the user's side. Adding the spatial attributes could resolve any possible misinterpretation. Do we need to validate this input?
ISO codes for the subnational level: that should be a standard, but is it really? I tend to always add those fields, but often end up using datasets that apply different codelists, or older ISO codes that have since changed. Also, these are limited to ADM level 1, while data might go down to ADM2, 3 or 4.
Agree, validation at this level is not usually needed; the biggest requirement is being descriptive about what areas are covered, so the flexible approach is preferred. I think the gazetteer entry under DCAT suffices. I don't think we want to compel users to include bbox and centroid for every ADM1 unit in the analysis: bbox and centroid for an event footprint or overall analysis domain are useful, but describing every ADM unit in the analysis is asking too much and not useful.
Agree, too much detail to add if this applies to each individual location/resources; requires spatial tooling to get the bbox or centroid. I believe the BBox related to the main dataset can be automatically shown in consuming app based on country ISO-a2, as in the current DDH?
Yes, as I mentioned in https://github.com/GFDRR/rdl-standard/issues/52#issuecomment-1562089134, ISO 3166-2 is only for principal subdivisions (i.e. ADM-1) and different publishers might use different codelists anyway. As such, I agree that the flexible approach is preferable, even if it comes with some added complexity compared to mandating a particular scheme.
Regarding validation, given that the flexible approach is preferred, the ideal scenario would be to choose a recommended scheme and implement validation of `subnational_coverage.id` and `subnational_coverage.name` when that scheme is declared in `subnational_coverage.scheme`. That type of validation is not supported by JSON Schema, so we would need to implement an additional validation check (not a problem: we do similar things in different versions of CoVE for several other standards). However, depending on the chosen scheme, it might be complex to get the codelist to validate against and to keep it up to date. For example, to get a machine-readable version of the ISO-3166-2 codelist, we would either need to buy the country codes collection from ISO ($330 USD per annum), scrape the data from the Online Browsing Platform (not sure of licensing implications) or use something like the iso-3166 node package (not sure how reliable that is).
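
As a rough idea of what such an additional check could look like, here is a minimal Python sketch. It is not how CoVE implements its checks: `SAMPLE_CODELISTS` is a tiny hand-written sample rather than a real codelist, `check_subnational_coverage` is a hypothetical name, and the label is read from `description` because that is the key used in the examples earlier in this thread.

```python
# Tiny illustrative sample; a real check would load a full, maintained codelist.
SAMPLE_CODELISTS = {
    "ISO-3166-2": {"UG-102": "Kampala", "UG-113": "Wakiso"},
}


def check_subnational_coverage(entries, codelists=SAMPLE_CODELISTS):
    """Return error messages for entries that declare a known scheme.

    Entries with no scheme, or with a scheme we hold no codelist for,
    are skipped rather than reported.
    """
    errors = []
    for index, entry in enumerate(entries):
        codelist = codelists.get(entry.get("scheme"))
        if codelist is None:
            continue
        code = entry.get("id")
        if code not in codelist:
            errors.append(f"subnational_coverage/{index}/id: {code!r} is not in the codelist")
        elif "description" in entry and entry["description"] != codelist[code]:
            errors.append(f"subnational_coverage/{index}/description does not match {code!r}")
    return errors


entries = [
    {"id": "UG-102", "scheme": "ISO-3166-2", "description": "Kampala"},
    {"id": "UG-999", "scheme": "ISO-3166-2"},
    {"description": "Wakiso"},
]
errors = check_subnational_coverage(entries)
```

The third entry is not reported because it declares no scheme, which matches the idea of validating only when the recommended scheme is declared.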
So maybe we don't worry about validation and leave it up to consuming applications. We can reuse the location gazetteers codelist from OCDS as an open codelist for `.scheme` so that consuming applications can reliably identify the gazetteer from which the identifiers are drawn.
Regarding changes, if a scheme never reuses old codes for new geographies, then we can omit any version identifier from `scheme`. If a scheme reuses old codes for new geographies, then we'd need to include a version identifier.
> I don't think we want to compel users to include bbox, centroid for every ADM1 unit in the analysis - bbox and centroid for an event footprint or overall analysis domain is useful, but to describe every ADM unit in the analysis is asking too much and not useful.
Agreed, that's in line with my proposal in https://github.com/GFDRR/rdl-standard/issues/52#issuecomment-1562089134, which has `spatial` as an object (not an array) at the dataset level, so just one `.gazetteer`, `.bbox`, `.geometry` and `.centroid` per dataset. If, as in that proposal, we only had the `spatial` object, then `.gazetteer` would need to describe the overall spatial coverage of the dataset, i.e. if a dataset covered both Kampala and Wakiso there would need to be a suitable gazetteer entry covering both regions. I don't think such entries exist, hence having the dataset-level `subnational_coverage` field, which is also a list of gazetteer entries, to provide for listing more than one subnational region.
On reflection, I think it might be cleaner to put everything under `spatial` and to make `spatial.gazetteers` an array. That will also make for easier reuse of `spatial` when we want to describe spatial coverage at other levels in the standard.
Field | Title | Description | Type | Format |
---|---|---|---|---|
spatial | Spatial coverage | The geographical area covered by the dataset. | object | |
spatial.countries | Countries | The countries covered by the geographical area, from the open country codelist (ISO 3166-1). | array (string) | |
spatial.gazetteerEntries | Gazetteer entries | Entries from geographical indices or directories describing the geographical area. This field should be used to describe subnational coverage. Use of is recommended. | array (object) | |
spatial.gazetteerEntries.id | Identifier | An identifier drawn from the gazetteer identified in .scheme. | string | |
spatial.gazetteerEntries.scheme | Scheme | The gazetteer from which the entry is drawn, from the open locationGazetteers codelist. | string | |
spatial.gazetteerEntries.description | Description | A description for the gazetteer entry. | string | |
spatial.gazetteerEntries.uri | Uniform resource locator | A URI for the gazetteer entry. | string | iri |
spatial.bbox | Bounding box | A geographic bounding box delimiting the geographical area. | array (number) | |
spatial.geometry | Geometry | A set of coordinates denoting the vertices of the geographical area. | | |
spatial.geometry.type | Type | The GeoJSON geometry type that is described by .coordinates, from the closed geometryType codelist. | string | |
spatial.geometry.coordinates | Coordinates | One or more GeoJSON positions according to the GeoJSON geometry type defined in .type. | array (number, array (number)) | |
spatial.centroid | Centroid | The coordinates of the centre of the geographical area. | array (number) | |
Category | Code | Title | Description | Source | URI Pattern |
---|---|---|---|---|---|
Subnational | ISO 3166-2 | ISO Country Subdivision Codes | ISO codes for identifying the principal subdivisions (e.g. provinces or states) of all countries coded in ISO 3166-1. | https://www.iso.org/standard/72483.html | |
Subnational | NUTS | EU Nomenclature of Territorial Units for Statistics | The Nomenclature of Territorial Units for Statistics (NUTS) was established by Eurostat in order to provide a single uniform breakdown of territorial units for the production of regional statistics for the European Union. | https://ec.europa.eu/eurostat/web/nuts/linked-open-data | http://data.europa.eu/nuts/code/ |
National | ISO 3166-1 alpha-2 | ISO Country Codes | ISO 2-Digit Country Codes | https://www.iso.org/iso-3166-country-codes.html | |
Universal | GEONAMES | GeoNames | GeoNames provides numerical identifiers for many points of interest around the world, including administrative divisions, populated centres and other locations, embedded within a structured tree of geographic relations. | https://www.geonames.org/ | https://www.geonames.org/ |
Universal | OSMN | OpenStreetMap Node | OpenStreetMap Nodes consist of a single point in space defined by a latitude, longitude and node ID. Nodes might have tags to indicate the particular geographic feature they represent. | https://www.openstreetmap.org/node/ | |
Universal | OSMR | OpenStreetMap Relation | Relations are used to model logical (and usually local) or geographic relationships between objects. In practice, boundaries of geographic areas are available as Relations in OpenStreetMap. | https://wiki.openstreetmap.org/wiki/Relation | https://www.openstreetmap.org/relation/ |
Code | Title | Description | Source |
---|---|---|---|
Point | Point | The 'coordinates' member is a single position. | https://tools.ietf.org/html/rfc7946#section-3.1 |
MultiPoint | MultiPoint | The 'coordinates' member is an array of positions. | https://tools.ietf.org/html/rfc7946#section-3.1 |
LineString | LineString | The 'coordinates' member is an array of two or more positions. | https://tools.ietf.org/html/rfc7946#section-3.1 |
MultiLineString | MultiLineString | The 'coordinates' member is an array of LineString coordinate arrays. | https://tools.ietf.org/html/rfc7946#section-3.1 |
Polygon | Polygon | The 'coordinates' member must be an array of linear ring coordinate arrays. | https://tools.ietf.org/html/rfc7946#section-3.1 |
MultiPolygon | MultiPolygon | The 'coordinates' member is an array of Polygon coordinate arrays. | https://tools.ietf.org/html/rfc7946#section-3.1 |
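
Because the geometryType codelist is closed, a consuming application could check `spatial.geometry` with a few lines of code. The sketch below is an assumption about how that might be done, not part of the proposal: `check_geometry` is a hypothetical name, and the ring checks follow the RFC 7946 rule that a linear ring has four or more positions and that its first and last positions are the same.

```python
GEOMETRY_TYPES = {
    "Point", "MultiPoint", "LineString",
    "MultiLineString", "Polygon", "MultiPolygon",
}


def check_geometry(geometry):
    """Return a list of problems found in a spatial.geometry object."""
    problems = []
    geometry_type = geometry.get("type")
    if geometry_type not in GEOMETRY_TYPES:
        problems.append(f"unknown geometry type: {geometry_type!r}")
    elif geometry_type == "Polygon":
        # RFC 7946: a linear ring has four or more positions and is closed.
        for index, ring in enumerate(geometry.get("coordinates", [])):
            if len(ring) < 4:
                problems.append(f"ring {index} has fewer than four positions")
            elif ring[0] != ring[-1]:
                problems.append(f"ring {index} is not closed")
    return problems


closed = {
    "type": "Polygon",
    "coordinates": [[[-10.0, -10.0], [10.0, -10.0], [10.0, 10.0], [-10.0, -10.0]]],
}
unclosed = {
    "type": "Polygon",
    "coordinates": [[[-10.0, -10.0], [10.0, -10.0], [10.0, 10.0], [-10.0, 10.0]]],
}
closed_problems = check_geometry(closed)
unclosed_problems = check_geometry(unclosed)
```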
Somewhat pathological because it mixes gazetteer schemes. I don't think we'd expect to see that in practice.
{
"spatial": {
"countries": [
"UG"
],
"gazetteerEntries": [
{
"id": "UG-102",
"scheme": "ISO-3166-2",
"description": "Kampala"
},
{
"id": "448224",
"scheme": "GEONAMES",
"description": "Wakiso",
"uri": "https://sws.geonames.org/448224"
}
],
"bbox": [
-10.0,
-10.0,
10.0,
10.0
],
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
-10.0,
-10.0
],
[
10.0,
-10.0
],
[
10.0,
10.0
],
[
-10.0,
10.0
],
[
-10.0,
-10.0
]
]
]
},
"centroid": [
0,
0
]
}
}
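
If a consuming application wanted to notice records that mix gazetteer schemes like the one above, a check along these lines might be enough. This is a sketch only; `gazetteer_schemes` is a hypothetical helper and nothing in the proposal requires entries to share a scheme.

```python
def gazetteer_schemes(spatial):
    """Return the sorted list of schemes declared in spatial.gazetteerEntries."""
    return sorted(
        {
            entry["scheme"]
            for entry in spatial.get("gazetteerEntries", [])
            if "scheme" in entry
        }
    )


spatial = {
    "countries": ["UG"],
    "gazetteerEntries": [
        {"id": "UG-102", "scheme": "ISO-3166-2", "description": "Kampala"},
        {"id": "448224", "scheme": "GEONAMES", "description": "Wakiso"},
    ],
}
schemes = gazetteer_schemes(spatial)
# True when entries are drawn from more than one gazetteer.
mixes_schemes = len(schemes) > 1
```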
{
"spatial": {
"countries": [
"UG"
],
"gazetteerEntries": [
{
"description": "Kampala"
},
{
"description": "Wakiso"
}
]
}
}
If that all sounds good, the remaining question is which subnational gazetteer to recommend. Given the context in @matamadio's last update that catalog tooling already supports ISO 3166-2, I think that should be the recommendation, on the understanding that other gazetteers can be used if a publisher wants to express more granular administrative levels or simply prefers a different codelist. We can perhaps include something in the documentation to that effect.
Thanks, agree on the plan and on ISO 3166-2.
closed by #105
What is the context or reason for the change?
There is a need to determine the location covered by the data e.g. "Afghanistan", "Kampala", "South-East Asia". This can include multi-country, regions or sub-country regions.
What is your proposed change?
The proposal is to create a `countries` field (ISO 3166-1 alpha-2) to list all the countries covered by the dataset and to rename `geo_coverage` to `subnational_coverage` with the description ‘Locations covered by the data at a sub-national level, e.g. specific cities or states.’ `subnational_coverage` would be an optional field.
Why is this not covered by the existing model?
There is currently a `geo_coverage` field to give the ‘ISO codes of countries covered by the dataset.’
Can you provide an example?