GFDRR / rdl-standard

The Risk Data Library Standard (RDLS) is an open data standard to make it easier to work with disaster and climate risk data. It provides a common description of the data used and produced in risk assessments, including hazard, exposure, vulnerability, and modelled loss, or impact, data.
https://docs.riskdatalibrary.org/
Creative Commons Attribution Share Alike 4.0 International

[Proposal] Location #52

Closed odscrachel closed 1 year ago

odscrachel commented 1 year ago

What is the context or reason for the change?

There is a need to identify the location covered by the data, e.g. "Afghanistan", "Kampala", "South-East Asia". This can include multiple countries, regions or subnational areas.

What is your proposed change?

The proposal is to create a countries field (ISO 3166-1 alpha-2 codes) to list all the countries covered by the dataset, and to rename geo_coverage to subnational_coverage with the description ‘Locations covered by the data at a sub-national level, e.g. specific cities or states.’ subnational_coverage would be an optional field.

Why is this not covered by the existing model?

There is currently a geo_coverage field to give the ‘ISO codes of countries covered by the dataset.’

Can you provide an example?


"countries": [
  "UG"
],
"subnational_coverage": [
  "Kampala"
],
duncandewhurst commented 1 year ago

DCAT has a spatial/geographic coverage property, which we should check our modelling against.

matamadio commented 1 year ago

Originally we considered DCAT standard for framing the fields; the idea of the location attribute was precisely this:

[image]

However, this should be secondary to the countries specification.

They suggest using a GeoNames URL instead of a simple location name. It might be a good idea; the alternative would be to use subnational unit codes, but they tend to change over time.

They use a specific class for location attributes:

[image]

Should we do the same? Or just a centroid point coordinate field?

stufraser1 commented 1 year ago

The structure proposed by @odscrachel makes sense. I think users are better served with ISO code for countries as suggested and name of the region or location for subnational level, rather than a code or url.

Seems sensible to recommend using names as provided in geonames or similar as a reference for region naming, but I think OK to recommend rather than prescribe this.

Question: can we include a string of multiple named regions at subnational level?

duncandewhurst commented 1 year ago

Countries

countries as an array of ISO-3166-1 alpha-2 codes looks good.

Subnational coverage

The problem with modelling subnational_coverage as a list of names without enforcing a particular codelist is that the names can't be validated, and consuming applications and users can't reliably use names to look up translations and other information such as population counts.

As for countries, ideally, we would choose an authoritative codelist for subnational_coverage and leave looking up code labels to consuming applications. For example, if we used ISO 3166-2, a dataset covering Kampala and Wakiso would look like this:

{
  "id": "1",
  "countries": [
    "UG"
  ],
  "subnational_coverage": [
    "UG-102",
    "UG-113"
  ]
}
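
As a rough illustration of what coded values make possible for consuming applications, here is a minimal Python sketch that checks whether each subdivision code belongs to one of the listed countries. It assumes the ISO 3166-2 convention that a subdivision code begins with its country's alpha-2 code followed by a hyphen. The function name `subdivisions_outside_countries` is hypothetical and not part of RDLS.

```python
def subdivisions_outside_countries(dataset):
    """Return subdivision codes whose country prefix is not listed in `countries`."""
    countries = set(dataset.get("countries", []))
    return [
        code
        for code in dataset.get("subnational_coverage", [])
        # Assumption: ISO 3166-2 codes look like "<alpha-2>-<suffix>", e.g. "UG-102".
        if code.split("-", 1)[0] not in countries
    ]


dataset = {"id": "1", "countries": ["UG"], "subnational_coverage": ["UG-102", "UG-113"]}
print(subdivisions_outside_countries(dataset))  # prints []
```

A check like this is not possible when subnational_coverage holds free-text names only.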

This is a good approach because:

| id | countries | subnational_coverage |
| --- | --- | --- |
| 1 | UG | UG-102,UG-113 |

However, the challenges to this approach are:

A more flexible approach would be to allow publishers to use their own classifications, to disclose the gazetteer from which the names are drawn, and to include the code's labels, e.g. a publisher using ISO 3166-2 would look like this:

{
  "id": "1",
  "countries": [
    "UG"
  ],
  "subnational_coverage": [
    {
      "id": "UG-102",
      "scheme": "ISO-3166-2",
      "description": "Kampala"      
    },
    {
      "id": "UG-113",
      "scheme": "ISO-3166-2",
      "description": "Wakiso"      
    }
  ]
}

Another publisher could use Geonames:

{
  "id": "1",
  "countries": [
    "UG"
  ],
  "subnational_coverage": [
    {
      "description": "Kampala",
      "id": "232422",
      "scheme": "GEONAMES",
      "uri": "https://sws.geonames.org/232422"
    },
    {
      "description": "Wakiso",
      "id": "448224",
      "scheme": "GEONAMES",
      "uri": "https://sws.geonames.org/448224"
    }
  ]
}

We could allow publishers to 'fall back' to only providing names:

{
  "id": "1",
  "countries": [
    "UG"
  ],
  "subnational_coverage": [
    {
      "description": "Kampala"
    },
    {
      "description": "Wakiso"
    }
  ]
}

However, there are some challenges with this approach, too:

| id | countries |
| --- | --- |
| 1 | UG |

| id | subnational_coverage/0/description |
| --- | --- |
| 1 | Kampala |
| 1 | Wakiso |
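
To make the tabular consequence concrete, the following Python sketch splits one dataset with object-valued subnational_coverage into a main row and one child row per entry, mirroring the two tables above. The function name `flatten` is hypothetical, and the child column name is simply copied from the table above rather than taken from any tool's documented output.

```python
def flatten(dataset):
    """Split one dataset into a main row and one child row per subnational entry."""
    main_row = {
        "id": dataset["id"],
        "countries": ",".join(dataset.get("countries", [])),
    }
    child_rows = [
        # Each child row repeats the dataset id so it can be joined back to the main row.
        {"id": dataset["id"], "subnational_coverage/0/description": entry.get("description", "")}
        for entry in dataset.get("subnational_coverage", [])
    ]
    return main_row, child_rows


dataset = {
    "id": "1",
    "countries": ["UG"],
    "subnational_coverage": [{"description": "Kampala"}, {"description": "Wakiso"}],
}
main_row, child_rows = flatten(dataset)
print(main_row)
for row in child_rows:
    print(row)
```

With a plain array of codes, everything fits in the single main row; with objects, users of the tabular format need to work with a second table.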

DCAT conformance

If we choose to target DCAT conformance, then, in addition to countries and subnational_coverage, I think we would want to introduce a spatial object:

| Field | Description |
| --- | --- |
| spatial | The geographical area covered by the dataset. |
| spatial.gazetteer | An entry from a geographical index or directory representing the spatial area. |
| spatial.bbox | A geographic bounding box delimiting the spatial area. |
| spatial.geometry | A set of coordinates denoting the vertices of the spatial area. |
| spatial.centroid | The coordinates of the centre of the spatial area. |

{
  "spatial": {
    "gazetteer": {
      "id": "232422",
      "scheme": "GEONAMES",
      "description": "Kampala",
      "uri": "https://sws.geonames.org/232422"
    },
    "bbox": [
      -10.0,
      -10.0,
      10.0,
      10.0
    ],
    "geometry": {
      "type": "Polygon",
      "coordinates": [
        [
          [
            -10.0,
            -10.0
          ],
          [
            10.0,
            -10.0
          ],
          [
            10.0,
            10.0
          ],
          [
            -10.0,
            -10.0
          ]
        ]
      ]
    },
    "centroid": [
      0,
      0
    ]
  }
}
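
For illustration, when a publisher does provide .geometry, a consuming application could derive .bbox from it rather than requiring both. This is a minimal Python sketch, assuming the bounding box follows the GeoJSON ordering [west, south, east, north], which this thread does not specify. The function name `bbox_from_polygon` is hypothetical.

```python
def bbox_from_polygon(geometry):
    """Compute [min x, min y, max x, max y] over all rings of a Polygon geometry."""
    positions = [position for ring in geometry["coordinates"] for position in ring]
    xs = [position[0] for position in positions]
    ys = [position[1] for position in positions]
    return [min(xs), min(ys), max(xs), max(ys)]


# The polygon from the example above.
geometry = {
    "type": "Polygon",
    "coordinates": [[[-10.0, -10.0], [10.0, -10.0], [10.0, 10.0], [-10.0, -10.0]]],
}
print(bbox_from_polygon(geometry))  # prints [-10.0, -10.0, 10.0, 10.0]
```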
matamadio commented 1 year ago

The flexible approach using name/geonames would be the best on the user side probably. Adding the spatial attributes could resolve any possible misinterpretation. Do we need to validate this input?

ISO codes for the subnational level: that should be a standard, but is it really? I tend to always add those fields, but often end up using datasets that apply different codelists, or older ISO codes that have since changed. Also, these are limited to ADM level 1, while data might go down to ADM2, 3 or 4.

stufraser1 commented 1 year ago

Agree, validation at this level is not usually needed; the biggest requirement is being descriptive about what areas are covered, so the flexible approach is preferred. I think the gazetteer entry under DCAT suffices. I don't think we want to compel users to include bbox and centroid for every ADM1 unit in the analysis: bbox and centroid for an event footprint or overall analysis domain are useful, but describing every ADM unit in the analysis is asking too much and not useful.

matamadio commented 1 year ago

> Agree, validation at this level is not usually needed, the biggest requirement is being descriptive about what areas are covered so flexible approach is preferred. I think the gazetteer entry under DCAT suffices, I don't think we want to compel users to include bbox, centroid for every ADM1 unit in the analysis - bbox and centroid for an event footprint or overall analysis domain is useful, but to describe every ADM unit in the analysis is asking too much and not useful.

Agree, this is too much detail to add if it applies to each individual location/resource, and it requires spatial tooling to get the bbox or centroid. I believe the bbox for the main dataset can be shown automatically in the consuming app based on the country's ISO alpha-2 code, as in the current DDH?

duncandewhurst commented 1 year ago

Yes, as I mentioned in https://github.com/GFDRR/rdl-standard/issues/52#issuecomment-1562089134, ISO 3166-2 is only for principal subdivisions (i.e. ADM-1) and different publishers might use different codelists anyway. As such, I agree that the flexible approach is preferable, even if it comes with some added complexity compared to mandating a particular scheme.

Regarding validation, given that the flexible approach is preferred, the ideal scenario would be to choose a recommended scheme, and implement validation of subnational_coverage.id and subnational_coverage.description when that scheme is declared in subnational_coverage.scheme. That type of validation is not supported by JSON Schema so we would need to implement an additional validation check (not a problem - we do similar things in different versions of CoVE for several other standards). However, depending on the chosen scheme it might be complex to get the codelist to validate against and to keep it up to date. For example, to get a machine-readable version of the ISO-3166-2 codelist, we would either need to buy the country codes collection from ISO ($330 USD per annum), scrape the data from the Online Browsing Platform (not sure of licensing implications) or use something like the iso-3166 node package (not sure how reliable that is).
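
To sketch what such an additional check might look like (this is not how CoVE implements it), the Python below validates entries only when they declare the recommended scheme and skips everything else. `RECOMMENDED_SCHEME`, `CODELIST` and `check_entries` are hypothetical names, and the two-entry codelist is a stand-in built from the examples in this thread, not a real ISO 3166-2 extract.

```python
RECOMMENDED_SCHEME = "ISO-3166-2"

# Hypothetical stand-in for a machine-readable codelist: code -> label.
CODELIST = {"UG-102": "Kampala", "UG-113": "Wakiso"}


def check_entries(entries):
    """Return error messages for entries that declare the recommended scheme."""
    errors = []
    for index, entry in enumerate(entries):
        if entry.get("scheme") != RECOMMENDED_SCHEME:
            continue  # other schemes and name-only entries are left unvalidated
        code = entry.get("id")
        if code not in CODELIST:
            errors.append(f"entry {index}: unknown id {code!r}")
        elif entry.get("description") not in (None, CODELIST[code]):
            errors.append(f"entry {index}: description does not match {code}")
    return errors


entries = [
    {"id": "UG-102", "scheme": "ISO-3166-2", "description": "Kampala"},
    {"id": "UG-999", "scheme": "ISO-3166-2", "description": "Nowhere"},
    {"description": "Wakiso"},
]
print(check_entries(entries))  # prints ["entry 1: unknown id 'UG-999'"]
```

The hard part is not the check itself but obtaining and maintaining the real codelist behind `CODELIST`.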

So maybe we don't worry about validation and leave it up to consuming applications. We can reuse the location gazetteers codelist from OCDS as an open codelist for .scheme so that consuming applications can reliably identify the gazetteer from which the identifiers are drawn.

Regarding changes, if a scheme never reuses old codes for new geographies, then we can omit any version identifier from scheme. If a scheme reuses old codes for new geographies, then we'd need to include a version identifier.

> I don't think we want to compel users to include bbox, centroid for every ADM1 unit in the analysis - bbox and centroid for an event footprint or overall analysis domain is useful, but to describe every ADM unit in the analysis is asking too much and not useful.

Agreed, that's in line with my proposal in https://github.com/GFDRR/rdl-standard/issues/52#issuecomment-1562089134, which has spatial as an object (not an array) at the dataset level so just one .gazetteer, .bbox, .geometry and .centroid per dataset. If, as in that proposal, we only had the spatial object then .gazetteer would need to describe the overall spatial coverage of the dataset, i.e. if a dataset covered both Kampala and Wakiso there would need to be a suitable gazetteer entry covering both regions. I don't think such entries exist, hence having the dataset-level subnational_coverage field, which is also a list of gazetteer entries, to provide for listing more than one subnational region.

On reflection, I think it might be cleaner to put everything under spatial and to make spatial.gazetteers an array. That will also make for easier reuse of spatial when we want to describe spatial coverage at other levels in the standard.

Revised proposal

Fields

| Field | Title | Description | Type | Format |
| --- | --- | --- | --- | --- |
| spatial | Spatial coverage | The geographical area covered by the dataset. | object | |
| spatial.countries | Countries | The countries covered by the geographical area, from the open country codelist (ISO 3166-1). | array (string) | |
| spatial.gazetteerEntries | Gazetteer entries | Entries from geographical indices or directories describing the geographical area. This field should be used to describe subnational coverage. Use of [to be decided] is recommended. | array (object) | |
| spatial.gazetteerEntries.id | Identifier | An identifier drawn from the gazetteer identified in .scheme. | string | |
| spatial.gazetteerEntries.scheme | Scheme | The gazetteer from which the entry is drawn, from the open locationGazetteers codelist. | string | |
| spatial.gazetteerEntries.description | Description | A description for the gazetteer entry. | string | |
| spatial.gazetteerEntries.uri | Uniform resource identifier | A URI for the gazetteer entry. | string | iri |
| spatial.bbox | Bounding box | A geographic bounding box delimiting the geographical area. | array (number) | |
| spatial.geometry | Geometry | A set of coordinates denoting the vertices of the geographical area. | object | |
| spatial.geometry.type | Type | The GeoJSON geometry type that is described by .coordinates, from the closed geometryType codelist. | string | |
| spatial.geometry.coordinates | Coordinates | One or more GeoJSON positions according to the GeoJSON geometry type defined in .type. | array (number, array (number)) | |
| spatial.centroid | Centroid | The coordinates of the centre of the geographical area. | array (number) | |

Codelists

locationGazetteers

| Category | Code | Title | Description | Source | URI Pattern |
| --- | --- | --- | --- | --- | --- |
| Subnational | ISO 3166-2 | ISO Country Subdivision Codes | ISO codes for identifying the principal subdivisions (e.g. provinces or states) of all countries coded in ISO 3166-1. | https://www.iso.org/standard/72483.html | |
| Subnational | NUTS | EU Nomenclature of Territorial Units for Statistics | The Nomenclature of Territorial Units for Statistics (NUTS) was established by Eurostat in order to provide a single uniform breakdown of territorial units for the production of regional statistics for the European Union. | https://ec.europa.eu/eurostat/web/nuts/linked-open-data | http://data.europa.eu/nuts/code/ |
| National | ISO 3166-1 alpha-2 | ISO Country Codes | ISO two-letter country codes. | https://www.iso.org/iso-3166-country-codes.html | |
| Universal | GEONAMES | GeoNames | GeoNames provides numerical identifiers for many points of interest around the world, including administrative divisions, populated centres and other locations, embedded within a structured tree of geographic relations. | https://www.geonames.org/ | https://www.geonames.org/ |
| Universal | OSMN | OpenStreetMap Node | OpenStreetMap Nodes consist of a single point in space defined by a latitude, longitude and node ID. Nodes might have tags to indicate the particular geographic feature they represent. | | https://www.openstreetmap.org/node/ |
| Universal | OSMR | OpenStreetMap Relation | Relations are used to model logical (and usually local) or geographic relationships between objects. In practice, boundaries of geographic areas are available as Relations in OpenStreetMap. | https://wiki.openstreetmap.org/wiki/Relation | https://www.openstreetmap.org/relation/ |
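
As a sketch of how a consuming application might use the URI Pattern column, the Python below builds a link for an entry that has no explicit .uri. It assumes that each pattern is a prefix to which the entry's .id is appended, which the table above does not state explicitly. `URI_PATTERNS` and `entry_uri` are hypothetical names.

```python
# URI patterns copied from the locationGazetteers table above.
URI_PATTERNS = {
    "NUTS": "http://data.europa.eu/nuts/code/",
    "GEONAMES": "https://www.geonames.org/",
    "OSMN": "https://www.openstreetmap.org/node/",
    "OSMR": "https://www.openstreetmap.org/relation/",
}


def entry_uri(entry):
    """Return the entry's own URI, or one built from its scheme and id, or None."""
    if entry.get("uri"):
        return entry["uri"]
    pattern = URI_PATTERNS.get(entry.get("scheme"))
    if pattern and entry.get("id"):
        # Assumption: the pattern is a prefix and the id is appended to it.
        return pattern + entry["id"]
    return None


print(entry_uri({"id": "232422", "scheme": "GEONAMES", "description": "Kampala"}))
# prints https://www.geonames.org/232422
```

Schemes without a pattern, such as ISO 3166-2, yield no link, so applications would fall back to showing the description.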

geometryType

| Code | Title | Description | Source |
| --- | --- | --- | --- |
| Point | Point | The 'coordinates' member is a single position. | https://tools.ietf.org/html/rfc7946#section-3.1 |
| MultiPoint | MultiPoint | The 'coordinates' member is an array of positions. | https://tools.ietf.org/html/rfc7946#section-3.1 |
| LineString | LineString | The 'coordinates' member is an array of two or more positions. | https://tools.ietf.org/html/rfc7946#section-3.1 |
| MultiLineString | MultiLineString | The 'coordinates' member is an array of LineString coordinate arrays. | https://tools.ietf.org/html/rfc7946#section-3.1 |
| Polygon | Polygon | The 'coordinates' member must be an array of linear ring coordinate arrays. | https://tools.ietf.org/html/rfc7946#section-3.1 |
| MultiPolygon | MultiPolygon | The 'coordinates' member is an array of Polygon coordinate arrays. | https://tools.ietf.org/html/rfc7946#section-3.1 |
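
Because geometryType is a closed codelist, it can be checked directly. The Python sketch below does that and, for Polygon only, also checks the RFC 7946 rule that a linear ring has at least four positions with the first and last identical. `GEOMETRY_TYPES` and `geometry_errors` are hypothetical names, and the other geometry types are deliberately not checked in depth.

```python
GEOMETRY_TYPES = {
    "Point", "MultiPoint", "LineString",
    "MultiLineString", "Polygon", "MultiPolygon",
}


def geometry_errors(geometry):
    """Return error messages for an unknown type or an unclosed Polygon ring."""
    errors = []
    geometry_type = geometry.get("type")
    if geometry_type not in GEOMETRY_TYPES:
        errors.append(f"type {geometry_type!r} is not in the geometryType codelist")
    elif geometry_type == "Polygon":
        for index, ring in enumerate(geometry.get("coordinates", [])):
            # RFC 7946: a linear ring is closed and has four or more positions.
            if len(ring) < 4 or ring[0] != ring[-1]:
                errors.append(f"ring {index} is not a closed linear ring")
    return errors


closed = {"type": "Polygon", "coordinates": [[[-10.0, -10.0], [10.0, -10.0], [10.0, 10.0], [-10.0, -10.0]]]}
print(geometry_errors(closed))  # prints []
```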

Examples

Full

Somewhat pathological because it mixes gazetteer schemes. I don't think we'd expect to see that in practice.

{
  "spatial": {
    "countries": [
      "UG"
    ],
    "gazetteerEntries": [
      {
        "id": "UG-102",
        "scheme": "ISO-3166-2",
        "description": "Kampala"
      },
      {
        "id": "448224",
        "scheme": "GEONAMES",
        "description": "Wakiso",
        "uri": "https://sws.geonames.org/448224"
      }
    ],
    "bbox": [
      -10.0,
      -10.0,
      10.0,
      10.0
    ],
    "geometry": {
      "type": "Polygon",
      "coordinates": [
        [
          [
            -10.0,
            -10.0
          ],
          [
            10.0,
            -10.0
          ],
          [
            10.0,
            10.0
          ],
          [
            -10.0,
            -10.0
          ]
        ]
      ]
    },
    "centroid": [
      0,
      0
    ]
  }
}

Simple

{
  "spatial": {
    "countries": [
      "UG"
    ],
    "gazetteerEntries": [
      {
        "description": "Kampala"
      },
      {
        "description": "Wakiso"
      }
    ]
  }
}

Outstanding questions

If that all sounds good, the remaining question is which subnational gazetteer to recommend. Given the context in @matamadio's last update that catalog tooling already supports ISO 3166-2, I think that should be the recommendation, on the understanding that other gazetteers can be used if a publisher wants to express more granular administrative levels, or if they just want to use a different codelist for some reason. We can perhaps include something in the documentation to that effect.

matamadio commented 1 year ago

Thanks, agree on the plan and on ISO 3166-2.

odscjen commented 1 year ago

closed by #105