odscrachel commented 1 year ago

What is the context or reason for the change?

There is a need to determine the location covered by the data e.g. "Afghanistan", "Kampala", "South-East Asia". This can include multi-country, regions or sub-country regions.

What is your proposed change?

The proposal is to create a countries field (ISO 3166-1 alpha-2) to list all the countries covered by the dataset, and to rename geo_coverage to subnational_coverage with the description ‘Locations covered by the data at a sub-national level, e.g. specific cities or states.’ subnational_coverage would be an optional field.

Why is this not covered by the existing model?

There is currently a geo_coverage field to give the ‘ISO codes of countries covered by the dataset.’

Can you provide an example?


"countries": [
  "UG"
],
"subnational_coverage": [
  "Kampala"
],

duncandewhurst commented 1 year ago

DCAT has a spatial/geographic coverage property, which we should check our modelling against.

matamadio commented 1 year ago

Originally we considered DCAT standard for framing the fields; the idea of the location attribute was precisely this:

[image]

However, this should be secondary to the countries specification.

They suggest using a GeoNames URL instead of a plain location name. That might be a good idea; the alternative would be to use subnational unit codes, but they tend to change over time.

They use a specific class for location attributes:

[image]

Should we do the same? Or just a centroid point coordinate field?

stufraser1 commented 1 year ago

The structure proposed by @odscrachel makes sense. I think users are better served with ISO codes for countries, as suggested, and the name of the region or location at subnational level, rather than a code or URL.

Seems sensible to recommend using names as provided in geonames or similar as a reference for region naming, but I think OK to recommend rather than prescribe this.

Question: can we include a string of multiple named regions at subnational level?

duncandewhurst commented 1 year ago

Countries

countries as an array of ISO-3166-1 alpha-2 codes looks good.

Subnational coverage

The problem with modelling subnational_coverage as a list of names without enforcing a particular codelist is that the names can't be validated, and consuming applications and users can't reliably use names to look up translations or other information such as population counts.

As for countries, ideally, we would choose an authoritative codelist for subnational_coverage and leave looking up code labels to consuming applications. For example, if we used ISO 3166-2, a dataset covering Kampala and Wakiso would look like this:

{
  "id": "1",
  "countries": [
    "UG"
  ],
  "subnational_coverage": [
    "UG-102",
    "UG-113"
  ]
}

This is a good approach because:

it provides for easy and comprehensive validation
there's no opportunity for inconsistencies between codes and their labels
consuming applications and users can use the codes to look up translations of the labels and other information, like the subdivision's category, parent subdivision, population etc.
consuming applications and users need only deal with one type of code.
it flattens to a single table for spreadsheet users

id	countries	subnational_coverage
1	UG	UG-102,UG-113
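
The single-table flattening can be sketched in Python. This is only a minimal illustration, assuming array values are serialised as comma-joined strings; `flatten_dataset` is a hypothetical helper and not part of any existing tool.

```python
import csv
import io


def flatten_dataset(dataset):
    """Flatten a dataset whose array fields hold plain strings into one row.

    Assumption: arrays are joined with commas. This helper is illustrative only.
    """
    row = {}
    for key, value in dataset.items():
        row[key] = ",".join(value) if isinstance(value, list) else value
    return row


dataset = {
    "id": "1",
    "countries": ["UG"],
    "subnational_coverage": ["UG-102", "UG-113"],
}

row = flatten_dataset(dataset)

# Write the row as tab-separated text for spreadsheet users.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=list(row), delimiter="\t", lineterminator="\n"
)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Because every value is a string or a list of strings, one row per dataset is enough and no child tables are needed.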

However, the challenges to this approach are:

it's extra work for users and consuming applications to look up code labels
the choice of codelist might restrict publishers to a particular level of subnational administrative division, for example, ISO 3166-2 only describes principal subdivisions (typically regions), not localities or other levels.
the choice of codelist might differ from the list of subnational administrative divisions used by publishers, for example, EU publishers might use NUTS

A more flexible approach would be to allow publishers to use their own classifications, to disclose the gazetteer from which the names are drawn, and to include the code's labels, e.g. a publisher using ISO 3166-2 would look like this:

{
  "id": "1",
  "countries": [
    "UG"
  ],
  "subnational_coverage": [
    {
      "id": "UG-102",
      "scheme": "ISO-3166-2",
      "description": "Kampala"      
    },
    {
      "id": "UG-113",
      "scheme": "ISO-3166-2",
      "description": "Wakiso"      
    }
  ]
}

Another publisher could use Geonames:

{
  "id": "1",
  "countries": [
    "UG"
  ],
  "subnational_coverage": [
    {
      "description": "Kampala",
      "id": "232422",
      "scheme": "GEONAMES",
      "uri": "https://sws.geonames.org/232422"
    },
    {
      "description": "Wakiso",
      "id": "448224",
      "scheme": "GEONAMES",
      "uri": "https://sws.geonames.org/448224"
    }
  ]
}

We could allow publishers to 'fall back' to only providing names:

{
  "id": "1",
  "countries": [
    "UG"
  ],
  "subnational_coverage": [
    {
      "description": "Kampala"
    },
    {
      "description": "Wakiso"
    }
  ]
}

However, there are some challenges with this approach, too:

if we allow any scheme, the data can't be validated
if we allow a particular set of schemes, validation gets complicated
the fall-back option has the same challenges as modelling as a list of names
the flattened data requires multiple tables:

id	countries
1	UG

id	subnational_coverage/0/description
1	Kampala
1	Wakiso
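
Why this model needs more than one table can be sketched as follows. The column header `subnational_coverage/0/description` is taken from the tables above; the `flatten_to_tables` helper is hypothetical and only illustrates the parent/child split.

```python
def flatten_to_tables(dataset):
    """Split a dataset into a main table and a child table of gazetteer entries.

    Assumption: child rows repeat the parent `id` so the tables can be joined.
    """
    main_rows = [
        {"id": dataset["id"], "countries": ",".join(dataset["countries"])}
    ]
    child_rows = [
        {
            "id": dataset["id"],
            "subnational_coverage/0/description": entry["description"],
        }
        for entry in dataset.get("subnational_coverage", [])
    ]
    return main_rows, child_rows


dataset = {
    "id": "1",
    "countries": ["UG"],
    "subnational_coverage": [
        {"description": "Kampala"},
        {"description": "Wakiso"},
    ],
}

main_rows, child_rows = flatten_to_tables(dataset)
print(main_rows)
print(child_rows)
```

Each object in the array becomes its own row, so the child table grows with the number of entries while the main table stays at one row per dataset.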

DCAT conformance

If we choose to target DCAT conformance, then, in addition to countries and subnational_coverage, I think we would want to introduce a spatial object:

Field	Description
spatial	The geographical area covered by the dataset.
spatial.gazetteer	An entry from a geographical index or directory representing the spatial area.
spatial.bbox	A geographic bounding box delimiting the spatial area.
spatial.geometry	A set of coordinates denoting the vertices of the spatial area.
spatial.centroid	The coordinates of the centre of the spatial area.

{
  "spatial": {
    "gazetteer": {
      "id": "232422",
      "scheme": "GEONAMES",
      "description": "Kampala",
      "uri": "https://sws.geonames.org/232422"
    },
    "bbox": [
      -10.0,
      -10.0,
      10.0,
      10.0
    ],
    "geometry": {
      "type": "Polygon",
      "coordinates": [
        [
          [
            -10.0,
            -10.0
          ],
          [
            10.0,
            -10.0
          ],
          [
            10.0,
            10.0
          ],
          [
            -10.0,
            10.0
          ],
          [
            -10.0,
            -10.0
          ]
        ]
      ]
    },
    "centroid": [
      0,
      0
    ]
  }
}

matamadio commented 1 year ago

The flexible approach using name/geonames would be the best on the user side probably. Adding the spatial attributes could resolve any possible misinterpretation. Do we need to validate this input?

ISO codes for subnational level: that should be a standard, but is it really? I tend to always add those fields, but often end up using datasets that apply different codelists, or older ISO codes that have since changed. Also, these are limited to ADM level 1, while data might go down to ADM2, 3 or 4.

stufraser1 commented 1 year ago

Agree, validation at this level is not usually needed; the biggest requirement is being descriptive about what areas are covered, so the flexible approach is preferred. I think the gazetteer entry under DCAT suffices. I don't think we want to compel users to include bbox and centroid for every ADM1 unit in the analysis: bbox and centroid for an event footprint or overall analysis domain are useful, but describing every ADM unit in the analysis is asking too much and is not useful.

matamadio commented 1 year ago

> Agree, validation at this level is not usually needed, the biggest requirement is being descriptive about what areas are covered so flexible approach is preferred. I think the gazetteer entry under DCAT suffices, I don't think we want to compel users to include bbox, centroid for every ADM1 unit in the analysis - bbox and centroid for an event footprint or overall analysis domain is useful, but to describe every ADM unit in the analysis is asking too much and not useful.

Agree, too much detail to add if this applies to each individual location/resources; requires spatial tooling to get the bbox or centroid. I believe the BBox related to the main dataset can be automatically shown in consuming app based on country ISO-a2, as in the current DDH?
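
For simple geometries, a bounding box can be derived from the coordinates alone. The sketch below is a rough illustration in Python, assuming planar longitude/latitude positions and an area that does not cross the antimeridian; `ring_bbox` and `bbox_centre` are hypothetical helpers, and the centre of the box is not a true area-weighted centroid.

```python
def ring_bbox(ring):
    """Return [min_x, min_y, max_x, max_y] for a list of [x, y] positions."""
    xs = [position[0] for position in ring]
    ys = [position[1] for position in ring]
    return [min(xs), min(ys), max(xs), max(ys)]


def bbox_centre(bbox):
    """Return the centre point of a bounding box (not an area-weighted centroid)."""
    min_x, min_y, max_x, max_y = bbox
    return [(min_x + max_x) / 2, (min_y + max_y) / 2]


# An illustrative closed ring: the first and last positions are identical.
exterior_ring = [
    [-10.0, -10.0],
    [10.0, -10.0],
    [10.0, 10.0],
    [-10.0, 10.0],
    [-10.0, -10.0],
]

bbox = ring_bbox(exterior_ring)
centre = bbox_centre(bbox)
print(bbox, centre)
```

Real administrative boundaries would still need proper spatial tooling, which is the concern raised above.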

duncandewhurst commented 1 year ago

Yes, as I mentioned in https://github.com/GFDRR/rdl-standard/issues/52#issuecomment-1562089134, ISO 3166-2 is only for principal subdivisions (i.e. ADM-1) and different publishers might use different codelists anyway. As such, I agree that the flexible approach is preferable, even if it comes with some added complexity compared to mandating a particular scheme.

Regarding validation, given that the flexible approach is preferred, the ideal scenario would be to choose a recommended scheme, and implement validation of subnational_coverage.id and subnational_coverage.description when that scheme is declared in subnational_coverage.scheme. That type of validation is not supported by JSON Schema so we would need to implement an additional validation check (not a problem - we do similar things in different versions of CoVE for several other standards). However, depending on the chosen scheme it might be complex to get the codelist to validate against and to keep it up to date. For example, to get a machine-readable version of the ISO-3166-2 codelist, we would either need to buy the country codes collection from ISO ($330 USD per annum), scrape the data from the Online Browsing Platform (not sure of licensing implications) or use something like the iso-3166 node package (not sure how reliable that is).
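
The kind of additional check described above might look something like this in Python. It is a sketch under stated assumptions, not how CoVE implements anything: `RECOMMENDED_CODELIST` is a hypothetical two-entry excerpt built from the codes and labels used in the earlier examples, the label field is assumed to be `description` as in those examples, and `validate_entries` is an illustrative name.

```python
RECOMMENDED_SCHEME = "ISO-3166-2"

# Hypothetical excerpt; a real check would load the full, current codelist.
RECOMMENDED_CODELIST = {
    "UG-102": "Kampala",
    "UG-113": "Wakiso",
}


def validate_entries(entries):
    """Check id and description only for entries declaring the recommended scheme."""
    errors = []
    for index, entry in enumerate(entries):
        if entry.get("scheme") != RECOMMENDED_SCHEME:
            continue  # other schemes and name-only entries are not checked
        code = entry.get("id")
        if code not in RECOMMENDED_CODELIST:
            errors.append(f"entry {index}: unknown id {code!r}")
        elif entry.get("description") != RECOMMENDED_CODELIST[code]:
            errors.append(f"entry {index}: description does not match id {code!r}")
    return errors


entries = [
    {"id": "UG-102", "scheme": "ISO-3166-2", "description": "Kampala"},
    {"id": "232422", "scheme": "GEONAMES", "description": "Kampala"},
    {"id": "UG-113", "scheme": "ISO-3166-2", "description": "Kampala"},
    {"description": "Wakiso"},
]

errors = validate_entries(entries)
print(errors)
```

The hard part is not this logic but obtaining and maintaining the codelist itself, as noted above.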

So maybe we don't worry about validation and leave it up to consuming applications. We can reuse the location gazetteers codelist from OCDS as an open codelist for .scheme so that consuming applications can reliably identify the gazetteer from which the identifiers are drawn.

Regarding changes, if a scheme never reuses old codes for new geographies, then we can omit any version identifier from scheme. If a scheme reuses old codes for new geographies, then we'd need to include a version identifier.

> I don't think we want to compel users to include bbox, centroid for every ADM1 unit in the analysis - bbox and centroid for an event footprint or overall analysis domain is useful, but to describe every ADM unit in the analysis is asking too much and not useful.

Agreed, that's in line with my proposal in https://github.com/GFDRR/rdl-standard/issues/52#issuecomment-1562089134, which has spatial as an object (not an array) at the dataset level so just one .gazetteer, .bbox, .geometry and .centroid per dataset. If, as in that proposal, we only had the spatial object then .gazetteer would need to describe the overall spatial coverage of the dataset, i.e. if a dataset covered both Kampala and Wakiso there would need to be a suitable gazetteer entry covering both regions. I don't think such entries exist, hence having the dataset-level subnational_coverage field, which is also a list of gazetteer entries, to provide for listing more than one subnational region.

On reflection, I think it might be cleaner to put everything under spatial and to make spatial.gazetteers an array. That will also make for easier reuse of spatial when we want to describe spatial coverage at other levels in the standard.

Revised proposal

Fields

Field	Title	Description	Type	Format
`spatial`	Spatial coverage	The geographical area covered by the dataset.	object
`spatial.countries`	Countries	The countries covered by the geographical area, from the open country codelist (ISO 3166-1).	array (string)
`spatial.gazetteerEntries`	Gazetteer entries	Entries from geographical indices or directories describing the geographical area. This field should be used to describe subnational coverage. Use of is recommended.	array (object)
`spatial.gazetteerEntries.id`	Identifier	An identifier drawn from the gazetteer identified in `.scheme`.	string
`spatial.gazetteerEntries.scheme`	Scheme	The gazetteer from which the entry is drawn, from the open locationGazetteers codelist.	string
`spatial.gazetteerEntries.description`	Description	A description for the gazetteer entry.	string
`spatial.gazetteerEntries.uri`	Uniform resource identifier	A URI for the gazetteer entry.	string	iri
`spatial.bbox`	Bounding box	A geographic bounding box delimiting the geographical area.	array (number)
`spatial.geometry`	Geometry	A set of coordinates denoting the vertices of the geographical area.	object
`spatial.geometry.type`	Type	The GeoJSON geometry type that is described by `.coordinates`, from the closed geometryType codelist.	string
`spatial.geometry.coordinates`	Coordinates	One or more GeoJSON positions according to the GeoJSON geometry type defined in `.type`.	array (number, array (number))
`spatial.centroid`	Centroid	The coordinates of the centre of the geographical area.	array (number)

Codelists

locationGazetteers

Category	Code	Title	Description	Source	URI Pattern
Subnational	ISO 3166-2	ISO Country Subdivision Codes	ISO codes for identifying the principal subdivisions (e.g. provinces or states) of all countries coded in ISO 3166-1.	https://www.iso.org/standard/72483.html
Subnational	NUTS	EU Nomenclature of Territorial Units for Statistics	The Nomenclature of Territorial Units for Statistics (NUTS) was established by Eurostat in order to provide a single uniform breakdown of territorial units for the production of regional statistics for the European Union.	https://ec.europa.eu/eurostat/web/nuts/linked-open-data	http://data.europa.eu/nuts/code/
National	ISO 3166-1 alpha-2	ISO Country Codes	ISO two-letter country codes.	https://www.iso.org/iso-3166-country-codes.html
Universal	GEONAMES	GeoNames	GeoNames provides numerical identifiers for many points of interest around the world, including administrative divisions, populated centres and other locations, embedded within a structured tree of geographic relations.	https://www.geonames.org/	https://www.geonames.org/
Universal	OSMN	OpenStreetMap Node	OpenStreetMap Nodes consist of a single point in space defined by a latitude, longitude and node ID. Nodes might have tags to indicate the particular geographic feature they represent.		https://www.openstreetmap.org/node/
Universal	OSMR	OpenStreetMap Relation	Relations are used to model logical (and usually local) or geographic relationships between objects. In practice, boundaries of geographic areas are available as Relations in OpenStreetMap.	https://wiki.openstreetmap.org/wiki/Relation	https://www.openstreetmap.org/relation/
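
One possible use of the URI Pattern column is to build a link for an entry that has an `id` and `scheme` but no `uri`. The Python sketch below assumes that a pattern is a prefix to which the identifier is appended; the table does not state this, so treat it as an assumption. `entry_uri` is a hypothetical helper.

```python
# URI patterns copied from the codelist table above.
# Schemes without a pattern (the ISO codelists) are omitted.
URI_PATTERNS = {
    "NUTS": "http://data.europa.eu/nuts/code/",
    "GEONAMES": "https://www.geonames.org/",
    "OSMN": "https://www.openstreetmap.org/node/",
    "OSMR": "https://www.openstreetmap.org/relation/",
}


def entry_uri(entry):
    """Return the entry's own uri if present, else pattern + id, else None."""
    if entry.get("uri"):
        return entry["uri"]
    pattern = URI_PATTERNS.get(entry.get("scheme"))
    if pattern and entry.get("id"):
        return pattern + entry["id"]
    return None


print(entry_uri({"id": "232422", "scheme": "GEONAMES", "description": "Kampala"}))
print(entry_uri({"id": "UG-102", "scheme": "ISO-3166-2", "description": "Kampala"}))
```

A publisher-supplied `uri` always takes precedence, so the pattern is only a fallback.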

geometryType

Code	Title	Description	Source
Point	Point	The 'coordinates' member is a single position.	https://tools.ietf.org/html/rfc7946#section-3.1
MultiPoint	MultiPoint	The 'coordinates' member is an array of positions.	https://tools.ietf.org/html/rfc7946#section-3.1
LineString	LineString	The 'coordinates' member is an array of two or more positions.	https://tools.ietf.org/html/rfc7946#section-3.1
MultiLineString	MultiLineString	The 'coordinates' member is an array of LineString coordinate arrays.	https://tools.ietf.org/html/rfc7946#section-3.1
Polygon	Polygon	The 'coordinates' member must be an array of linear ring coordinate arrays.	https://tools.ietf.org/html/rfc7946#section-3.1
MultiPolygon	MultiPolygon	The 'coordinates' member is an array of Polygon coordinate arrays.	https://tools.ietf.org/html/rfc7946#section-3.1
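
Since geometryType is a closed codelist, a consuming application could reject unknown types and run basic structural checks. The Python sketch below is illustrative only: `geometry_errors` is a hypothetical helper, and the Polygon check reflects the RFC 7946 rule that a linear ring has four or more positions with identical first and last positions.

```python
GEOMETRY_TYPES = {
    "Point",
    "MultiPoint",
    "LineString",
    "MultiLineString",
    "Polygon",
    "MultiPolygon",
}


def geometry_errors(geometry):
    """Return a list of problems found in a GeoJSON-style geometry object."""
    errors = []
    geometry_type = geometry.get("type")
    if geometry_type not in GEOMETRY_TYPES:
        errors.append(f"unknown geometry type: {geometry_type!r}")
        return errors
    if geometry_type == "Polygon":
        # Each ring must be closed and have at least four positions.
        for index, ring in enumerate(geometry.get("coordinates", [])):
            if len(ring) < 4:
                errors.append(f"ring {index}: fewer than four positions")
            elif ring[0] != ring[-1]:
                errors.append(f"ring {index}: first and last positions differ")
    return errors


closed = {
    "type": "Polygon",
    "coordinates": [[[-10.0, -10.0], [10.0, -10.0], [10.0, 10.0], [-10.0, -10.0]]],
}
unclosed = {
    "type": "Polygon",
    "coordinates": [[[-10.0, -10.0], [10.0, -10.0], [10.0, 10.0], [-10.0, 10.0]]],
}

print(geometry_errors(closed))
print(geometry_errors(unclosed))
```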

Examples

Full

Somewhat pathological because it mixes gazetteer schemes. I don't think we'd expect to see that in practice.

{
  "spatial": {
    "countries": [
      "UG"
    ],
    "gazetteerEntries": [
      {
        "id": "UG-102",
        "scheme": "ISO-3166-2",
        "description": "Kampala"
      },
      {
        "id": "448224",
        "scheme": "GEONAMES",
        "description": "Wakiso",
        "uri": "https://sws.geonames.org/448224"
      }
    ],
    "bbox": [
      -10.0,
      -10.0,
      10.0,
      10.0
    ],
    "geometry": {
      "type": "Polygon",
      "coordinates": [
        [
          [
            -10.0,
            -10.0
          ],
          [
            10.0,
            -10.0
          ],
          [
            10.0,
            10.0
          ],
          [
            -10.0,
            10.0
          ],
          [
            -10.0,
            -10.0
          ]
        ]
      ]
    },
    "centroid": [
      0,
      0
    ]
  }
}

Simple

{
  "spatial": {
    "countries": [
      "UG"
    ],
    "gazetteerEntries": [
      {
        "description": "Kampala"
      },
      {
        "description": "Wakiso"
      }
    ]
  }
}

Outstanding questions

If that all sounds good, the remaining question is which subnational gazetteer to recommend. Given the context in @matamadio's last update that catalog tooling already supports ISO 3166-2, I think that should be the recommendation, on the understanding that other gazetteers can be used if a publisher wants to express more granular administrative levels or prefers a different codelist for some reason. We can perhaps include something in the documentation to that effect.

matamadio commented 1 year ago

Thanks, agree on the plan and on ISO 3166-2.

odscjen commented 1 year ago

closed by #105

GFDRR / rdl-standard

[Proposal] Location #52
