GFDRR / rdl-standard

The Risk Data Library Standard (RDLS) is an open data standard to make it easier to work with disaster and climate risk data. It provides a common description of the data used and produced in risk assessments, including hazard, exposure, vulnerability, and modelled loss, or impact, data.
https://docs.riskdatalibrary.org/
Creative Commons Attribution Share Alike 4.0 International

[Schema] Spatial coordinates format #197

Closed: matamadio closed this issue 4 months ago

matamadio commented 1 year ago

E.g. for bbox and geometry, pairs of W N numbers are expected, such as:

Global: -180 90; 180 90; 180 -90; -180 -90; -180 90

The description is:

Enter multiple values as a semicolon-separated list, e.g. a;b;c

Is a space the right character to use between W and N?

Also, the values must be entered as ="-180 90;180 90;180 -90;-180 -90;-180 90" to avoid Excel's "There's a problem with this formula" error.

Thus the bbox coordinates in the THA examples are wrong; I will fix those.

odscjen commented 1 year ago

For bbox you should only be using 4 values; see https://wiki.openstreetmap.org/wiki/Bounding_Box for an explanation of this data type.
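
For illustration, and assuming the (west, south, east, north) ordering described on that wiki page, a global extent would then be just four numbers:

[-180, -90, 180, 90]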

If your dataset is global, I'd recommend not including either bbox or geometry, as they're fairly meaningless in that context.

But looking at it, I think we've made a mistake in Geometry, as we currently have .coordinates as an array of numbers:

"coordinates": {
  "title": "Coordinates",
  "type": "array",
  "description": "One or more GeoJSON positions according to the GeoJSON geometry type defined in `.type`.",
    "items": {
      "type": [
        "number",
        "array"
      ],
      "minItems": 1
    },
    "minItems": 1
 }

but I think it should actually be an array of arrays of numbers to allow for the pairs, e.g.

"coordinates": {
  "title": "Coordinates",
  "description": "The relevant array of points, e.g. [longitude, latitude], or a nested array of points, for the GeoJSON geometry being described. The longitude and latitude must be expressed in decimal degrees in the WGS84 (EPSG:4326) projection",
    "type": [
       "array",
       "null"
    ],
    "items": {
      "type": [
         "number",
         "array"
      ],
    "minItems": 1
  },
  "minItems": 1
}
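
For illustration (not part of the proposal itself), the coordinates of a simple LineString would then be an array of [longitude, latitude] pairs, with deeper nesting for Polygons and MultiPolygons, e.g.:

"coordinates": [
  [100.0, 0.0],
  [101.0, 1.0]
]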

@duncandewhurst does this seem right? @matamadio we'll be able to advise on the correct spreadsheet entry here once we've ensured the schema is right :)

duncandewhurst commented 1 year ago

All of the GeoJSON geometry types (Point, MultiPoint, LineString, MultiLineString, Polygon, MultiPolygon) are valid against the current schema, so there's nothing wrong with the schema in terms of correct data validating.

However, because coordinates is an array whose items can be either numbers or arrays with items of any type, the validation is very permissive, so it is possible for incorrect data to validate against the schema; e.g. "coordinates": [["a"]] would pass validation, whilst clearly not being a valid set of coordinates for any geometry type.

The challenge is that, in order to express different geometry types, the schema needs to allow for nested coordinate pairs with differing constraints on the number of items depending on the geometry type, and that nesting can make for a very complex schema and error messages.

We could tighten up validation by constraining the type of the nested arrays without the error messages getting too complex:

Schema

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "properties": {
    "coordinates": {
      "title": "Coordinates",
      "type": "array",
      "description": "One or more GeoJSON positions according to the GeoJSON geometry type defined in `.type`.",
      "items": {
        "type": [
          "number",
          "array"
        ],
        "items": {
          "type": [
            "number",
            "array"
          ],
          "items": {
            "type": [
              "number",
              "array"
            ],
            "items": {
              "type": "number"
            },
            "minItems": 1
          },
          "minItems": 1
        },
        "minItems": 1
      },
      "minItems": 1
    }
  }
}

Invalid data

{
  "type": "MultiLineString",
  "coordinates": [
    [
      [
        0,
        0
      ],
      [
        1,
        1
      ]
    ],
    [
      [
        1,
        1
      ],
      [
        "a",
        2
      ]
    ]
  ]
}

Error message

(screenshot of the validation error message)
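
For contrast, replacing "a" with a number gives an instance that is both a valid MultiLineString and valid against the tightened schema (illustrative values):

{
  "type": "MultiLineString",
  "coordinates": [
    [
      [0, 0],
      [1, 1]
    ],
    [
      [1, 1],
      [2, 2]
    ]
  ]
}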

However, some types of incorrect data will still validate against the schema, for example coordinates with the wrong nesting depth for the declared geometry type, or positions with too few values.
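
A sketch of one such case, assuming only the type and minItems constraints above: a Point whose coordinates are nested one level too deep still validates, even though GeoJSON requires a Point's coordinates to be a single position.

{
  "type": "Point",
  "coordinates": [
    [
      0,
      0
    ]
  ]
}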

It should be possible to implement conditional validation of types and the number of items using JSON Schema's if-then-else keywords, but that would make the schema and error messages very complicated and would likely require significant extra work on CoVE to make the error messages meaningful to users.
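
A minimal sketch of what that if-then approach might look like, for the Point case only (not a worked-through proposal):

{
  "if": {
    "properties": {
      "type": {
        "const": "Point"
      }
    },
    "required": [
      "type"
    ]
  },
  "then": {
    "properties": {
      "coordinates": {
        "type": "array",
        "items": {
          "type": "number"
        },
        "minItems": 2
      }
    }
  }
}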

There are some rules in GeoJSON that it is not possible to encode in JSON Schema, notably that linear rings (in Polygons and MultiPolygons) must be closed and must follow the right-hand rule. To validate that data conforms to those rules, it would be necessary to implement an additional check in CoVE.
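
For reference, a linear ring that satisfies both rules repeats its first position as its last position (closed) and winds counter-clockwise for an exterior ring (right-hand rule), e.g. (illustrative values):

{
  "type": "Polygon",
  "coordinates": [
    [
      [100.0, 0.0],
      [101.0, 0.0],
      [101.0, 1.0],
      [100.0, 1.0],
      [100.0, 0.0]
    ]
  ]
}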

So in the current block of work, from a schema perspective, I think we have the following options:

  1. Leave it as it is
  2. Add the type constraints described above

Conditional validation and additional checks on linear rings would need to wait for a future block of work covering both the schema and CoVE.

I'll open a separate issue on input formats in the spreadsheet template.

odscjen commented 1 year ago

I suspect that, as polygons will be the most commonly used geometry type here, we should go for option 2 and add the type constraints.

matamadio commented 1 year ago

The constraint-validated approach would indeed be the most complete, but I see the time requirement and complications it could bring. Let's take a step back and consider whether this is worth the effort.

My suggestion is to limit the spatial field to bbox only, expecting a standard 4-number input (WGS84).
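
In JSON terms, and assuming the bbox field is stored as a four-number array in (west, south, east, north) order, a dataset-level entry might then look like this (illustrative):

"spatial": {
  "bbox": [
    -180,
    -90,
    180,
    90
  ]
}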

odscjen commented 1 year ago

My suggestion is to limit the spatial field to bbox only, expecting a standard 4-number input (WGS84).

I agree with this. As Mat says, types like line or point are not relevant for describing the extent of a dataset, and it's not really important for someone to include the full outline of a potentially complex polygon in what is essentially catalogue-level metadata.

duncandewhurst commented 1 year ago

Moving back to 'under discussion' because the Location definition is used in three places.

Is it fine to remove centroid and geometry in all of those cases?


Generally, I would lean towards keeping at least geometry for the following reasons:

That said, happy to take your steer on this. If you think there is no chance of geometry actually being used in practice, we should remove it.

stufraser1 commented 1 year ago

Good catch - polygon may realistically be used in those cases of vulnerability and event_set. I see specific guidance in vulnerability/spatial that .scale is required. This type of guidance could be replicated for spatial (dataset level) to direct users to use a bounding box, with the other fields being optional.

matamadio commented 1 year ago

I don't see a strong advantage in a polygon-shaped area definition instead of a box, but I agree it could be useful to support it. Keep it optional, though. If we allow polygons, there should also be guidelines on the maximum number of vertices (coordinates) to include, e.g. max 50 points, to avoid issues. In case of implementation issues, I would consider it a lower priority (postponable to the next update).

duncandewhurst commented 1 year ago

Okay, sounds good. I think we have agreement that the only change required is to update the description of the dataset-level spatial field to direct users to use a bounding box, with the other fields being optional.

matamadio commented 4 months ago

We are trying to enter a bbox for some data.

Field: spatial/bbox
Instructions: Enter multiple values as a semicolon-separated list, e.g. a;b;c. Values must not contain semicolons or commas.
Value entered: -180 -90 180 90

Example

Stu and I tried many formats but the converter didn't like any of them :(

odscjen commented 4 months ago

The values in the xlsx weren't semicolon-separated in either of the sheets bbox appears in. I've updated the sheet and run it again, and it now passes, at least for the bbox test :) https://metadata.riskdatalibrary.org/data/b7762418-48f9-4a88-8257-029669887115

There must be a semicolon between each of the coordinates; the second part of the instructions is saying that each individual coordinate value must not contain a semicolon or comma.
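
If that's right, the cell value presumably needs to be the four bbox numbers separated by semicolons, with no spaces or commas, e.g.:

-180;-90;180;90

If Excel complains about the leading minus, wrapping the value as ="-180;-90;180;90" (as in the earlier comment) may be needed.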