frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

Support for tagged union field types #882

Open khusmann opened 4 months ago

khusmann commented 4 months ago

(Adapted from my comment here at @peterdesmet's request!)

(Note that I consider this proposal low-priority for now, because it depends on the acceptance of the categorical field type in #875.)

Sometimes tabular data are produced in a "long" format that combines data of multiple different types into a single field / column. I see this form of data a lot in event-driven sensor data. For example:

| measurementType | measurementValue |
| --- | --- |
| cloudiness | partly cloudy |
| cloudiness | cloudy |
| temperature | 1 |
| wind force | 5 |
| temperature | 10 |

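Where cloudiness is a categorical value (clear, mostly clear, partly cloudy, mostly cloudy, cloudy, or unknown), temperature is a number between 0 and 20, and wind force is a categorical value from 0 to 5.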

(Example adapted from @peterdesmet's work here)

Here, measurementValue is not a single type, but actually a union of three types: either a cloudiness measurement, a temperature measurement, or a wind force measurement, each with its own type definition and constraints. More specifically, this is a tagged union (aka discriminated union) compound type, where the type of measurementValue depends on the "tag" or "discriminator" found in measurementType.

Tagged union types are a well-established, well-understood abstraction already implemented in many programming languages (e.g. Python, Rust) and in semantic data parsing / validation libraries (e.g. Python's pydantic and TypeScript's zod).

Implementing this behavior as a tagged union field type would allow implementations to validate this kind of field by parsing each value according to the type selected by its tag. Implementations could also perform exhaustiveness checks on the definition, ensuring that every level of the categorical measurementType has a corresponding type definition. It would also make it easier to pivot into wider table formats, because the dependent type definitions translate directly into the column types of the resulting wide columns.
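As a rough illustration of the validation idea, here is a minimal Python sketch that uses the tag to pick a parser for the value. The parser functions, the `PARSERS` table, and `parse_measurement` are hypothetical names made up for this sketch, not part of any Frictionless implementation; the levels and bounds follow the measurement example in this issue.

```python
# Hypothetical per-tag parsers; each one validates and converts a raw cell value.
CLOUDINESS_LEVELS = {
    "clear", "mostly clear", "partly cloudy", "mostly cloudy", "cloudy", "unknown",
}


def parse_cloudiness(raw):
    if raw not in CLOUDINESS_LEVELS:
        raise ValueError(f"unknown cloudiness level: {raw!r}")
    return raw


def parse_temperature(raw):
    value = float(raw)
    if not 0 <= value <= 20:
        raise ValueError(f"temperature out of range: {value}")
    return value


def parse_wind_force(raw):
    value = int(raw)
    if value not in range(6):  # allowed levels are 0..5
        raise ValueError(f"unknown wind force level: {value}")
    return value


# The tag (measurementType) selects which parser applies to measurementValue.
PARSERS = {
    "cloudiness": parse_cloudiness,
    "temperature": parse_temperature,
    "wind force": parse_wind_force,
}


def parse_measurement(tag, raw):
    """Parse a raw measurementValue according to its measurementType tag."""
    try:
        parser = PARSERS[tag]
    except KeyError:
        raise ValueError(f"no type defined for tag {tag!r}") from None
    return parser(raw)
```

For instance, `parse_measurement("temperature", "10")` returns a float, while `parse_measurement("temperature", "25")` raises a `ValueError` because the value is outside the allowed range.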

Here's an example of what a tagged union field type might look like in frictionless (using the proposed categorical syntax in #875):

{
  "fields": [
    {
      "name": "measurementType",
      "type": "categorical",
      "categories": ["cloudiness", "temperature", "wind force"]
    },
    {
      "name": "measurementValue",
      "type": "union",
      "tag": "measurementType",
      "match": {
        "cloudiness": {
          "type": "categorical",
          "categories": ["clear", "mostly clear", "partly cloudy", "mostly cloudy", "cloudy", "unknown"]
        },
        "temperature": {
          "type": "number",
          "constraints": {
            "minimum": 0,
            "maximum": 20
          }
        },
        "wind force": {
          "type": "categorical",
          "categories": [0, 1, 2, 3, 4, 5]
        }
      }
    }
  ]
}
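To make the pivoting point concrete, here is a small Python sketch that derives the field definitions of a wide table from the `match` property above. `wide_fields` is a hypothetical helper, and the `union`, `tag`, and `match` keys are only the syntax proposed in this issue, not an accepted part of the spec. The temperature constraints are left out to keep the sketch short.

```python
# The union field from the example descriptor (temperature constraints omitted).
union_field = {
    "name": "measurementValue",
    "type": "union",
    "tag": "measurementType",
    "match": {
        "cloudiness": {
            "type": "categorical",
            "categories": ["clear", "mostly clear", "partly cloudy",
                           "mostly cloudy", "cloudy", "unknown"],
        },
        "temperature": {"type": "number"},
        "wind force": {"type": "categorical", "categories": [0, 1, 2, 3, 4, 5]},
    },
}


def wide_fields(union):
    """Return one field definition per tag, named after the tag.

    Each entry of `match` becomes the definition of one wide column.
    """
    return [
        {"name": tag, **definition}
        for tag, definition in union["match"].items()
    ]


fields = wide_fields(union_field)
```

Each key of `match` becomes a column name, and its type definition carries over unchanged, so the wide table ends up with cloudiness, temperature, and wind force columns.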

Note that field-level validation of this type would ensure that every level of the measurementType categorical field is represented as a key of the match property in the measurementValue field. For example, if temperature weren't defined as a key in the match property, this would trigger a validation error, because temperature is one of the levels of the measurementType field. As mentioned earlier, exhaustiveness checking of this kind is a common feature of tagged union types.
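A minimal Python sketch of that exhaustiveness check might look like the following. `missing_match_keys` is a hypothetical helper, and it assumes `categories` is a plain list of levels, as in the example above. It only reports levels that lack a `match` entry and does not look at extra keys.

```python
def missing_match_keys(fields, union_name):
    """Return the levels of the tag field that have no entry in `match`."""
    by_name = {field["name"]: field for field in fields}
    union = by_name[union_name]
    tag_field = by_name[union["tag"]]
    return [
        level
        for level in tag_field["categories"]
        if level not in union["match"]
    ]


# Example descriptor with the temperature definition deliberately left out.
fields = [
    {
        "name": "measurementType",
        "type": "categorical",
        "categories": ["cloudiness", "temperature", "wind force"],
    },
    {
        "name": "measurementValue",
        "type": "union",
        "tag": "measurementType",
        "match": {
            "cloudiness": {"type": "categorical", "categories": ["clear", "cloudy"]},
            "wind force": {"type": "categorical", "categories": [0, 1, 2, 3, 4, 5]},
        },
    },
]

missing = missing_match_keys(fields, "measurementValue")
```

A validator could raise an error whenever the returned list is non-empty; here it contains only temperature.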

If there is interest in this type, I can put together a more formal definition of the proposed union field's type signature (and RFC language).