frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

Comparison with Discover and Data Cube #870

pukkamustard closed 2 months ago

pukkamustard commented 5 months ago

Just curious about how Frictionless Data compares to existing standards such as Discovery for describing data sets and Data Cube for tabular data. I couldn't find any references in the documentation. Maybe a small note that compares Frictionless Data to existing standards would be helpful?

rjgladish commented 5 months ago

Frictionless Tabular Schema typically describes the physical structure of ONE table [descriptor] within a dataset as one or more field descriptors. The optional field descriptor property rdfType can be added to establish a relationship between the field value datatype and the interpretation of the value; essentially, the values represent a measured observation. The documentation for tabular and field descriptors is clear and concise.

e.g. depth (numeric), with no mention of units of measure or of what that depth represents in the physical world

In contrast, Discovery focuses on the vocabulary characterizing the collection and utilization of a dataset, allowing others to discover datasets by searching for specific questions, topics, and geographical coverage (see https://rdf-vocabulary.ddialliance.org/discovery.html#scope-and-purpose).

e.g. Riverine Data Samples

Data Cube references datasets abstractly (see https://www.w3.org/TR/vocab-data-cube/#cubes-model-datasets), and focuses on the quantity-kind, units, and attributes of measured (or calculated) observation data.

A dataset comprises a collection of observations as a set of dimensions, attributes and measures as RDF properties. The collection can be characterized by a set of dimensions that define what the observation applies to (e.g. time, area, gender) along with metadata describing what has been measured (e.g. economic activity, population), how it was measured and how the observations are expressed (e.g. units, multipliers, status). See Data Cube model https://www.w3.org/TR/vocab-data-cube/#cubes-model
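To make the Data Cube model more concrete, here is a minimal sketch of a single observation written as plain (subject, predicate, object) tuples in Python. The `https://example.org/riverine/` namespace, the `riverDepth` measure property, and the values are hypothetical and made up for illustration; the `qb:`, `sdmx-dimension:` and `sdmx-attribute:` terms come from the Data Cube recommendation, and a real implementation would likely use an RDF library rather than tuples.

```python
# Namespaces used by the RDF Data Cube vocabulary.
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
SDMX_ATTR = "http://purl.org/linked-data/sdmx/2009/attribute#"

# Hypothetical namespace and measure property, for illustration only.
EX = "https://example.org/riverine/"
OM_METRE = "http://www.ontology-of-units-of-measure.org/resource/om-2/metre"

observation = EX + "obs/1"

# One observation: a dimension (when), a measure (what was observed),
# and an attribute (the unit in which the value is expressed).
triples = [
    (observation, RDF_TYPE, QB + "Observation"),
    (observation, QB + "dataSet", EX + "dataset/samples"),
    (observation, SDMX_DIM + "refPeriod", "2024-05-01"),   # made-up value
    (observation, EX + "riverDepth", 2.4),                 # made-up value
    (observation, SDMX_ATTR + "unitMeasure", OM_METRE),
]


def properties_of(subject, graph):
    """Collect predicate -> object pairs for one subject."""
    return {p: o for s, p, o in graph if s == subject}


for predicate, value in properties_of(observation, triples).items():
    print(predicate, "->", value)
```

The point of the sketch is that the unit and the dimension travel with every observation, which a bare numeric field in a physical table schema does not give you.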

e.g. http://www.ontology-of-units-of-measure.org/resource/om-2/metre

The closest comparison to Frictionless Tabular Schema is roughly the Metadata Vocabulary for Tabular Data (https://www.w3.org/TR/tabular-metadata/). Together with the CSVW namespace (https://www.w3.org/ns/csvw), it establishes a vocabulary for table and column descriptors of CSV files. CSVW describes dataset tables in terms of table descriptors and column descriptors using an RDF vocabulary, and it can define a relationship between the observation value and the underlying concept.

Say a field in Frictionless has name="Depth", a numeric datatype, and rdfType=..../riverDepth. Per the standard, rdfType MUST be the URI of an RDF Class, that is, an instance or subclass of the RDF Schema Class object.
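As a rough sketch, such a field descriptor could be written out like this. The vocabulary URI is a hypothetical placeholder (the real prefix is elided above), and `has_absolute_rdf_type` is an illustrative helper, not part of any Frictionless library; it only checks that the value looks like an absolute URI, not that it actually names an RDF Class.

```python
import json
from urllib.parse import urlparse

# Hypothetical class URI, standing in for the elided ".../riverDepth".
RIVER_DEPTH_CLASS = "https://example.org/vocab/riverDepth"

# A single field descriptor: name, type, and the optional rdfType.
depth_field = {
    "name": "Depth",
    "type": "number",
    "rdfType": RIVER_DEPTH_CLASS,
}

# A schema is essentially a list of such field descriptors.
table_schema = {"fields": [depth_field]}


def has_absolute_rdf_type(field):
    """Return True if the field carries an rdfType that parses as an absolute URI."""
    parsed = urlparse(field.get("rdfType", ""))
    return bool(parsed.scheme and parsed.netloc)


print(json.dumps(table_schema, indent=2))
print(has_absolute_rdf_type(depth_field))
```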

The same column in CSVW has http://www.w3.org/ns/csvw#name="Depth", a numeric datatype URI, and http://www.w3.org/ns/csvw#propertyUrl= .../riverDepth
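In a CSVW metadata document the same column would typically appear with short keys resolved through the `@context`, roughly as sketched below. Note that the CSVW vocabulary spells the property `propertyUrl`. The file name and the vocabulary URI are hypothetical placeholders, since the original URI is elided.

```python
import json

# Hypothetical property URI, standing in for the elided ".../riverDepth".
RIVER_DEPTH_PROPERTY = "https://example.org/vocab/riverDepth"

# Minimal CSVW table description with one column.
csvw_metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "riverine-samples.csv",  # hypothetical CSV file name
    "tableSchema": {
        "columns": [
            {
                "name": "Depth",
                "datatype": "number",
                "propertyUrl": RIVER_DEPTH_PROPERTY,
            }
        ]
    },
}

print(json.dumps(csvw_metadata, indent=2))
```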

From there, ..../riverDepth might carry Data Cube properties that specify the quantityKind and units of measure, along with other model relationships that are implied, but not explicitly described, in the physical schema. The Discovery dataset may reference an ontology that describes all the characteristics of riverine observations, including depth, temperature, and turbidity, along with information about which river the observations were taken from, when and why the data was collected, and so on.

There's a ROUGH equivalence between most of CSVW and Frictionless Tabular Schema, but I personally find Frictionless Tabular Schema much more straightforward and easier to implement than CSVW.
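One way to picture that rough equivalence is a small converter from a Frictionless field descriptor to a CSVW column description. This is only a sketch under stated assumptions: `field_to_csvw_column` and `TYPE_MAP` are hypothetical names, the type mapping covers a handful of types, falling back to "string" is my assumption, and mapping rdfType (a class) onto propertyUrl (a property) follows the loose analogy above rather than any formal alignment.

```python
# Partial, illustrative mapping from Frictionless field types to CSVW datatypes.
TYPE_MAP = {
    "string": "string",
    "number": "number",
    "integer": "integer",
    "boolean": "boolean",
    "date": "date",
    "time": "time",
    "datetime": "datetime",
}


def field_to_csvw_column(field):
    """Translate one Frictionless field descriptor into a CSVW column dict."""
    column = {"name": field["name"]}
    # Assumption: unknown or missing types fall back to "string".
    column["datatype"] = TYPE_MAP.get(field.get("type"), "string")
    # Loose analogy only: rdfType names a class, propertyUrl names a property.
    if "rdfType" in field:
        column["propertyUrl"] = field["rdfType"]
    if "title" in field:
        column["titles"] = field["title"]
    return column


example_field = {
    "name": "Depth",
    "type": "number",
    "rdfType": "https://example.org/vocab/riverDepth",  # hypothetical URI
}
print(field_to_csvw_column(example_field))
```

Anything beyond these few keys (constraints, missing values, foreign keys) would need case-by-case handling, which is where the two formats stop lining up neatly.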

I've seen encouraging discussion of late on clarifying the use of a JSON-LD context to bridge rdfType and RDF Class.