Swirrl / table2qb

A generic pipeline for converting tabular data into rdf data cubes
Eclipse Public License 1.0

Review with respect to W3C Tabular Data Model #1

Closed Robsteranium closed 6 years ago

Robsteranium commented 7 years ago

https://www.w3.org/TR/tabular-data-model/

Of particular note are the overall JSON metadata schema and the URI templating approach. There may be re-usable components, and there's probably a benefit to standardising where relevant.

Robsteranium commented 7 years ago

I've added an example of how the metadata json would look in the csvw metadata format.

The use of aboutUrl, propertyUrl and valueUrl for s-p-o templates is quite elegant (the templates are expanded per cell and can be set on a column or inherited from the schema or table) and would work for more than just rdf-cubes. The use of virtual columns is also quite neat - allowing you to describe intermediate transformation steps or to add arbitrary statements (here I've used it to relate observations to the cube).
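To make that concrete, here's a minimal sketch of what such a table schema could look like. The column names, the example.org cube and measure URIs, and the gov.uk URI patterns are purely illustrative; only the sdmx-dimension properties are real:

    {
      "@context": "http://www.w3.org/ns/csvw",
      "url": "observations.csv",
      "tableSchema": {
        "aboutUrl": "http://example.org/data/my-cube/obs/{refArea}/{refPeriod}",
        "columns": [
          {
            "name": "refArea",
            "titles": "Reference Area",
            "propertyUrl": "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
            "valueUrl": "http://statistics.data.gov.uk/id/statistical-geography/{refArea}"
          },
          {
            "name": "refPeriod",
            "titles": "Reference Period",
            "propertyUrl": "http://purl.org/linked-data/sdmx/2009/dimension#refPeriod",
            "valueUrl": "http://reference.data.gov.uk/id/year/{refPeriod}"
          },
          {
            "name": "count",
            "titles": "Count",
            "propertyUrl": "http://example.org/def/measure/count",
            "datatype": "integer"
          },
          {
            "virtual": true,
            "propertyUrl": "http://purl.org/linked-data/cube#dataSet",
            "valueUrl": "http://example.org/data/my-cube"
          }
        ]
      }
    }

Each row then yields one observation: the aboutUrl template builds its URI, the real columns contribute the dimension and measure triples, and the virtual column adds the qb:dataSet link back to the cube.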

DSD and Component Specs

We would require some additional information, however. Most notably, we need to know the type of each component (dimension, measure or attribute) so we can attach it to the DSD. We could look the propertyUrls up in the database or, if necessary, add this attribute to the column description object. This really depends on how we arrange the overall architecture.
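If we did put it in the column description, CSVW's common properties would let us attach an annotation along the lines of the hypothetical componentType property below. To be clear, that property is invented for illustration and a standard csv2rdf processor wouldn't build a DSD from it - our own pipeline would have to interpret it:

    {
      "name": "gender",
      "titles": "Gender",
      "propertyUrl": "http://example.org/def/dimension/gender",
      "valueUrl": "http://example.org/def/concept/gender/{gender}",
      "http://example.org/def/componentType": { "@id": "http://purl.org/linked-data/cube#DimensionProperty" }
    }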

The csvw rdf cube example includes a json-ld definition of the DSD, similar to that used in the original strict json:

  "qb:structure": {
      "qb:component": [
          { "qb:dimension": "sdmx-dimension:refArea" },
          { "qb:dimension": "sdmx-dimension:refPeriod" },
          { "qb:dimension": "qb:measureType" },
          { "qb:measure": "???" },
          { "qb:attribute": "sdmx-attribute:unitMeasure" },
          { "qb:dimension": "sdmx-dimension:age" },
          { "qb:dimension": "dimension:gender" }
      ]
  }

The measure component specs are thus a little trickier. We could either a) define all of these in the json (as a result of a first pass in relaxed mode) or b) define them with virtual columns in the table-schema (we'd need to give the DSD a URI instead of a blank node as above, and also ensure that the component-spec URIs didn't clash); a sketch of option b) follows below.
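A rough sketch of option b), with invented example.org URIs: a measure_type column plus two virtual columns that emit the qb:component and qb:measure triples, with the DSD at a fixed URI and the component-spec URI keyed on the measure so the specs don't clash (the other observation columns are omitted):

    {
      "columns": [
        {
          "name": "measure_type",
          "titles": "Measure Type",
          "propertyUrl": "http://purl.org/linked-data/cube#measureType",
          "valueUrl": "http://example.org/def/measure/{measure_type}"
        },
        {
          "virtual": true,
          "aboutUrl": "http://example.org/data/my-cube/structure",
          "propertyUrl": "http://purl.org/linked-data/cube#component",
          "valueUrl": "http://example.org/data/my-cube/component/{measure_type}"
        },
        {
          "virtual": true,
          "aboutUrl": "http://example.org/data/my-cube/component/{measure_type}",
          "propertyUrl": "http://purl.org/linked-data/cube#measure",
          "valueUrl": "http://example.org/def/measure/{measure_type}"
        }
      ]
    }

Re-asserting the same component triples on every row is harmless since RDF triples form a set, but it does mean the DSD would only cover measures that actually appear in the data.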

Lookup Tables

It looks like we can also describe foreign key relations under a Schema. These could be useful for specifying lookups (e.g. from ids or codes to labels). That makes sense when combining csv documents, but perhaps not when the inputs come from database queries etc.
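For reference, a foreign key is declared on the schema of the referencing table and points at another table in the same table group; a sketch with made-up file and column names, linking a gender code to its label row:

    {
      "@context": "http://www.w3.org/ns/csvw",
      "tables": [
        {
          "url": "observations.csv",
          "tableSchema": {
            "columns": [
              { "name": "gender", "titles": "Gender" }
            ],
            "foreignKeys": [
              {
                "columnReference": "gender",
                "reference": { "resource": "gender-codes.csv", "columnReference": "notation" }
              }
            ]
          }
        },
        {
          "url": "gender-codes.csv",
          "tableSchema": {
            "columns": [
              { "name": "notation", "titles": "Notation" },
              { "name": "label", "titles": "Label" }
            ],
            "primaryKey": "notation"
          }
        }
      ]
    }

As far as I can tell the foreign key acts as a validation constraint rather than generating triples, so actually joining codes to labels would still be down to our own processing.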

Cell Transformations

It doesn't appear to be possible to describe cell-transformations - most notably slugize - so it looks like we're always going to have to pre-process the table to do this. There might be something in URI templates though (#3). This needs a bit more investigation.
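For what it's worth, a URI template only substitutes the percent-encoded cell value, so a hypothetical column like the one below would turn a label cell of 'East Renfrewshire' into .../East%20Renfrewshire rather than a slug like .../east-renfrewshire:

    {
      "name": "area",
      "titles": "Area",
      "propertyUrl": "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
      "valueUrl": "http://example.org/id/area/{area}"
    }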

BillSwirrl commented 7 years ago

thanks for this Robin - v interesting analysis and experiment.

I suppose the challenge here is how we know which columns in the input are meant to correspond to dimensions, measures or attributes. In our current sns2rdf we do that by having specific names, 'Measurement' and 'Units', for the measure column and the unit/attribute column; everything else is assumed to be a dimension.

A similar approach would be possible for a CSVW approach - i.e. a naming convention for columns.

A more general approach - particularly if we are going to assume vocabs are managed outside of the table2qb pipeline - would be to look up an existing ontology to find out whether a particular property was already defined as a qb:DimensionProperty, qb:MeasureProperty or qb:AttributeProperty.

Given that our DSDs have codelists in them listing all the distinct values in the dataset, we can't rely on the input metadata having a complete set of DSD triples in it. In any case, that would make the metadata file very big and would need automated support for generating it.

So I think a better approach would be for our pipeline to generate the DSD, and not to try to put a complete or partial DSD into the metadata. It would mean that our pipeline generates more/different RDF from what is defined in https://www.w3.org/TR/csv2rdf/. That's probably ok - I think there is still value in following the standard in terms of the metadata spec, even if our processor for the tabular data is not standards compliant.

That might be something we could discuss with Jeremy Tandy or Jeni Tennison (both involved in creating these specs).

Re cell transformations and 'slugize', perhaps we require that the contents of the input are already slugs (rather than labels) - and the labels are defined in the already-existing vocabs that we are depending on.

(Btw, I appreciate this makes the pipeline more complicated as it is depending on external data - we'll need to make sure we don't mess up the performance of it)

In general, preparation of the metadata is going to be complicated. It encompasses various data modelling decisions that need some expert input. So we need to decide how we do that. Do we (Swirrl) create these on behalf of PMD users? That's possible, but we would strongly prefer that PMD users can create new vocabs and datasets without needing Swirrl input.

Robin's idea of a 'relaxed mode' pre-processing pipeline that creates the 'strict mode' inputs is definitely an option. Though if we want to constrain property and dimension value choices to 'allowed'/'managed' values, then that would still need more constraints than our current approach in sns2rdf.

Maybe we need some additional tooling to help users decide what vocab they want and then use that to generate metadata and validate input? Some of that could be built into an interactive tool of some sort, but would that get in the way of automation?

What would a user (e.g. Liam, Gregor et al, who are currently generating input data for SG pipelines) actually do, in detail, to generate input?

I haven't thought this through properly... but maybe we could create an interactive tool which lets a user create their own strict-mode pipeline config. It could take them through some kind of wizard-like approach to select, from a curated list of ontologies and concept schemes, which properties to associate with columns and which values are allowed to appear in the cells. It would save those choices in the CSVW metadata file. Future updates to that dataset could use the same metadata file (which could potentially be held in PMD, associated with the dataset).

Robsteranium commented 7 years ago

I think, broadly speaking, we should adopt the csvw model for loading tables of observations.

If we build the DSD (and comp specs etc.) separately, then we can also resolve the qb:ComponentProperty sub-property separately (i.e. whether via convention, query or explicit configuration).

New observations might extend the codelists, but they cannot change the DSD itself. It's not clear to me whether we would want this extension to happen as part of the loading process (with a retrospective warning/notification - "did you mean to add 'males'?") or beforehand (so that the observation pipeline would validate against an existing spec).

I agree we should make the csvw pipeline strict w.r.t. readily URI-able inputs (i.e. pre-slugged or prefixed).

I like the suggestion in that last paragraph. We'd need to add tools for building/maintaining ComponentProperties and CodeLists.

Robsteranium commented 6 years ago

I think these concerns have now been picked up in other threads.