Swirrl / table2qb

A generic pipeline for converting tabular data into rdf data cubes
Eclipse Public License 1.0

Wiring-up table2qb, csv2rdf.clj and grafter #20

Open Robsteranium opened 6 years ago

Robsteranium commented 6 years ago

The following provides a specification of what we'll need from this integration.

Retrieve the configuration from the database

We're currently loading the configuration from a csv file when the project is initialised. Instead we ought to be able to load it via SPARQL when the transformation pipeline is called (#21).

This would allow users to modify how the transformation pipeline is configured. You would first upload your new column configurations (with a suitable pipeline), then upload a spreadsheet of observations using the columns you'd just configured. This will be critical for handling e.g. new dimensions given that, unlike past pmd transformation pipelines, we're no longer creating vocabularies at the same time as observations.

Specifically this requires:

Similarly we will ultimately want to look up codelists in order to a) look up URIs for codes from their labels, and (potentially) b) validate the inputs so we can fail early (before generating rdf).
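Roughly, the SPARQL-backed lookup might look something like the following. This is only a sketch: it assumes grafter's grafter.rdf.repository namespace (sparql-repo, ->connection, query) as in grafter 0.x, and the column-configuration predicates are invented for illustration - the real vocabulary is whatever #21 settles on.

```clojure
(ns table2qb.config
  (:require [grafter.rdf.repository :as repo]))

;; Hypothetical predicates standing in for the real column-configuration
;; vocabulary.
(def column-config-query
  "SELECT ?name ?title ?component ?value_template WHERE {
     ?col a <http://example.org/def/ColumnConfiguration> ;
          <http://example.org/def/csvwName> ?name ;
          <http://purl.org/dc/terms/title> ?title ;
          <http://example.org/def/componentAttachment> ?component ;
          <http://example.org/def/valueTemplate> ?value_template .
   }")

(defn load-column-config
  "Fetch the column configuration from the live database (rather than a
  bundled csv) at the point the transformation pipeline is called.
  Returns a map keyed by column title."
  [sparql-endpoint]
  (let [r (repo/sparql-repo sparql-endpoint)]
    (with-open [conn (repo/->connection r)]
      (->> (repo/query conn column-config-query)
           (map (juxt (comp str :title) identity))
           (into {})))))
```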

Orchestrate the csv2rdf calls

table2qb will output multiple csv and json files. The integration will need to orchestrate the corresponding csv2rdf calls - i.e. have 3 grafter pipelines:

  1. components pipeline
  2. codelists pipeline
  3. cube pipeline (note that this consists of 6 json files and only 2 csv)

In each case we would need to pass the uploaded input csv to table2qb, then take the resulting json and csv and pass those to csv2rdf.
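As a sketch of the shape of that orchestration - the function names table2qb->csvw and csvw->rdf are placeholders, not the actual table2qb or csv2rdf.clj API:

```clojure
(ns pipeline.orchestrate
  (:require [clojure.java.io :as io]))

;; Placeholders for whatever table2qb and csv2rdf.clj actually expose;
;; only the shape of the orchestration matters here.
(defn table2qb->csvw
  "Stand-in for table2qb: would write csv+json pairs into work-dir and
  return them as a seq of {:csv file :json file} maps."
  [pipeline-kind input-csv column-config work-dir]
  [])

(defn csvw->rdf
  "Stand-in for csv2rdf: would return the quads produced from one
  csv+json pair."
  [csv json opts]
  [])

(defn run-pipeline
  "One of the three grafter pipelines (components, codelists or cube).
  Passes the uploaded observations csv to table2qb, then feeds each
  resulting csv+json pair to csv2rdf and concatenates the quads."
  [pipeline-kind input-csv column-config]
  (let [work-dir (io/file (System/getProperty "java.io.tmpdir")
                          (str "table2qb-" (name pipeline-kind)))]
    (mapcat (fn [{:keys [csv json]}]
              (csvw->rdf csv json {:mode :minimal}))
            (table2qb->csvw pipeline-kind input-csv column-config work-dir))))
```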

Optionally we could persist the intermediate outputs, potentially hosting them for remote retrieval. Indeed it seems that the csv2rdf standard requires that the json metadata refer to the csv file with a url property and that implementations are expected to retrieve their inputs from this - further this leads to outputs like e.g. <row> csvw:url <file://input.csv#row=1> in standard mode (i.e. the row URIs extend these urls). See #11 for more discussion.
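For illustration, the kind of metadata document this implies, written here as a Clojure map (the URLs and columns are made up; the csvw keys - url, tableSchema, aboutUrl - come from the spec):

```clojure
(require '[cheshire.core :as json])

;; The url property is what standard-mode row URIs (input.csv#row=1 etc.)
;; are derived from, so hosting the intermediate csv at a stable URL would
;; make those row URIs dereferenceable.
(def observations-metadata
  {"@context"    "http://www.w3.org/ns/csvw"
   "url"         "http://example.org/uploads/input.csv"
   "tableSchema" {"aboutUrl" "http://example.org/data/{dataset}/obs/{_row}"
                  "columns"  [{"name" "geography" "titles" "Geography"}
                              {"name" "value" "titles" "Value" "datatype" "decimal"}]}})

(spit "observations.json" (json/generate-string observations-metadata {:pretty true}))
```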

The practical consequence of hosting the intermediate files is that we could use them for a) debugging (such that the validation report could link to the cell/column/row that violated a given rule) and b) as a pre-made tabular serialisation (although this might only be a subset of the observations in a cube).

Note that we might want to ignore this optional persistence until we've implemented csv2rdf standard mode (since tracking cell inputs isn't part of the minimal mode we're targeting at this point).

At the moment, table2qb groups together multiple files by writing to/reading from the same directory. If we're dealing with pipelines that use file inputs we could a) submit a tar archive (which has the benefit of being a single request) or b) maintain state across partial requests (since the "job" would consist of several requests).
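Option a) is straightforward to put together - e.g. something like this, using Apache commons-compress (the namespace is illustrative):

```clojure
(ns pipeline.tar
  (:require [clojure.java.io :as io])
  (:import [org.apache.commons.compress.archivers.tar
            TarArchiveEntry TarArchiveOutputStream]))

(defn dir->tar
  "Bundle every file table2qb wrote to dir into a single tar archive, so
  the whole set of intermediate outputs can be submitted as one request."
  [dir tar-file]
  (with-open [out (TarArchiveOutputStream. (io/output-stream tar-file))]
    (doseq [^java.io.File f (->> (io/file dir)
                                 file-seq
                                 (filter #(.isFile ^java.io.File %)))]
      (.putArchiveEntry out (TarArchiveEntry. f (.getName f)))
      (io/copy f out)
      (.closeArchiveEntry out))))
```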

We'd need to be able to run this as a standalone grafter pipeline and within the context of grafter-server.

Validation

It makes sense to distinguish two opportunities for validation:

We would want to run tabular validations during the table2qb phase. We could use csvw's tabular metadata specification for describing the validation. Since the definition of "valid" will depend upon the column configuration, we would need to generate such a schema based upon what's in the database. Other validations (e.g. all cells are populated) would be context-free. Some examples of criteria:
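Whatever the criteria end up being, generating such a schema from the database-held configuration might look roughly like this. The config map keys (:name, :title, :datatype, :codes) are invented; the csvw keys (required, datatype, format-as-regex for strings) are from the tabular metadata spec.

```clojure
(require '[clojure.string :as string])

(defn column-config->table-schema
  "Derive a csvw tableSchema from the configured columns, so that 'valid'
  means 'matches the column configuration currently in the database'."
  [columns]
  {"tableSchema"
   {"columns"
    (vec (for [{:keys [name title datatype codes]} columns]
           (cond-> {"name" name "titles" title "required" true}
             datatype (assoc "datatype" datatype)
             ;; crude codelist check: constrain string cells to the known codes
             codes    (assoc "datatype" {"base"   "string"
                                         "format" (string/join "|" codes)}))))}})
```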

We would want to run graph validations after the csv2rdf phase. We could use the grafter-extra pmd/qb validators (SPARQL ASK), or something like the shapes report we've built for the NHS. In any case, the validator should be able to query the live contents of the database.
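e.g. along these lines - again assuming grafter's repository namespace, and with the ASK query being just one plausible check (observations whose dataset has no DSD):

```clojure
(ns pipeline.validate
  (:require [grafter.rdf.repository :as repo]))

;; An ASK in the spirit of the pmd/qb validators: true means the check failed.
(def observations-without-dsd
  "ASK {
     ?obs a <http://purl.org/linked-data/cube#Observation> .
     FILTER NOT EXISTS {
       ?obs <http://purl.org/linked-data/cube#dataSet> ?ds .
       ?ds  <http://purl.org/linked-data/cube#structure> ?dsd .
     }
   }")

(defn validate-graph
  "Run ASK checks against the live database after the csv2rdf phase."
  [sparql-endpoint checks]
  (let [r (repo/sparql-repo sparql-endpoint)]
    (with-open [conn (repo/->connection r)]
      (doall (for [[check-name q] checks]
               {:check check-name :ok? (not (repo/query conn q))})))))
```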

Notice that there is some overlap - specifically in looking up codes and ensuring partial uploads have a common DSD (and thus observation-URI structure). It might make more sense to do this in the context of the graph (where you can point at the codelists/DSD that are already loaded) but failing fast (i.e. at the table stage) may be preferable (and we've been able to return understandable errors at this stage in the past).

Robsteranium commented 6 years ago

I've added tasks to Jira:

RickMoynihan commented 6 years ago

runtime parameterisation of e.g. name->component (via dependency injection, atom/redef, or configuration monad)

We've been using integrant/duct on zib, drafter etc. It works quite well, and I'd ultimately like to move to using it inside grafter/grafter-server too. If you need this, I'd consider looking into it.

Integrant is the main piece, but duct has some extras; you're probably best starting small with integrant and adding aero if you need a few more bells and whistles - we currently do this on the stasher branch of drafter. Aero is good for adding some extra goodies on top of integrant (joining strings, pulling in env vars in the config etc.) prior to initialising.
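For what it's worth, a minimal integrant sketch of the name->component case - the keys and wiring are made up, not the actual table2qb/grafter-server setup:

```clojure
(ns table2qb.system
  (:require [integrant.core :as ig]))

(def config
  {:table2qb/column-config {:sparql-endpoint "http://localhost:5820/pmd/query"}
   ;; the pipeline gets its name->component lookup injected rather than
   ;; reading a csv at init time
   :table2qb/pipeline      {:name->component (ig/ref :table2qb/column-config)}})

(defmethod ig/init-key :table2qb/column-config [_ {:keys [sparql-endpoint]}]
  ;; would wrap the SPARQL lookup; stubbed here
  (fn [column-name] {:component (str "http://example.org/def/" column-name)}))

(defmethod ig/init-key :table2qb/pipeline [_ {:keys [name->component]}]
  (fn [input-csv] {:input input-csv :lookup name->component}))

(comment
  (def system (ig/init config))
  ((:table2qb/pipeline system) "observations.csv")
  (ig/halt! system))
```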