Swirrl / table2qb

A generic pipeline for converting tabular data into rdf data cubes
Eclipse Public License 1.0

Wiring-up table2qb, csv2rdf.clj and grafter #20

Open Robsteranium opened 6 years ago

Robsteranium commented 6 years ago

The following provides a specification of what we'll need from this integration.

Retrieve the configuration from the database

We're currently loading the configuration from a csv file when the project is initialised. Instead we ought to be able to load it via SPARQL when the transformation pipeline is called (#21).

This would allow users to modify how the transformation pipeline is configured. You would first upload your new column configurations (with a suitable pipeline), then upload a spreadsheet of observations using the columns you'd just configured. This will be critical for handling e.g. new dimensions given that, unlike past pmd transformation pipelines, we're no longer creating vocabularies at the same time as observations.

Specifically this requires:

Similarly we will ultimately want to look up codelists in order to a) look up URIs for codes from their labels, and (potentially) b) validate the inputs so we can fail early (before generating rdf).
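Roughly, the SPARQL-backed lookup might look something like the following. This is only a sketch: it assumes grafter's grafter.rdf.repository namespace (sparql-repo, ->connection, query) as in grafter 0.x, and the column-configuration predicates are invented for illustration - the real vocabulary is whatever #21 settles on.

```clojure
(ns table2qb.config
  (:require [grafter.rdf.repository :as repo]))

;; Hypothetical predicates standing in for the real column-configuration
;; vocabulary.
(def column-config-query
  "SELECT ?name ?title ?component ?value_template WHERE {
     ?col a <http://example.org/def/ColumnConfiguration> ;
          <http://example.org/def/csvwName> ?name ;
          <http://purl.org/dc/terms/title> ?title ;
          <http://example.org/def/componentAttachment> ?component ;
          <http://example.org/def/valueTemplate> ?value_template .
   }")

(defn load-column-config
  "Fetch the column configuration from the live database (rather than a
  bundled csv) at the point the transformation pipeline is called.
  Returns a map keyed by column title."
  [sparql-endpoint]
  (let [r (repo/sparql-repo sparql-endpoint)]
    (with-open [conn (repo/->connection r)]
      (->> (repo/query conn column-config-query)
           (map (juxt (comp str :title) identity))
           (into {})))))
```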

Orchestrate the csv2rdf calls

table2qb will output multiple csv and json files. The integration will need to orchestrate the corresponding csv2rdf calls - i.e. have 3 grafter pipelines:

  1. components pipeline
  2. codelists pipeline
  3. cube pipeline (note that this consists of 6 json files and only 2 csv)

In each case we would need to pass the uploaded input csv to table2qb, then take the resulting json and csv and pass those to csv2rdf.
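As a sketch of the shape of that orchestration - the function names table2qb->csvw and csvw->rdf are placeholders, not the actual table2qb or csv2rdf.clj API:

```clojure
(ns pipeline.orchestrate
  (:require [clojure.java.io :as io]))

;; Placeholders for whatever table2qb and csv2rdf.clj actually expose;
;; only the shape of the orchestration matters here.
(defn table2qb->csvw
  "Stand-in for table2qb: would write csv+json pairs into work-dir and
  return them as a seq of {:csv file :json file} maps."
  [pipeline-kind input-csv column-config work-dir]
  [])

(defn csvw->rdf
  "Stand-in for csv2rdf: would return the quads produced from one
  csv+json pair."
  [csv json opts]
  [])

(defn run-pipeline
  "One of the three grafter pipelines (components, codelists or cube).
  Passes the uploaded observations csv to table2qb, then feeds each
  resulting csv+json pair to csv2rdf and concatenates the quads."
  [pipeline-kind input-csv column-config]
  (let [work-dir (io/file (System/getProperty "java.io.tmpdir")
                          (str "table2qb-" (name pipeline-kind)))]
    (mapcat (fn [{:keys [csv json]}]
              (csvw->rdf csv json {:mode :minimal}))
            (table2qb->csvw pipeline-kind input-csv column-config work-dir))))
```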

Optionally we could persist the intermediate outputs, potentially hosting them for remote retrieval. Indeed it seems that the csv2rdf standard requires that the json metadata refer to the csv file with a url property and that implementations are expected to retrieve their inputs from this - further this leads to outputs like e.g. <row> csvw:url <file://input.csv#row=1> in standard mode (i.e. the row URIs extend these urls). See #11 for more discussion.
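For illustration, the kind of metadata document this implies, written here as a Clojure map (the URLs and columns are made up; the csvw keys - url, tableSchema, aboutUrl - come from the spec):

```clojure
(require '[cheshire.core :as json])

;; The url property is what standard-mode row URIs (input.csv#row=1 etc.)
;; are derived from, so hosting the intermediate csv at a stable URL would
;; make those row URIs dereferenceable.
(def observations-metadata
  {"@context"    "http://www.w3.org/ns/csvw"
   "url"         "http://example.org/uploads/input.csv"
   "tableSchema" {"aboutUrl" "http://example.org/data/{dataset}/obs/{_row}"
                  "columns"  [{"name" "geography" "titles" "Geography"}
                              {"name" "value" "titles" "Value" "datatype" "decimal"}]}})

(spit "observations.json" (json/generate-string observations-metadata {:pretty true}))
```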

The practical consequence of hosting the intermediate files is that we could use them for a) debugging (such that the validation report could link to the cell/column/row that violated a given rule) and b) as a pre-made tabular serialisation (although this might only be a subset of the observations in a cube).

Note that we might want to ignore this optional persistence until we've implemented csv2rdf standard mode (since tracking cell inputs isn't part of the minimal mode we're targeting at this point).

At the moment, table2qb groups together multiple files by writing to/reading from the same directory. If we're dealing with pipelines that use file inputs we could a) submit a tar archive (which has the benefit of being a single request) or b) maintain state across partial requests (since the "job" would consist of several requests).
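Option a) is straightforward to put together - e.g. something like this, using Apache commons-compress (the namespace is illustrative):

```clojure
(ns pipeline.tar
  (:require [clojure.java.io :as io])
  (:import [org.apache.commons.compress.archivers.tar
            TarArchiveEntry TarArchiveOutputStream]))

(defn dir->tar
  "Bundle every file table2qb wrote to dir into a single tar archive, so
  the whole set of intermediate outputs can be submitted as one request."
  [dir tar-file]
  (with-open [out (TarArchiveOutputStream. (io/output-stream tar-file))]
    (doseq [^java.io.File f (->> (io/file dir)
                                 file-seq
                                 (filter #(.isFile ^java.io.File %)))]
      (.putArchiveEntry out (TarArchiveEntry. f (.getName f)))
      (io/copy f out)
      (.closeArchiveEntry out))))
```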

We'd need to be able to run this as a standalone grafter pipeline and within the context of grafter-server.

Validation

It makes sense to distinguish two opportunities for validation:

We would want to run tabular validations during the table2qb phase. We could use csvw's tabular metadata specification for describing the validation. Since the definition of "valid" will depend upon the column configuration, we would need to generate such a schema based upon what's in the database. Other validations (e.g. all cells are populated) would be context-free. Some examples of criteria:
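Whatever the criteria end up being, generating such a schema from the database-held configuration might look roughly like this. The config map keys (:name, :title, :datatype, :codes) are invented; the csvw keys (required, datatype, format-as-regex for strings) are from the tabular metadata spec.

```clojure
(require '[clojure.string :as string])

(defn column-config->table-schema
  "Derive a csvw tableSchema from the configured columns, so that 'valid'
  means 'matches the column configuration currently in the database'."
  [columns]
  {"tableSchema"
   {"columns"
    (vec (for [{:keys [name title datatype codes]} columns]
           (cond-> {"name" name "titles" title "required" true}
             datatype (assoc "datatype" datatype)
             ;; crude codelist check: constrain string cells to the known codes
             codes    (assoc "datatype" {"base"   "string"
                                         "format" (string/join "|" codes)}))))}})
```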

We would want to run graph validations after the csv2rdf phase. We could use the grafter-extra pmd/qb validators (SPARQL ASK), or something like the shapes report we've built for the NHS. In any case, the validator should be able to query the live contents of the database.
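e.g. along these lines - again assuming grafter's repository namespace, and with the ASK query being just one plausible check (observations whose dataset has no DSD):

```clojure
(ns pipeline.validate
  (:require [grafter.rdf.repository :as repo]))

;; An ASK in the spirit of the pmd/qb validators: true means the check failed.
(def observations-without-dsd
  "ASK {
     ?obs a <http://purl.org/linked-data/cube#Observation> .
     FILTER NOT EXISTS {
       ?obs <http://purl.org/linked-data/cube#dataSet> ?ds .
       ?ds  <http://purl.org/linked-data/cube#structure> ?dsd .
     }
   }")

(defn validate-graph
  "Run ASK checks against the live database after the csv2rdf phase."
  [sparql-endpoint checks]
  (let [r (repo/sparql-repo sparql-endpoint)]
    (with-open [conn (repo/->connection r)]
      (doall (for [[check-name q] checks]
               {:check check-name :ok? (not (repo/query conn q))})))))
```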

Notice that there is some overlap - specifically in looking up codes and ensuring partial uploads have a common DSD (and thus observation-URI structure). It might make more sense to do this in the context of the graph (where you can point at the codelists/DSD that are already loaded) but failing fast (i.e. at the table stage) may be preferable (and we've been able to return understandable errors at this stage in the past).

Robsteranium commented 6 years ago

I've added tasks to Jira:

RickMoynihan commented 6 years ago

runtime parameterisation of e.g. name->component (via dependency injection, atom/redef, or configuration monad)

We've been using integrant/duct on zib, drafter etc. It works quite well, and I'd ultimately like to move to using it inside grafter/grafter-server too. If you need this, I'd consider looking into it.

Integrant is the main piece, but duct has some extras; you're probably best starting small with integrant and adding aero if you need a few more bells and whistles - we currently do this on the stasher branch of drafter. Aero is good for adding some extra goodies on top of integrant (joining strings, pulling in env vars in the config etc.) prior to initialising.
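For what it's worth, a minimal integrant sketch of the name->component case - the keys and wiring are made up, not the actual table2qb/grafter-server setup:

```clojure
(ns table2qb.system
  (:require [integrant.core :as ig]))

(def config
  {:table2qb/column-config {:sparql-endpoint "http://localhost:5820/pmd/query"}
   ;; the pipeline gets its name->component lookup injected rather than
   ;; reading a csv at init time
   :table2qb/pipeline      {:name->component (ig/ref :table2qb/column-config)}})

(defmethod ig/init-key :table2qb/column-config [_ {:keys [sparql-endpoint]}]
  ;; would wrap the SPARQL lookup; stubbed here
  (fn [column-name] {:component (str "http://example.org/def/" column-name)}))

(defmethod ig/init-key :table2qb/pipeline [_ {:keys [name->component]}]
  (fn [input-csv] {:input input-csv :lookup name->component}))

(comment
  (def system (ig/init config))
  ((:table2qb/pipeline system) "observations.csv")
  (ig/halt! system))
```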