Swirrl / table2qb

A generic pipeline for converting tabular data into rdf data cubes
Eclipse Public License 1.0

Configure column order #36

Open Robsteranium opened 6 years ago

Robsteranium commented 6 years ago

The order of columns in the observations input data determines the order in which slugs are inserted into observation URIs. This means that two spreadsheets with different column orders could produce two different URIs for the same observation. We need to make sure that the column order is consistent so that partial uploads (e.g. an initial insert followed by a subsequent update) don't create duplicates.
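To make the problem concrete, here is a minimal sketch of how column order can leak into a URI. The `observation_uri` helper, the slug rule and the `example.org` base are all made up for illustration; this is not table2qb's actual URI template or slugging code.

```python
def observation_uri(base, row, columns):
    """Build a URI by joining one slug per column, in the order given.

    Hypothetical helper: the slug rule here (lower-case, spaces to hyphens)
    is a stand-in and not table2qb's real slugging.
    """
    slugs = [row[c].strip().lower().replace(" ", "-") for c in columns]
    return base.rstrip("/") + "/" + "/".join(slugs)


# The same observation, read from two sheets whose columns are ordered differently.
row = {"Geography": "K02000001", "Year": "2016", "Sex": "All Persons"}
base = "http://example.org/data/cube"

uri_a = observation_uri(base, row, ["Geography", "Year", "Sex"])
uri_b = observation_uri(base, row, ["Year", "Geography", "Sex"])

print(uri_a)
print(uri_b)
print(uri_a == uri_b)  # False: one observation, two identifiers
```

Under these assumptions the second upload would be stored as a new observation rather than an update to the first.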

In the past we have stored the initial order of the upload using the qb:order property of the component spec (this was then later retrieved to enable the re-ordering of subsequent uploads).

In this case we could define the order by configuration, e.g. the order in which the columns are specified in columns.csv. Once we store the configuration in the database (#21), this would need to be recorded as an explicit column order property.

This configuration would be used to reorder the columns before passing to csv2rdf. We could also create e.g. <compspec1> qb:order 1 triples.
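A rough sketch of that reordering step, using only the Python standard library. The function name `reorder_columns`, the choice to append unconfigured columns at the end, and the sample data are assumptions for illustration; table2qb itself does not necessarily work this way.

```python
import csv
import io


def reorder_columns(csv_text, configured_order):
    """Return csv_text with its columns rearranged to follow configured_order.

    Columns missing from the configuration keep their relative order and are
    appended at the end (an assumption; rejecting them would also be reasonable).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    present = reader.fieldnames or []
    ordered = [c for c in configured_order if c in present]
    ordered += [c for c in present if c not in ordered]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=ordered, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


observations = "Year,Geography,Value\n2016,K02000001,10\n"
reordered = reorder_columns(observations, ["Geography", "Year"])
print(reordered)
```

The reordered text would then be what gets handed on to csv2rdf, so that the slugs always appear in the configured sequence regardless of how the spreadsheet was laid out.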

RickMoynihan commented 6 years ago

Added test for this in this PR on ONS:

https://github.com/ONS-OpenData/GDP-vocabs/pull/1

ajtucker commented 6 years ago

I refactored the tests to a separate repo, https://github.com/ONS-OpenData/GDP-tests. Would you be able to add it there?

Robsteranium commented 4 years ago

We've since seen table2qb being used quite often with a different columns-csv for each cube. Without a global columns configuration it wouldn't be possible to have a global ordering of columns.

Indeed, it may make more sense anyway to let the qb:order of component-specs be determined by the order in which columns appear in the input csv.

As above, this means that we would need to provide validation within the database to ensure that partial uploads of a cube were consistent. We could check that each component-spec had only one value of qb:order and this would guarantee consistency of observation URIs, at least for the default template (all bets are off if these are reconfigured #125).
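As a sketch of that check: assuming the (component-spec, qb:order) pairs have already been fetched from the database and from the incoming upload, a validation could look something like the following. The function name, the pair representation and the identifiers are hypothetical.

```python
from collections import defaultdict


def conflicting_orders(pairs):
    """Map each component-spec to its qb:order values where there is more than one.

    `pairs` is an iterable of (component_spec, order) tuples, e.g. the rows of
    a query over the stored DSD plus those generated by a new upload.
    An empty result means every component-spec has a single qb:order.
    """
    seen = defaultdict(set)
    for spec, order in pairs:
        seen[spec].add(order)
    return {spec: sorted(orders) for spec, orders in seen.items() if len(orders) > 1}


# First upload had Geography before Year; the second swaps them.
existing = [("compspec/geography", 1), ("compspec/year", 2)]
upload = [("compspec/year", 1), ("compspec/geography", 2)]

conflicts = conflicting_orders(existing + upload)
print(conflicts)
```

A non-empty result would be grounds to reject the upload, since the observation URIs it generates would not line up with those already stored.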

It would be possible for downstream applications like ons-table2qb that wrap table2qb to be more permissive. These applications could query the database to retrieve the component order of an existing cube from its DSD, then re-order the observation csv before passing it to table2qb. This is how sns-graft works: the first upload sets the precedent for the component order and subsequent uploads are pre-processed to match this.