rossbowen opened this issue 3 years ago
Some initial thoughts around versioning.
Three situations immediately come to mind:
- The Code links to this case study on handling revisions and corrections. Many organisations in government have published their revisions and corrections procedures. Example.
- Neither the Code nor the 2007 legislation gives explicit instruction on the requirements for handling corrections or revisions.
- Best practice has been established: departments publish a corrections and revisions policy clearly outlining the steps they will take.
- Other pieces of legislation, such as Freedom of Information, may mean departments are required, upon request, to provide unrevised versions of their statistics to the public domain. I suppose it is up for debate whether this becomes an explicit requirement of the service, or whether it would be handled by departments outside of the service.
Two approaches are being seen:
DfT's Transport Statistics Great Britain contains three levels of nesting: a collection contains many "datasets", and each "dataset" contains many "tables" of downloadable .ods data.

- Each year, the statistics producer replaces each of the downloadable tables with a new table containing all previous data plus the latest year of data available.
- DfT have actually created their own interactive data catalog for this release.
- I can't find previous versions of sheets on gov.uk.
```turtle
<transport-statistics-great-britain> a ?? ; # Is this a catalog?
    dct:hasPart <aviation> ;
    dct:hasPart <energy-and-environment> ;
    ...
    .

<aviation> a ?? ; # Is this a catalog?
    dct:identifier "TSGB02" ;
    dct:hasPart <air-traffic-at-uk-airports> ;
    dct:hasPart <air-traffic-by-type-of-service-operation-type-and-airport> ;
    ...
    .

<air-traffic-at-uk-airports> a dcat:Dataset, qb:DataSet ;
    dct:identifier "AVI0101" ;
    .

<air-traffic-by-type-of-service-operation-type-and-airport> a dcat:Dataset, qb:DataSet ;
    dct:identifier "AVI0102" ;
    .

# etc.
```
In this model, since each year new data gets added to the dataset, URIs would not change on a yearly basis. I would assume that if a revision was made, they again would replace the file and put some sort of notification on the page. I would also assume that if a user was interested in previous, unrevised versions of the data, they could approach the department directly.
BEIS's Fuel poverty statistics contains four levels of nesting: a collection is partitioned by year, each year contains many "datasets", and each "dataset" has many "tables" of downloadable .ods or .xlsx data. Some of the headers give links to .pdf reports.
Each year, the statistics producer appends a new set of downloadable tables. The new tables contain only the latest year of information available.
I'm assuming that splitting out data this way is down to the presentational nature of the tables being produced. It could be that, if the producer could be convinced to produce a time series of tidy data, they would be happy to adopt the model that DfT use.
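Sketching the BEIS nesting in the same style as the DfT example above (the URIs are illustrative, and dcat:Catalog for the outer levels is only a guess, subject to the same open question as before):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix qb:   <http://purl.org/linked-data/cube#> .

# The collection, partitioned by year.
<fuel-poverty-statistics> a dcat:Catalog ;
    dct:hasPart <fuel-poverty-2020>, <fuel-poverty-2021> ;
    .

# One year's partition, holding that year's "datasets".
<fuel-poverty-2021> a dcat:Catalog ;
    dct:hasPart <fuel-poverty-detailed-tables-2021> ;
    .

# A "dataset" whose tables contain only the latest year of data.
<fuel-poverty-detailed-tables-2021> a dcat:Dataset, qb:DataSet ;
    .
```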
Revisions mean many things:
- The Code of Practice T3.9 defines a scheduled revision as a planned amendment to published statistics in order to improve quality by incorporating additional data that were unavailable at the point of initial publication.
  - Some DfE statistics mark provisional and revised statistics separately and retain both versions on gov.uk (a rough sketch of how retained versions might be related in RDF is below).
- The Code of Practice T3.9 defines an unscheduled correction as an amendment made to published statistics in response to the identification of errors following their initial publication.
  - Some DWP statistics have been corrected. The corrected files have been uploaded and marked as having been revised, overwriting the incorrect files. A revision notice has been uploaded indicating to users that a correction has taken place.
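As an aside, if both versions were retained rather than overwritten (as DfE do), one way they could be related in RDF might be something like the sketch below. These are my suggestions only; the URIs and dates are made up, and prov:wasRevisionOf is just one candidate property.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# The original (provisional or erroneous) release stays published.
<some-statistics/2020> a dcat:Dataset ;
    dct:issued "2020-09-01"^^xsd:date ;
    .

# The revised or corrected release points back at what it replaces,
# with a note standing in for the revision/correction notice.
<some-statistics/2020-revised> a dcat:Dataset ;
    dct:issued "2021-01-15"^^xsd:date ;
    prov:wasRevisionOf <some-statistics/2020> ;
    rdfs:comment "Correction: an error identified in table 3 has been amended." ;
    .
```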
I've only spoken about statistical data here. For reference data (where codelists change over time etc.) XKOS has some potentially helpful resources.
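For example, successive versions of a codelist might look roughly like this (illustrative URIs, and I may be misremembering the exact XKOS terms):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xkos: <http://rdf-vocabulary.ddialliance.org/xkos#> .

# The classification "series", plus two dated versions of it.
<sic> a skos:ConceptScheme .

<sic-2003> a skos:ConceptScheme ;
    xkos:belongsTo <sic> ;
    .

<sic-2007> a skos:ConceptScheme ;
    xkos:belongsTo <sic> ;
    xkos:supersedes <sic-2003> ;
    .
```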
@rossbowen That's very interesting, thanks for the detailed write-up.
To briefly summarise, the two approaches appear to essentially be:

- Drop and Replace (DfT): each year, every downloadable table is replaced with a new one containing all previous data plus the latest year.
- Append Only (BEIS): each year, new tables containing only the latest year are added alongside the existing ones.
I think it's worth exploring what the implications are for republication in IDP. Obviously the person doing the ETL work needs to know which strategy upstream are using; however, I'd be curious what your thoughts are on whether upstream's strategy needs to affect IDP's approach to versioning.
In the absence of corrections or revisions, or structural changes to the upstream data, it should be possible to convert Drop and Replace into Append Only by essentially removing the previously published observations from each new set. As an RDF graph is a set of triples, merging in duplicate observations is idempotent, so this can happen for free; but I think there's some value in rejecting duplicate observations in a potential future write API (e.g. on add-slice) and forcing the ETL pipeline authors to assemble more minimal change sets. The principal benefit here would be in helping the data engineer understand sooner what changes upstream actually made. This would also help highlight cases where a revision or correction was sneaked in through the back door, which might itself be a potential error.
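To be concrete about what I mean by a more minimal change set: something like a slice carrying only the newly published observations for the latest period, rather than a restatement of the whole cube. This is purely illustrative (made-up URIs, and I'm not assuming anything about what the add-slice call would actually look like):

```turtle
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .

# A minimal change set for one release: just the new period's observations.
<air-traffic-at-uk-airports/slice/2021> a qb:Slice ;
    sdmx-dimension:refPeriod <period/2021> ;
    qb:observation
        <air-traffic-at-uk-airports/obs/2021/heathrow> ,
        <air-traffic-at-uk-airports/obs/2021/gatwick> ;
    .
```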
Obviously, regardless of the upstream model, the harder thing is determining what to do with corrections/revisions, which is a much wider discussion.
Here, though, I just wanted to question whether you think upstream's model needs to influence our model, and whether we can, through the platform, tooling, data modelling and engineering, support a more nuanced model of change than upstream provide. Or whether we always need to just do whatever upstream are doing? I'm clearly gunning for the latter.
From a dimensionality perspective it would be a good thing to look at a dataset like Prevent1, which in the annual publication provides more detailed present-year-only cubes, and less detailed multi-year cubes (i.e. the present-year-only cubes with a dimension dropped). Presenting the present-year-only cubes as a multi-year cube combined across publications would be a significant value add for the end user.
For example, in the Prevent1 data for 2019/2020, table 7 provides detailed information on PREVENT1 approaches for the 2019/2020 government-year period only, broken down by local authority; table 6, however, loses the local authority dimension but provides government-year periods from 2014/2015 to 2019/2020 and more detailed reasons for PREVENT1 approaches.
There would be a considerable convenience improvement in collating the table 7 data into a single cube across all annual publications.
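That is, rather than a separate present-year cube per publication, a single cube where the government-year period is just another dimension. A rough sketch (made-up URIs and dimensions; measure and attribute properties omitted):

```turtle
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .

# One combined cube: each annual publication's "table 7" contributes
# observations, distinguished by the reference period dimension.
<prevent-referrals> a qb:DataSet .

<prevent-referrals/obs/2018-19/some-local-authority> a qb:Observation ;
    qb:dataSet <prevent-referrals> ;
    sdmx-dimension:refPeriod <government-year/2018-2019> ;
    sdmx-dimension:refArea <local-authority/some-local-authority> ;
    .

<prevent-referrals/obs/2019-20/some-local-authority> a qb:Observation ;
    qb:dataSet <prevent-referrals> ;
    sdmx-dimension:refPeriod <government-year/2019-2020> ;
    sdmx-dimension:refArea <local-authority/some-local-authority> ;
    .
```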
During the alpha:

- We gathered requirements for managing dependencies, which touch on versioning.
- I collated the data consumer Use Cases from the UR we did in the Alpha as part of the trade proposal, with a summary on GSS-Cogs/family-trade#93.
- As I understand from @JohnLewisUR, we've not explored publisher requirements yet.
Multiple versions (revisions etc.) of data on PMDv4, and the capability to transform them etc.
From Rob T:
Under the current way of working, we only make available the latest version of a publication. Under a full publishing model we will need to take into account:
During Q1 2021 we need to consider these items and create a draft technical proposal for how they will be handled.
This has dependencies on things like the Metadata maturity, DOIs etc., which may affect the ability to deliver a meaningful proposal, in which case this will move to Q2 2021.