GSS-Cogs / sprint-planning


Versioning #2

Open rossbowen opened 3 years ago

rossbowen commented 3 years ago

Multiple versions (revisions etc.) of data on PMDv4, and the capability to transform them.

From Rob T:

Under the current way of working, we only make available the latest version of a publication. Under a full publishing model we will need to take into account:

During Q1 2021 we need to consider these items and create a draft technical proposal for how they will be handled.

This has dependencies on things like the Metadata maturity, DOIs etc which may affect the ability to deliver a meaningful proposal, in which case this will move to Q2 2021.

rossbowen commented 3 years ago

Some initial thoughts:

Versioning

Some thoughts around versioning.

Three situations immediately come to mind:

Notes from Code of Practice

The Code links to this case study on handling revisions and corrections. Many organisations in government have published their revisions and corrections procedures. Example.

Neither the Code nor the 2007 legislation gives explicit instruction on the requirements for handling corrections or revisions.

Best practice has been established: departments publish a corrections and revisions policy clearly outlining the steps they will take.

Other pieces of legislation, such as Freedom of Information, may mean departments are required, on request, to provide unrevised versions of their statistics to the public domain. It is up for debate whether this becomes an explicit requirement of the service, or whether it would be handled by departments outside of the service.

Updates

A publisher releases many statistical tables as a "compendium release".

Two approaches being seen:

Example 1 (Replacement)

DfT's Transport Statistics Great Britain contains three levels of nesting - a collection contains many "datasets", and each "dataset" contains many "tables" of downloadable .ods data.

Each year, the statistics producer replaces each of the downloadable tables with a new table containing all previous data plus latest year of data available.

DfT have actually created their own interactive data catalog for this release.

I can't find previous versions of sheets on gov.uk.

Structure
Data Model
```turtle
<transport-statistics-great-britain> a ?? ; # Is this a catalog?
  dct:hasPart <aviation> ;
  dct:hasPart <energy-and-environment> ;
  ...
  .

<aviation> a ?? ; # Is this a catalog?
  dct:identifier "TSGB02" ;
  dct:hasPart <air-traffic-at-uk-airports> ;
  dct:hasPart <air-traffic-by-type-of-service-operation-type-and-airport> ;
  ...
  .

<air-traffic-at-uk-airports> a dcat:Dataset, qb:DataSet ;
  dct:identifier "AVI0101" ;
  .

<air-traffic-by-type-of-service-operation-type-and-airport> a dcat:Dataset, qb:DataSet ;
  dct:identifier "AVI0102" ;
  .

# etc.
```

In this model, since each year new data gets added to the dataset, URIs would not change on a yearly basis. I would assume that if a revision was made, they again would replace the file and put some sort of notification on the page. I would also assume that if a user was interested in previous, unrevised versions of the data, they could approach the department directly.
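If previous states of the data did need to be retained, one option would be to keep the stable dataset URI for the latest data and attach dated snapshots using Dublin Core versioning terms. This is a hypothetical sketch, not an agreed model; the snapshot URIs and dates are invented for illustration:

```turtle
# Hypothetical sketch: stable URI for the latest data, with dated
# snapshots retained as separate resources. URIs and dates are invented.
<air-traffic-at-uk-airports> a dcat:Dataset, qb:DataSet ;
  dct:identifier "AVI0101" ;
  dct:hasVersion <air-traffic-at-uk-airports/2020> ;
  dct:hasVersion <air-traffic-at-uk-airports/2021> ;
  .

<air-traffic-at-uk-airports/2021> a dcat:Dataset ;
  dct:issued "2021-12-01"^^xsd:date ;
  dct:replaces <air-traffic-at-uk-airports/2020> ;
  .
```

This would let the department's replace-in-place workflow continue unchanged while still giving users a URI for each earlier state.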

Example 2 (Add new years separately)

BEIS's Fuel poverty statistics contains four levels of nesting - a collection is partitioned by year, each year contains many "datasets", and each "dataset" has many "tables" of downloadable .ods or .xlsx data. Some of the headers give links to .pdf reports.

Each year, the statistics producer appends a new set of downloadable tables. The new tables contain only the latest year of information available.

Structure
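A sketch of how this nesting might be modelled, following the same pattern as the DfT example above. The URIs are invented for illustration:

```turtle
<fuel-poverty-statistics> a ?? ; # Is this a catalog?
  dct:hasPart <fuel-poverty-2020> ;
  dct:hasPart <fuel-poverty-2021> ;
  ...
  .

<fuel-poverty-2021> a ?? ; # A single year's partition
  dct:hasPart <fuel-poverty-detailed-tables-2021> ;
  dct:hasPart <fuel-poverty-supplementary-tables-2021> ;
  ...
  .

<fuel-poverty-detailed-tables-2021> a dcat:Dataset, qb:DataSet ;
  .

# etc.
```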

I'm assuming that splitting out data this way is down to the presentational nature of the tables being produced. It could be that if the producer could be convinced to produce a time series of tidy data that they would be happy to adopt the model that DfT use.

Others

Revisions

Revisions mean many things:

Scheduled Revisions

The Code of Practice T3.9 defines a scheduled revision as a planned amendment to published statistics in order to improve quality by incorporating additional data that were unavailable at the point of initial publication.

Example

Some DfE statistics mark provisional and revised statistics separately and retain both versions on gov.uk.
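One way this could be modelled (a hypothetical sketch; the URIs are invented) is to keep both releases as distinct resources and link the revised release back to the provisional one:

```turtle
# Hypothetical: provisional and revised releases both retained,
# with the revised release linked back to the provisional one.
<pupil-absence-2020-provisional> a dcat:Dataset ;
  .

<pupil-absence-2020-revised> a dcat:Dataset ;
  dct:replaces <pupil-absence-2020-provisional> ;
  .
```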

Unscheduled Corrections

The Code of Practice T3.9 defines an unscheduled correction as an amendment made to published statistics in response to the identification of errors following their initial publication.

Example

Some DWP statistics have been corrected. The corrected files have been uploaded and marked as having been revised, overwriting the incorrect files.

A revision notice has been uploaded indicating to users that a correction has taken place.
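Since the correction overwrites the original file, there may be no separate resource to retain; the notice could instead be attached to the dataset itself. A hypothetical sketch (the URI, date and notice text are invented):

```turtle
# Hypothetical: recording that a correction took place on a dataset
# that was overwritten in place.
<benefit-statistics> a dcat:Dataset ;
  dct:modified "2021-03-01"^^xsd:date ;
  rdfs:comment "Correction: figures for 2019 were amended on 1 March 2021."@en ;
  .
```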

Reference Data

I've only spoken about statistical data here. For reference data (where codelists change over time etc.) XKOS has some potentially helpful resources.
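For example, XKOS lets one classification version point at the version it replaces, and correspondence tables can map codes between versions. A hypothetical sketch (I'd want to check the exact terms against the XKOS spec; the URIs are invented):

```turtle
# Hypothetical sketch of codelist versioning with XKOS.
<sic-2007> a skos:ConceptScheme ;
  xkos:supersedes <sic-2003> ;
  .

<sic-2003-2007-correspondence> a xkos:Correspondence ;
  xkos:compares <sic-2003>, <sic-2007> ;
  xkos:madeOf [
    a xkos:ConceptAssociation ;
    xkos:sourceConcept <sic-2003/01.1> ;
    xkos:targetConcept <sic-2007/01.1> ;
  ] ;
  .
```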

RickMoynihan commented 3 years ago

@rossbowen That's very interesting thanks for the detailed write up.

To briefly summarise the two approaches appear to essentially be:

  1. Drop and Replace
  2. Append only (for new years)

I think it's worth exploring what the implications are for republication in IDP. Obviously the person doing the ETL work needs to know which strategy upstream are using; however, I'd be curious what your thoughts are on whether upstream's strategy needs to affect IDP's strategy for versioning.

In the absence of corrections or revisions, or structural changes to the upstream data, it should be possible to convert Drop and Replace into Append Only, by removing the previously published observations from the new set. As RDF graph merges are idempotent, this can happen for free; but I think there's some value in rejecting duplicate observations in a potential future write API (e.g. on add-slice) and forcing the ETL pipeline authors to assemble more minimal change sets. The principal benefit here would be in helping the data engineer understand sooner what changes upstream actually made. It would also help highlight cases where a revision or correction was sneaked in through the back door, which might itself be an error.
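To illustrate the "minimal change set" idea: for a drop-and-replace source like the DfT example, the delta submitted to a write API would contain only the genuinely new observations rather than the whole cube. A hypothetical sketch (the URIs and dimension properties are invented):

```turtle
# Hypothetical minimal change set: only the new year's observations
# are appended, not a re-submission of the full cube.
<air-traffic-at-uk-airports/obs/2021/heathrow> a qb:Observation ;
  qb:dataSet <air-traffic-at-uk-airports> ;
  <#period> <year/2021> ;
  <#airport> <heathrow> ;
  <#passengers> 1900000 ;
  .
```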

Obviously regardless of the upstream model the harder thing is determining what to do with corrections/revisions, which is a much wider discussion.

Here though, I just wanted to question whether you think upstream's model needs to influence our model, and whether we can, through the platform, tooling, data modelling and engineering, support a more nuanced model of change than upstream provide. Or do we always need to do whatever upstream are doing? I'm clearly gunning for the latter.

canwaf commented 3 years ago

From a dimensionality perspective it would be good to look at a dataset like Prevent1, which in the annual publication provides more detailed present-year-only cubes and less detailed multi-year cubes (i.e. with one fewer dimension than the present-year-only cubes). Presenting the present-year-only cubes as a single multi-year cube combined across publications would be a significant value add for the end user.

Details

For example, in the Prevent1 data for 2019/2020, table 7 provides detailed information on PREVENT1 approaches for the 2019/2020 government-year period only, broken down by local authority; table 6, however, loses the local authority dimension but covers government-year periods from 2014/2015 to 2019/2020 and gives more detailed reasons for PREVENT1 approaches.


There would be a considerable convenience improvement in collating the Table 7 data into a single cube across all annual publications.

Robsteranium commented 3 years ago

During the alpha we:

Robsteranium commented 3 years ago

We've gathered requirements for managing dependencies which touch on versioning.

I collated the data consumer Use Cases from the UR we did in the Alpha as part of the trade proposal, with a summary on GSS-Cogs/family-trade#93.

As I understand from @JohnLewisUR, we've not explored publisher requirements yet.