GSS-Cogs / sprint-planning


Versioning #2

Open rossbowen opened 3 years ago

rossbowen commented 3 years ago

Multiple versions (revisions etc.) of data on PMDv4, and the capability to transform them.

From Rob T:

Under the current way of working, we only make available the latest version of a publication. Under a full publishing model we will need to take into account:

During Q1 2021 we need to consider these items and create a draft technical proposal for how they will be handled.

This has dependencies on things like the Metadata maturity, DOIs etc which may affect the ability to deliver a meaningful proposal, in which case this will move to Q2 2021.

rossbowen commented 3 years ago

Some initial thoughts:

Versioning

Some thoughts around versioning.

Three situations immediately come to mind:

Notes from Code of Practice

The Code links to this case study on handling revisions and corrections. Many organisations in government have published their revisions and corrections procedures. Example.

Neither the Code nor the 2007 legislation gives explicit instruction on the requirements for handling corrections or revisions.

Best practice has been established: departments publish a corrections and revisions policy clearly outlining the steps they will take.

Other pieces of legislation, such as Freedom of Information, may mean departments are required, on request, to provide unrevised versions of their statistics to the public domain. It is up for debate whether this becomes an explicit requirement of the service, or whether it would be handled by departments outside of the service.

Updates

A publisher releases many statistical tables as a "compendium release".

Two approaches being seen:

Example 1 (Replacement)

DfT's Transport Statistics Great Britain contains three levels of nesting - a collection contains many "datasets", and each "dataset" contains many "tables" of downloadable .ods data.

Each year, the statistics producer replaces each of the downloadable tables with a new table containing all previous data plus latest year of data available.

DfT have actually created their own interactive data catalog for this release.

I can't find previous versions of sheets on gov.uk.

Structure
Data Model
```turtle
<transport-statistics-great-britain> a ?? ; # Is this a catalog?
  dct:hasPart <aviation> ;
  dct:hasPart <energy-and-environment> ;
  ...
  .

<aviation> a ?? ; # Is this a catalog?
  dct:identifier "TSGB02" ;
  dct:hasPart <air-traffic-at-uk-airports> ;
  dct:hasPart <air-traffic-by-type-of-service-operation-type-and-airport> ;
  ...
  .

<air-traffic-at-uk-airports> a dcat:Dataset, qb:DataSet ;
  dct:identifier "AVI0101" ;
  .

<air-traffic-by-type-of-service-operation-type-and-airport> a dcat:Dataset, qb:DataSet ;
  dct:identifier "AVI0102" ;
  .

# etc.
```

In this model, since each year new data gets added to the dataset, URIs would not change on a yearly basis. I would assume that if a revision was made, they again would replace the file and put some sort of notification on the page. I would also assume that if a user was interested in previous, unrevised versions of the data, they could approach the department directly.
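If previous states of the data did need to be retained, one option would be to keep the stable dataset URI for the latest data and attach dated snapshots using Dublin Core versioning terms. This is a hypothetical sketch, not an agreed model; the snapshot URIs and dates are invented for illustration:

```turtle
# Hypothetical sketch: stable URI for the latest data, with dated
# snapshots retained as separate resources. URIs and dates are invented.
<air-traffic-at-uk-airports> a dcat:Dataset, qb:DataSet ;
  dct:identifier "AVI0101" ;
  dct:hasVersion <air-traffic-at-uk-airports/2020> ;
  dct:hasVersion <air-traffic-at-uk-airports/2021> ;
  .

<air-traffic-at-uk-airports/2021> a dcat:Dataset ;
  dct:issued "2021-12-01"^^xsd:date ;
  dct:replaces <air-traffic-at-uk-airports/2020> ;
  .
```

This would let the department's replace-in-place workflow continue unchanged while still giving users a URI for each earlier state.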

Example 2 (Add new years separately)

BEIS's Fuel poverty statistics contains four levels of nesting - a collection is partitioned by year, each year contains many "datasets", and each "dataset" has many "tables" of downloadable .ods or .xlsx data. Some of the headers give links to .pdf reports.

Each year, the statistics producer appends a new set of downloadable tables. The new tables contain only the latest year of information available.

Structure
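A sketch of how this nesting might be modelled, following the same pattern as the DfT example above. The URIs are invented for illustration:

```turtle
<fuel-poverty-statistics> a ?? ; # Is this a catalog?
  dct:hasPart <fuel-poverty-2020> ;
  dct:hasPart <fuel-poverty-2021> ;
  ...
  .

<fuel-poverty-2021> a ?? ; # A single year's partition
  dct:hasPart <fuel-poverty-detailed-tables-2021> ;
  dct:hasPart <fuel-poverty-supplementary-tables-2021> ;
  ...
  .

<fuel-poverty-detailed-tables-2021> a dcat:Dataset, qb:DataSet ;
  .

# etc.
```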

I'm assuming that splitting out data this way is down to the presentational nature of the tables being produced. It could be that if the producer could be convinced to produce a time series of tidy data that they would be happy to adopt the model that DfT use.

Others

Revisions

Revisions mean many things:

Scheduled Revisions

The Code of Practice T3.9 defines a scheduled revision as a planned amendment to published statistics in order to improve quality by incorporating additional data that were unavailable at the point of initial publication.

Example

Some DfE statistics mark provisional and revised statistics separately and retain both versions on gov.uk.
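One way this could be modelled (a hypothetical sketch; the URIs are invented) is to keep both releases as distinct resources and link the revised release back to the provisional one:

```turtle
# Hypothetical: provisional and revised releases both retained,
# with the revised release linked back to the provisional one.
<pupil-absence-2020-provisional> a dcat:Dataset ;
  .

<pupil-absence-2020-revised> a dcat:Dataset ;
  dct:replaces <pupil-absence-2020-provisional> ;
  .
```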

Unscheduled Corrections

The Code of Practice T3.9 defines an unscheduled correction as an amendment made to published statistics in response to the identification of errors following their initial publication.

Example

Some DWP statistics have been corrected. The corrected files have been uploaded and marked as having been revised, overwriting the incorrect files.

A revision notice has been uploaded indicating to users that a correction has taken place.
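Since the correction overwrites the original file, there may be no separate resource to retain; the notice could instead be attached to the dataset itself. A hypothetical sketch (the URI, date and notice text are invented):

```turtle
# Hypothetical: recording that a correction took place on a dataset
# that was overwritten in place.
<benefit-statistics> a dcat:Dataset ;
  dct:modified "2021-03-01"^^xsd:date ;
  rdfs:comment "Correction: figures for 2019 were amended on 1 March 2021."@en ;
  .
```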

Reference Data

I've only spoken about statistical data here. For reference data (where codelists change over time etc.) XKOS has some potentially helpful resources.
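For example, XKOS lets one classification version point at the version it replaces, and correspondence tables can map codes between versions. A hypothetical sketch (I'd want to check the exact terms against the XKOS spec; the URIs are invented):

```turtle
# Hypothetical sketch of codelist versioning with XKOS.
<sic-2007> a skos:ConceptScheme ;
  xkos:supersedes <sic-2003> ;
  .

<sic-2003-2007-correspondence> a xkos:Correspondence ;
  xkos:compares <sic-2003>, <sic-2007> ;
  xkos:madeOf [
    a xkos:ConceptAssociation ;
    xkos:sourceConcept <sic-2003/01.1> ;
    xkos:targetConcept <sic-2007/01.1> ;
  ] ;
  .
```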

RickMoynihan commented 3 years ago

@rossbowen That's very interesting thanks for the detailed write up.

To briefly summarise the two approaches appear to essentially be:

  1. Drop and Replace
  2. Append only (for new years)

I think it's worth exploring what the implications are for republication in IDP. Obviously the person doing the ETL work needs to know which strategy upstream are using; however, I'd be curious what your thoughts are on whether upstream's strategy needs to affect IDP's strategy for versioning.

In the absence of corrections or revisions, or structural changes to the upstream data, it should be possible to convert Drop and Replace into Append Only, by removing the previously published observations from the new set. As RDF graph merges are idempotent, this can happen for free; but I think there's some value in rejecting duplicate observations in a potential future write API (e.g. on add-slice) and forcing the ETL pipeline authors to assemble more minimal change sets. The principal benefit here would be in helping the data engineer understand sooner what changes upstream actually made. It would also help highlight cases where a revision or correction was sneaked in through the back door, which might itself be an error.
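To illustrate the "minimal change set" idea: for a drop-and-replace source like the DfT example, the delta submitted to a write API would contain only the genuinely new observations rather than the whole cube. A hypothetical sketch (the URIs and dimension properties are invented):

```turtle
# Hypothetical minimal change set: only the new year's observations
# are appended, not a re-submission of the full cube.
<air-traffic-at-uk-airports/obs/2021/heathrow> a qb:Observation ;
  qb:dataSet <air-traffic-at-uk-airports> ;
  <#period> <year/2021> ;
  <#airport> <heathrow> ;
  <#passengers> 1900000 ;
  .
```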

Obviously regardless of the upstream model the harder thing is determining what to do with corrections/revisions, which is a much wider discussion.

Here though, I just wanted to question whether you think upstream's model needs to influence our model, and whether we can, through the platform, tooling, data modelling and engineering, support a more nuanced model of change than upstream provide. Or do we always need to do whatever upstream are doing? I'm clearly gunning for the latter.

canwaf commented 3 years ago

From a dimensionality perspective it would be good to look at a dataset like Prevent1, which in the annual publication provides more detailed present-year-only cubes and less detailed multi-year cubes (i.e. with one fewer dimension than the present-year-only cubes). Presenting the present-year-only cubes as a single multi-year cube combined across publications would be a significant value add for the end user.

Details

For example, in the Prevent1 data for 2019/2020, table 7 provides detailed information on PREVENT1 approaches for the 2019/2020 government-year period only, broken down by local authority; table 6, however, loses the local authority dimension but covers government-year periods from 2014/2015 to 2019/2020 and gives more detailed reasons for PREVENT1 approaches.


There would be a considerable convenience improvement in collating the Table 7 data into a single cube across all annual publications.

Robsteranium commented 3 years ago

During the alpha we:

Robsteranium commented 3 years ago

We've gathered requirements for managing dependencies which touch on versioning.

I collated the data consumer Use Cases from the UR we did in the Alpha as part of the trade proposal, with a summary on GSS-Cogs/family-trade#93.

As I understand from @JohnLewisUR, we've not explored publisher requirements yet.