Swirrl / ons-data-export

Temporary repo to keep track of the extraction of data between the PMD3 backed alpha for the COGS project, and the PMD4 staging server.

"Missing" datasets #39

Closed: jennet closed 4 years ago

jennet commented 4 years ago

The data extraction done on 2020-04-23 (#33) was a full refresh of the data, which meant that it did not include these datasets:

Number of datasets in PMD4 not in the extraction list: 7
http://gss-data.org.uk/data/gss_data/trade/ons-fdi
http://gss-data.org.uk/data/gss_data/trade/ons-uk-trade-in-goods-by-industry-country-and-commodity
http://gss-data.org.uk/data/gss_data/trade/hmrc_trade
http://gss-data.org.uk/data/gss_data/trade/hmrc_rts
http://gss-data.org.uk/data/ons-pink-book-chapter-3
http://gss-data.org.uk/data/ons-bop-individual-country-data
http://gss-data.org.uk/data/gss_data/health/hmrc_alcohol_bulletin
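A count like the one above can be reproduced with a plain set difference. This is only a minimal sketch: the function name is hypothetical, and the short `extraction` list is made up for illustration (the real extraction list is not shown in this issue).

```python
def missing_from_extraction(pmd4_uris, extraction_uris):
    """Return the PMD4 dataset URIs absent from the extraction list, sorted."""
    return sorted(set(pmd4_uris) - set(extraction_uris))


# A few of the dataset URIs from this issue, standing in for the PMD4 side.
pmd4 = [
    "http://gss-data.org.uk/data/gss_data/trade/ons-fdi",
    "http://gss-data.org.uk/data/gss_data/trade/hmrc_rts",
    "http://gss-data.org.uk/data/ons-pink-book-chapter-3",
]
# Hypothetical extraction list, for illustration only.
extraction = ["http://gss-data.org.uk/data/ons-pink-book-chapter-3"]

missing = missing_from_extraction(pmd4, extraction)
print(f"Number of datasets in PMD4 not in the extraction list: {len(missing)}")
for uri in missing:
    print(uri)
```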

Looking at the list, some appear to be due to dataset URI changes:

The alcohol bulletin dataset was originally included because it was on the first list of datasets sent over. Since then, we decided that the datasets listed with a Publish Status URL would be the source of truth that Swirrl uses for modified dates and dataset URLs.

This also seems to have affected the inclusion of the datasets:

Can ONS please confirm, for each of these datasets, whether we should be retrieving updated data from PMD3? That is, what are their last modified dates and URLs on PMD3?

ajtucker commented 4 years ago

Some of these are/were missing from the table at https://gss-cogs.github.io/family-trade/datasets/ as we don't necessarily have a 1-1 between published spreadsheets and data cubes.

For data cubes in PMD where there is more than one landing page, we need to a) ensure that all the landing pages are listed in the metadata (DCAT); b) ensure that we match up the landing pages used in fetching/transforming the data with the landing pages listed in Airtable; c) update the tabular view to take all this into account.
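Step b) could look roughly like the sketch below. It assumes the landing pages for each cube have already been read from the DCAT metadata into a dict, and that the Airtable rows are available as dicts with `name` and `landing_page` keys. Those shapes, the function name, and the example.org URLs are all hypothetical, not what the actual code uses.

```python
def match_by_landing_page(cube_landing_pages, table_rows):
    """Map each cube URI to the names of table rows sharing any landing page.

    cube_landing_pages: dict of cube URI -> iterable of landing page URLs
    table_rows: list of dicts with "name" and "landing_page" keys (assumed shape)
    """
    matches = {}
    for cube_uri, pages in cube_landing_pages.items():
        page_set = set(pages)
        matches[cube_uri] = [
            row["name"] for row in table_rows if row["landing_page"] in page_set
        ]
    return matches


# Made-up data: one cube built from two published spreadsheets, one unmatched cube.
cubes = {
    "http://example.org/data/cube-a": [
        "http://example.org/landing/one",
        "http://example.org/landing/two",
    ],
    "http://example.org/data/cube-b": ["http://example.org/landing/three"],
}
rows = [
    {"name": "Spreadsheet One", "landing_page": "http://example.org/landing/one"},
    {"name": "Spreadsheet Two", "landing_page": "http://example.org/landing/two"},
]

matched = match_by_landing_page(cubes, rows)
unmatched = [uri for uri, names in matched.items() if not names]
```

Cubes that end up in `unmatched` are the ones that would not show up against any row in the table.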

I've done all this for the ONS-FDI and updated the tabular view code to cope with potentially multiple dcat:landingPages, so will look into the rest.

ajtucker commented 4 years ago

For http://gss-data.org.uk/data/gss_data/health/hmrc_alcohol_bulletin, we've not included this in https://gss-cogs.github.io/family-trade/datasets/ nor have we worked on it or reviewed it.

However, if possible, we'd like to include the snapshot that was done last year and published in PMD3. Does it require much manual fiddling?

jennet commented 4 years ago

> However, if possible, we'd like to include the snapshot that was done last year and published in PMD3. Does it require much manual fiddling?

I'd extracted it for a previous iteration, so I could copy over what data I had and keep track of how long it takes to manually add it. I'll add a separate issue - #43

Edit: Alcohol Duty dataset is now on cogs-staging

jennet commented 4 years ago

ons-fdi is now on cogs-staging

jennet commented 4 years ago

Added Overseas Trade Statistics (CN8) and REGIONAL TRADE STATISTICS to cogs-staging

ajtucker commented 4 years ago

I've been through the table and ensured that everything now links up; all datasets should now show up in the increasingly misnamed "Publish Status" column.

We've moved the HMRC Alcohol Bulletin dataset into the list too, but haven't yet made any particular changes.

Let us know if you see any other mismatches or missing datasets.

jennet commented 4 years ago

@ajtucker The URL for "HMRC Trade in Goods" looks wrong: http://gss-data.org.uk/dataset?uri=http%3A%2F%2Fgss-data.org.uk%2Fdata%2Fgss_data%2Ftrade%2Fhmrc_trade%2Fobservations. I get an error when trying to reach it on PMD3.

Shall I remove the %2Fobservations from the end of that URL?
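For reference, removing that suffix from the encoded `uri` parameter could be done as in this sketch with Python's standard `urllib.parse`. The function name is hypothetical, and whether the resulting URL actually resolves on PMD3 is not something the code checks.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_uri_suffix(url, suffix="/observations"):
    """Drop `suffix` from the end of the decoded `uri` query parameter, if present."""
    parts = urlsplit(url)
    params = [
        (key, value[: -len(suffix)] if key == "uri" and value.endswith(suffix) else value)
        for key, value in parse_qsl(parts.query)
    ]
    # urlencode percent-encodes the parameter again (":" -> %3A, "/" -> %2F).
    return urlunsplit(parts._replace(query=urlencode(params)))


url = (
    "http://gss-data.org.uk/dataset?uri="
    "http%3A%2F%2Fgss-data.org.uk%2Fdata%2Fgss_data%2Ftrade%2Fhmrc_trade%2Fobservations"
)
fixed = strip_uri_suffix(url)
print(fixed)
```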

jennet commented 4 years ago

From issue #50 : Number of datasets in PMD4 not in the extraction list: 0