ajtucker closed this issue 3 years ago
Didn't take CDID (we could with some work, but it was quite a lot of work and I didn't know if there was any point).
Extracted tables A-H as per the data supplier's breakdowns (everything else they class as "other tables"). There are tons of ridiculous layered column headers throughout, so I gave it a best guess.
There is an issue with data markings not being stripped from observations. I can fix this manually for the rework, but it may (or may not) be something that should happen in databaker, so I have raised an issue here: https://github.com/GSS-Cogs/databaker/issues/1 and added it to the backlog.
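A minimal sketch of stripping data markings from observation values by hand, in case the fix ends up living in the transform rather than in databaker. The cell shape (a number optionally followed by a marking) and the name `split_data_marking` are assumptions for illustration, not databaker behaviour.

```python
import re

# Assumed cell shape: an optional number followed by an optional marking.
_OBS = re.compile(r"^\s*(-?\d[\d,]*\.?\d*)?\s*(.*?)\s*$", re.DOTALL)

def split_data_marking(raw):
    """Split a raw cell such as '1,234.5 p' into (1234.5, 'p')."""
    match = _OBS.match(str(raw))
    number, marking = match.group(1), match.group(2)
    # Cells that hold only a marking (for example '..') have no numeric value.
    value = float(number.replace(",", "")) if number else None
    return value, marking
```

Keeping the marking as a separate value, rather than discarding it, means it could later go into its own column if that turns out to be wanted.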
Asked R for clarification around 4QR terminology and I'm looking for detailed economic info to link to.
A quick write-up on the first four cubes:
Questions for further discussion @mikeAdamss - give me a shout if any of this needs clarifying as I'm popping a bunch of notes on here in case anyone else needs to pick it up.
All cubes
QA checklist:
[x] Does the dataset show up or is there an error? Yes for all cubes
[x] Is it in the list of datasets from the main page? Yes for all cubes
[x] Is there descriptive metadata on the search page? Yes for all cubes
[ ] Does transformed info match the original?
[ ] Does the tidydata download show all the data?
[ ] Note differences in column/row headers/info
ONS geography code for UK added.
Compensation of employees and Gross operating surplus of corporations have been listed under income category.
Income indicator lists the information originally nested under income category, i.e. 'Wages and salaries', 'Employer's social contributions'.
Current prices and chained volume measures have been added as an Estimate type dimension, as the datasets match up.
Footnotes are consistent in this cube for adding metadata
Analysis by asset and Analysis by sector have been grouped under Analysis column
Each tab - chained volume prices and current prices come under Estimate Type
Percentage change not included.
Seasonally adjusted and all four quarters pulled into transform
Category of output tab B1 and Tab B2 Service industry grouped under industrial sector - Category of output has been listed as 'Not Specified' for Tab B1 info.
[x] Are there multiple cubes? List titles of cubes
[ ] What needs differentiating? Totals
[ ] Are there titles that need harmonising? World/Worldwide
[ ] Does the structure look sensible?
[ ] Does the hierarchy work?
[ ] What needs further investigation or context?
CUBE 3 Quarterly National Accounts, GDP – data tables: Gross fixed capital formation. Missing Intellectual property products and Total, and it is pulling in Public corporation dwellings and Private sector dwellings from Analysis by sector instead of just Analysis by asset. This issue may affect other cubes, so we will need to go back and re-check.
All cubes: percentage change data has been taken out, assuming it's derivative? If the data is to be included, we need to add a new dimension (seasonally adjusted, or percentage change, latest year on previous year) rather than adding metadata to the ref period dimension.
Where does it make logical sense to add `Seasonally adjusted`, `Percentage increase` etc? Would this work as a new dimension?
[ ] Any duplications?
[ ] List any detailed metadata/methodology to add retrospectively
Future work required:
Decision:
Notes:
CVM meaning: Chained volume measure - series of GDP stats adjusted for inflation to give a real measure of GDP.
CP - current price https://www.economicshelp.org/blog/7397/economics/gdp-at-chained-volume-measure/
Definitions info https://www.gov.uk/government/statistics/final-gdp-cp-and-cvm-quarterly-and-annual-estimates-1997-2013
COICOP (Classification of Individual Consumption According to Purpose) - categories of expenditure by individuals. Relates to cube 5 Quarterly National Accounts, GDP – data tables: Household expenditure indicators.
BA have reviewed one outstanding issue: series still shows "1Q GR" type columns rather than the correct "Quarter on Quarter" type values
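A minimal sketch of the relabelling this implies. Only the "1Q GR" to "Quarter on Quarter" pair comes from this thread; the function name and the example column names are hypothetical, and the other codes would need confirming before they are added to the mapping.

```python
# Only this pair is taken from the thread; extend once other codes are confirmed.
PERIOD_LABELS = {"1Q GR": "Quarter on Quarter"}

def relabel(columns, labels=PERIOD_LABELS):
    """Replace known codes with readable labels, leaving other columns untouched."""
    return [labels.get(col.strip(), col) for col in columns]
```

For example, `relabel(["Period", "1Q GR"])` gives `["Period", "Quarter on Quarter"]`.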
Need to summarise all the QA concerns into one list.
I've made the concrete changes I could on this one and documented them as best I can below. This is a complicated one and the "multiple reviewers and no final reviewer" approach hasn't worked particularly well, so happy to take any further steers.
Changes made:
Didn't do:
`Year on year`, `Quarter on quarter` etc is the percentage change (if that's not clear enough with the new labelling we could maybe add a unit of measure column... if we really, really had to; the measures here are already complicated tbh).
Swirrl pulling through updates onto PMDv4.
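For anyone picking this up, a minimal sketch of how quarter on quarter and year on year percentage changes relate to the underlying quarterly values. The figures are made up for illustration and the function name is hypothetical; rounding to one decimal place is an assumption, not the supplier's method.

```python
def percentage_change(values, lag):
    """Percentage change of each value against the one `lag` periods earlier."""
    changes = []
    for i, current in enumerate(values):
        if i < lag or values[i - lag] == 0:
            changes.append(None)  # no comparable earlier period
        else:
            previous = values[i - lag]
            changes.append(round((current - previous) / previous * 100, 1))
    return changes

# Illustrative quarterly values only, not real data.
quarterly = [100.0, 102.0, 101.0, 104.0, 110.0]
quarter_on_quarter = percentage_change(quarterly, lag=1)  # against previous quarter
year_on_year = percentage_change(quarterly, lag=4)        # against same quarter last year
```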
Measures values and declared measures type issue to be discussed before this can be closed. Potential problem with CDID code duplications.
When I run `main.py` from the command line the script returns `Killed`.
@rossbowen - that'll be hitting a system resource threshold (it's a big one); the operating system will kill the process if it goes over certain limits, typically memory. You'll need to try shutting down everything you don't need (possibly worth restarting first as well) before running. It'll still take a while so one to kick off before going to lunch or somesuch.
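To see how close a run gets to the limit, peak memory can be logged from inside the script. A minimal sketch using the standard library `resource` module (Unix only); the function name is hypothetical and where to call it in `main.py` is left open.

```python
import resource
import sys

def peak_memory_mb():
    """Peak resident memory of this process so far, in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor

print(f"peak memory so far: {peak_memory_mb():.0f} MB")
```

Printing this after each table is processed would show which step pushes memory up the most.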
Moved this one back over. Confirmed it runs to completion on my machine, so it's a laptop resource issue.
@mikeAdamss unsure how to approach this one. Looks like there are lots of `.csv` files being output with differing structures. I'm guessing each one will need its own `info.json`?
@rossbowen - missed this ping sorry.
iirc there's a lot going on here but most datacubes should have at least some dimensions in common, so I think it's one column mapping/info.json (if it works like I think it does).
It's an important output we've never gotten our heads around, so might be worth pairing up maybe? We can make code tweaks as we squeeze some sense out of it.
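A quick way to check how much the outputs have in common before deciding on one column mapping is to compare the CSV headers. This is a sketch only: the function name is hypothetical, and it takes CSV text rather than file paths so it does not assume where the pipeline writes its outputs.

```python
import csv
import io

def common_columns(csv_texts):
    """Columns present in every CSV header, in the order of the first file."""
    headers = [next(csv.reader(io.StringIO(text))) for text in csv_texts]
    shared = set(headers[0]).intersection(*headers[1:])
    return [col for col in headers[0] if col in shared]
```

Any column that does not come back from this is specific to one or more cubes and would need its own entry in the mapping.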
Sorry to be a pain but this needs too much fiddling to get right and will be a complete mess in the end. Can we start over and just pull in each sheet as it is without adding anything, and add each sheet/table to a list rather than output to a cube at the moment? So the first item in the list will be table A1, second A2, third B1 etc. I can then go through and see if things can be joined or output as they are.
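A minimal sketch of the "list of tables" idea, assuming the sheets have already been loaded into a mapping of sheet name to table (for instance, `pandas.read_excel(path, sheet_name=None, header=None)` returns such a dict of DataFrames). The function names are hypothetical and nothing here reshapes the tables.

```python
import re

def natural_key(name):
    """Sort key so that 'A2' comes before 'A10' and 'A*' before 'B*'."""
    return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", name)]

def tables_as_list(sheets):
    """Turn a {sheet name: table} mapping into a list ordered A1, A2, B1, ..."""
    return [sheets[name] for name in sorted(sheets, key=natural_key)]
```

The natural sort matters because a plain string sort would put `A10` before `A2`.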
Have published on PMD4 as multiple datasets but some periods are still showing as URIs (1948 to 1959). The periods have been picked up by the ref_periods pipeline and a periods codelist is being created when the quarterly national accounts pipeline runs but labels for some of the periods are not being created properly.
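As an illustration of the missing step, a sketch of deriving a label from a period URI. The URI shapes used here (a path ending in `/year/1948` or `/quarter/1948-Q1`) and the label format are assumptions, not the behaviour of the ref_periods pipeline.

```python
def period_label(uri):
    """Human-readable label for a period URI; None if the shape is not recognised."""
    parent, _, code = uri.rstrip("/").rpartition("/")
    kind = parent.rpartition("/")[2]
    if kind == "year" and code.isdigit():
        return code
    if kind == "quarter" and "-Q" in code:
        year, quarter = code.split("-Q", 1)
        return f"Q{quarter} {year}"
    return None  # leave unlabelled so the gap stays visible
```

Returning `None` for unrecognised shapes makes it easy to list exactly which periods still lack labels.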
This has been published and checked. Closing issue as gsscogs-bot issues will be dealt with separately.
https://github.com/GSS-Cogs/family-trade/tree/master/datasets/ONS-Quarterly-National-Accounts