Open ajtucker opened 4 years ago
Following discussions with DM (LP), the name of this dataset has been changed to Population estimates by output area geographies, England and Wales. This has been added to the notes section of Airtable. The Age Type dimension has been removed, as it was only added because the information is being extracted from Nomis. The contents issue date needs to be changed to 2018 until the revised publication is available within Nomis. I have contacted the data producer to find out whether a copy of the data provided to Nomis is available, and whether that would be a better source.
just catching up on comments:
@Tracey-B - I think Age might be a differentiator, as-in if I take it out we may end up a bazillion observations with the same dimensions, might be wrong but that's what's it looks like.
@LPerryman - I've stuck the "put the data on the cloud" script here: https://github.com/GSS-Cogs/data-streaming but that's as far as I can go with this, it's basically blocked until the second part of "stream big stuff" is prioritised/done.
@mikeAdamss re the following comment:
@Tracey-B - I think Age might be a differentiator, as-in if I take it out we may end up a bazillion observations with the same dimensions, might be wrong but that's what's it looks like.
It is the age type 'Labour Market categories' that has been removed, not Age. Give me a call if you want to discuss.
thanks Tracey, that makes perfect sense, I misunderstood/misread your comment, all good 👍
Senor Mike A has pulled in the data and separated out into 21 files https://console.cloud.google.com/storage/browser/pipeline-stream-population-estimates;tab=objects?forceOnBucketsSortingFiltering=false&authuser=0&project=optimum-bonbon-257411&prefix=&forceOnObjectsSortingFiltering=false
Senor Leigh has pulled in those files, formatted them into a single cube and output to the pipeline folder as a zipped file. This was done in the main.py script but has been commented out so it does not run in Jenkins. https://github.com/GSS-Cogs/family-towns-and-high-streets/tree/master/datasets/ONS-Population-estimates-by-output-areas-electoral-health-and-other-geographies-England-and-Wales/out
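For anyone following along, combining the part files into one zipped cube could look roughly like the sketch below. This is not the code in main.py: the function name, the column names and the archive member name are all made up for illustration, and it assumes every part file shares the same columns and fits in memory once concatenated.

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_to_zipped_csv(part_paths, out_zip, member_name="observations.csv"):
    """Concatenate CSV part files and write the result as a single zipped CSV.

    Hypothetical helper: assumes all parts have identical columns.
    """
    frames = [pd.read_csv(path) for path in part_paths]
    cube = pd.concat(frames, ignore_index=True)

    out_zip = Path(out_zip)
    out_zip.parent.mkdir(parents=True, exist_ok=True)
    # pandas writes the zip archive itself; archive_name sets the CSV's name inside it.
    cube.to_csv(
        out_zip,
        index=False,
        compression={"method": "zip", "archive_name": member_name},
    )
    return cube


def demo():
    """Build two tiny part files in a temp directory, combine them, read the zip back."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        parts = []
        sample_rows = [
            [("E00000001", 10)],
            [("E00000002", 20), ("E00000003", 30)],
        ]
        for i, rows in enumerate(sample_rows):
            part = tmp / f"part-{i}.csv"
            # Illustrative column names, not the real cube's schema.
            pd.DataFrame(rows, columns=["Geography", "Value"]).to_csv(part, index=False)
            parts.append(part)

        out_zip = tmp / "out" / "observations.zip"
        combine_to_zipped_csv(parts, out_zip)
        # read_csv infers zip compression from the .zip suffix.
        return pd.read_csv(out_zip)


if __name__ == "__main__":
    print(demo())
```

For the real 5 million odd observations it may be worth reading and appending in chunks instead of holding every frame in memory at once.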
The CSVWMapping class needs a CSV file to create the .json and .trig files, so a dummy dataframe (matching the main dataset) is created and output. Once CSVWMapping() has finished, the file is deleted. https://github.com/GSS-Cogs/family-towns-and-high-streets/blob/master/datasets/ONS-Population-estimates-by-output-areas-electoral-health-and-other-geographies-England-and-Wales/main.py
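The shape of that workaround, as a rough sketch: write a placeholder CSV with the right header, hand its path to the metadata step, and remove the file afterwards even if that step fails. The gss-utils call itself is deliberately left out; generate_metadata below is a hypothetical callable standing in for it, and the column names are only illustrative.

```python
import csv
import os
import tempfile


def with_dummy_csv(columns, generate_metadata):
    """Write a one-row placeholder CSV, pass its path to generate_metadata, then delete it.

    generate_metadata is a hypothetical stand-in for the real mapping step;
    it receives the path of the temporary CSV and its return value is passed through.
    """
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(columns)                 # header matching the main dataset
            writer.writerow(["" for _ in columns])   # single empty placeholder row
        return generate_metadata(path)
    finally:
        # Always clean up, so the dummy file is never uploaded by mistake.
        os.remove(path)


def read_header(path):
    """Toy metadata step for the demo: just report the CSV header it was given."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle))


if __name__ == "__main__":
    # Illustrative column names, not necessarily the real cube's schema.
    print(with_dummy_csv(["Geography", "Sex", "Age", "Value"], read_header))
```

Using try/finally means the placeholder is removed whether or not the metadata step raises.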
Senor Alex is currently coding a Groovy script to take account of the zipped files.
Thanks @LPerryman !
The CSVWMapping workaround has an issue at GSS-Cogs/gss-utils#76 that wouldn't be too hard to implement.
The Groovy script has run and uploaded the 5 million odd observations to PMD. The draft looks ok, apart from issues below. I'm currently attempting to publish the results so we can all see.
- The values have .0 decimals, despite us saying the column values are integers. I think this is down to Pandas outputting the values in CSV as floats, so we should coerce the column to integer before outputting.
- [...] out directory and it should get uploaded.
- [...] skos:notation rather than their rdfs:label, which is odd.
The code that creates the zipped data cube has been rerun and the Value column converted to integer. The Age codelist has also been slightly changed (Aged 16 -> Aged 16 plus).
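For reference, a minimal sketch of the integer coercion, assuming the column is called Value and only ever holds whole numbers (or blanks). The helper names and the sample rows are made up for illustration; this is not the code that was actually rerun.

```python
import io

import pandas as pd


def value_as_integer(df, column="Value"):
    """Return a copy of df with `column` coerced to pandas' nullable Int64 dtype.

    Int64 keeps missing values as <NA> instead of forcing the column back to float,
    and the cast raises if any value is not a whole number.
    """
    out = df.copy()
    out[column] = pd.to_numeric(out[column]).astype("Int64")
    return out


def to_csv_text(df):
    """Serialise a frame to CSV text so the written values can be inspected."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    return buffer.getvalue()


if __name__ == "__main__":
    # Float dtype is what produces the trailing .0 in the CSV output.
    df = pd.DataFrame({"Age": ["Aged 16 plus", "Total"], "Value": [12.0, 7.0]})
    print(to_csv_text(df))                    # values written as 12.0 and 7.0
    print(to_csv_text(value_as_integer(df)))  # values written as 12 and 7
```

The nullable Int64 dtype is used here rather than plain int so that a blank observation does not silently turn the whole column back into floats.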
Data has been published to PMD4: Population estimates by Output Area Geographies, Gender and Age, England and Wales https://staging.gss-data.org.uk/cube/explore?uri=http%3A%2F%2Fgss-data.org.uk%2Fdata%2Fgss_data%2Ftowns-high-streets%2Fons-population-estimates-by-output-areas-electoral-health-and-other-geographies-england-and-wales-catalog-entry
Does the publisher call this dimension "gender"? If so, we should change this to "parent" so that the column label is right.
This will stop the filter working for the time being, as we need to provide a qb:codeList for these sub-property dimensions when none is provided, see GSS-Cogs/csvcubed#502.