ONS-Population-estimates-by-output-areas-electoral-health-and-other-geographies-England-and-Wales

ajtucker commented 4 years ago

Does the publisher call this dimension "gender"? If so, we should change this to "parent" so that the column label is right.

This will stop the filter working for the time being, as we need to provide a qb:codeList for these sub-property dimensions when none is provided, see GSS-Cogs/csvcubed#502.

Tracey-B commented 4 years ago

Following discussions with DM (LP), the name of this dataset has been changed to "Population estimates by output area geographies, England and Wales". This has been added to the notes section of Airtable. The Age Type dimension has been removed, as it was only added because the information is being extracted from Nomis. The contents issue date needs to be changed to 2018 until the revised publication is available within Nomis. I have contacted the data producer to find out whether a copy of the data provided to Nomis is available and whether it would be a better source.

mikeAdamss commented 4 years ago

just catching up on comments:

@Tracey-B - I think Age might be a differentiator, as in if I take it out we may end up with a bazillion observations with the same dimensions. Might be wrong, but that's what it looks like.

@LPerryman - I've stuck the "put the data on the cloud" script here: https://github.com/GSS-Cogs/data-streaming but that's as far as I can go with this. It's basically blocked until the second part of "stream big stuff" is prioritised/done.

Tracey-B commented 4 years ago

@mikeAdamss re the following comment:

@Tracey-B - I think Age might be a differentiator, as in if I take it out we may end up with a bazillion observations with the same dimensions. Might be wrong, but that's what it looks like.

It is the age type 'Labour Market categories' that has been removed, not Age. Give me a call if you want to discuss.

mikeAdamss commented 4 years ago

thanks Tracey, that makes perfect sense, I misunderstood/misread your comment, all good 👍

LPerryman commented 4 years ago

Senor Mike A has pulled in the data and separated out into 21 files https://console.cloud.google.com/storage/browser/pipeline-stream-population-estimates;tab=objects?forceOnBucketsSortingFiltering=false&authuser=0&project=optimum-bonbon-257411&prefix=&forceOnObjectsSortingFiltering=false

Senor Leigh has pulled in those files, formatted them into a single cube and output to the pipeline folder as a zipped file. This was done in the main.py script but has been commented out so it does not run in Jenkins. https://github.com/GSS-Cogs/family-towns-and-high-streets/tree/master/datasets/ONS-Population-estimates-by-output-areas-electoral-health-and-other-geographies-England-and-Wales/out
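For anyone picking this up later, the "many files into one zipped cube" step can be sketched roughly as below. This is only an illustration, not the code in main.py: the two small dataframes stand in for the 21 downloaded files, and the column names, file names and temporary output directory are all assumptions.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Stand-ins for the separate files pulled down from cloud storage.
# Column names and values are made up for illustration.
parts = [
    pd.DataFrame({"Geography": ["E00000001", "E00000002"], "Value": [10, 20]}),
    pd.DataFrame({"Geography": ["W00000001", "W00000002"], "Value": [30, 40]}),
]

# Stack the parts into one cube with a fresh index.
cube = pd.concat(parts, ignore_index=True)

# Write a single zipped CSV; a temporary directory is used here
# in place of the pipeline's "out" folder.
out_dir = tempfile.mkdtemp()
zip_path = os.path.join(out_dir, "observations.zip")
cube.to_csv(
    zip_path,
    index=False,
    compression={"method": "zip", "archive_name": "observations.csv"},
)

# Confirm what ended up inside the archive.
with zipfile.ZipFile(zip_path) as archive:
    archive_names = archive.namelist()
```

Writing the zip straight from `to_csv` avoids leaving a large uncompressed CSV on disk alongside the archive.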

The CSVWMapping class needs a CSV file to create the .json and .trig files, so a dummy dataframe (matching the main dataset) is created and written out. Once CSVWMapping() has finished, the file is deleted. https://github.com/GSS-Cogs/family-towns-and-high-streets/blob/master/datasets/ONS-Population-estimates-by-output-areas-electoral-health-and-other-geographies-England-and-Wales/main.py
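The dummy-file workaround looks roughly like the sketch below. It does not call the real mapping class: `write_metadata_from_csv` is a hypothetical stand-in for that step, and the column names and file name are assumptions chosen only to mirror the main dataset.

```python
import os
import tempfile

import pandas as pd


def write_metadata_from_csv(csv_path):
    """Hypothetical stand-in for the mapping step: it only reads the header."""
    return list(pd.read_csv(csv_path, nrows=0).columns)


# Assumed column names; the dummy only needs the same header as the real cube.
columns = ["Geography", "Age", "Gender", "Value"]
dummy = pd.DataFrame(columns=columns)

# Temporary directory used here in place of the pipeline's output folder.
out_dir = tempfile.mkdtemp()
dummy_path = os.path.join(out_dir, "observations.csv")
dummy.to_csv(dummy_path, index=False)

try:
    # The metadata step only needs the file to exist with the right header.
    seen_columns = write_metadata_from_csv(dummy_path)
finally:
    # Remove the dummy so it is never uploaded in place of the real data.
    os.remove(dummy_path)
```

The `try`/`finally` means the dummy is cleaned up even if the metadata step fails partway through.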

Senor Alex is currently coding a Groovy script to handle the zipped files.

ajtucker commented 4 years ago

Thanks @LPerryman !

The CSVWMapping workaround has an issue at GSS-Cogs/gss-utils#76 that wouldn't be too hard to implement.

The Groovy script has run and uploaded the 5 million odd observations to PMD. The draft looks ok, apart from issues below. I'm currently attempting to publish the results so we can all see.

- [ ] The counts are coming through as .0 decimals, despite us saying the column values are integers. I think this is down to Pandas outputting the values in CSV as floats, so we should coerce the column to integer before outputting.
- [ ] The age codelist isn't currently being uploaded by the Jenkins job -- we could gzip the ttl and put it in the out directory and it should get uploaded.
- [ ] The age labels seem to be showing up with their skos:notation rather than their rdfs:label, which is odd.
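A minimal sketch of the integer coercion from the first item, assuming the counts sit in a column called `Value` and contain no missing values; the sample rows are made up.

```python
import io

import pandas as pd

# Made-up sample: counts held as floats, as Pandas often infers them.
df = pd.DataFrame({"Age": ["aged-16-plus", "all"], "Value": [1234.0, 5678.0]})

# Coerce to integer before writing so the CSV has 1234 rather than 1234.0.
# errors="raise" makes any non-numeric value fail loudly instead of silently.
df["Value"] = pd.to_numeric(df["Value"], errors="raise").astype(int)

# Write to an in-memory buffer here; the pipeline would write to a file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

If the column could contain missing values, `astype(int)` would raise; the nullable `"Int64"` dtype is one alternative in that case.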

LPerryman commented 4 years ago

The code that creates the zipped data cube has been rerun with the Value column converted to integer. The Age codelist has also been changed slightly (Aged 16 -> Aged 16 plus).

LPerryman commented 4 years ago

Data has been published to PMD4: Population estimates by Output Area Geographies, Gender and Age, England and Wales https://staging.gss-data.org.uk/cube/explore?uri=http%3A%2F%2Fgss-data.org.uk%2Fdata%2Fgss_data%2Ftowns-high-streets%2Fons-population-estimates-by-output-areas-electoral-health-and-other-geographies-england-and-wales-catalog-entry

GSS-Cogs / family-towns-and-high-streets

ONS-Population-estimates-by-output-areas-electoral-health-and-other-geographies-England-and-Wales #16