paigeh-fsa commented 1 year ago

Currently, we know that the England and Wales script works for 2019, but we are not sure about 2011. Northern Ireland/Scotland work for 2011 so we need this to be consistent for one year. If not, this will need to be fixed.

Use the output of #439

james-westwood commented 1 year ago

Got population estimates for 2011 from National Archives https://webarchive.nationalarchives.gov.uk/ukgwa/20170804143449/https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datalist?filter=datasets&page=1

james-westwood commented 1 year ago

2002-2012 data is concatenated into one dataset! The age breakdown is missing as is the sex breakdown.

paigeh-fsa commented 1 year ago

PLAN OF ACTION

METHOD 1

Take the year generator out of the functions so it would be at the top of when you start processing e.g.,

for year in years:

proc_each_region_for_all_year() etc....

This will then take the year of interest e.g., 2002 and it will process the CSV for each region at a time and store the columns we need in a dictionary for that one given year. We can then take that dictionary and save it as a pickle, then we have for one year ALL regions. Iterate through each year and we will be able to store a separate dictionary for each year.

Then, we can load the correct year we need and create the final df and save this as a feather.

METHOD 2

Do the same but do not save the pickle (as we could process one dataframe at a time for each year) e.g.,

process CSV for each region for one year and store the dictionary so we have one year ALL regions. Then take dictionary and concatenate and create final df and save as a feather. Delete the dataframe we have created and start the process again with the next year and so on.

NB: I would suggest doing the second method so we don't have to create pickle files when it is not needed. Both methods are the same however the first method saves a pickle file of the dictionary that has one year and all regions and the second doesn't. I'd recommend doing the first approach if the memory 'runs out'.

james-westwood commented 10 months ago

I am splitting this into 2 tickets

Reshape the data from the 2002-2012 archive
Run the pipeline with 2011 data

There might be other tickets generated if for example we find there is other 2011 data missing.

paigeh-fsa commented 4 months ago

We were missing East England data, which I found here: https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/censusoutputareaestimatesintheeastregionofengland

It's in a different format to the others

paigeh-fsa commented 3 months ago

We have loaded all xlsx files for all years for each region in the duckDB database. Next steps are to extract specific years for all regions in one table for that given year. Create the separate tables based on sex from here. Finally, write these tables into an Excel workbook (xlsx) with each tab being male, female, and persons.

james-westwood commented 3 months ago

Got it to create all tables for each region (containing all years) and then create each individual year's table (with all data from each region).

Now I am trying to create the male, female and persons tables in the age_pop_by_sex function. The query needs to be improved so that the actual age column names are there e.g. "0-5" "90+" etc. at the moment it's just indiviual integers pulled from the config, e.g. 1,2,3 etc.

The query needs to be corrected. Then each of the tables (male, female, persons) written out on three tabs in one workbook.

The query for persons will need to sum male+female for each OACD row.

james-westwood commented 3 months ago

To clarify what I wrote last week, the new source data has integer column names (1, 2, 3...) for ages, up to 89, then we have 90+.

Then I need to group the data by age, possibly using a method like detailed here: https://github.com/ONSdigital/SDG_11.2.1/issues/7#issuecomment-765548063

james-westwood commented 3 months ago

I haven't checked the output of the sql queries but they are executing successfully. The Excel writing function is hanging. Will look into it. Might be a memory problem.

james-westwood commented 2 months ago

Made loads of progress today.

England-Wales is running end-to-end.

Scotland is not running.

Both EW and now Scotland are tripping up on the name of columns in the served_proportions_disagg function. I tried to add some flow control to solve it, but really we need consistent naming in the dataframes.

This should be handled in

replacements = {"males_pop": "male",
                        "fem_pop": "female"}
local_df.rename(columns=replacements, inplace=True)

(on line 112)

but it's not. Possibly the wrong df is being given to the served_proportions_disagg function, meaning that although the correction is applied it's not feeding through to the function.

Also, the output numbers in EW need to be investigated.

jwestw commented 2 months ago

@james-westwood Dedicate 1 more month to this project.

1 day to E&W 3 days to Scotland, NI for 2011.

ONSdigital / SDG_11.2.1

Ensure England and Wales script runs for 2011 and fix if not #419

PLAN OF ACTION