alan-turing-institute / uatk-spc

Synthetic Population Catalyst
https://alan-turing-institute.github.io/uatk-spc/
MIT License

Extension to GB + Time projections + Additional values #41

Open HSalat opened 1 year ago

HSalat commented 1 year ago

Tbc.

A sample LAD from the new raw population file, a field dictionary and some comments will be added soon™.

HSalat commented 1 year ago

To adapt the code to the new standard, four tasks must be performed:

1. Ensure the modified fields do not create conflicts

The following fields have been changed:

See the attached Definitions.txt for a full list of variables. (Attachments: Definitions.txt, E09000002.csv.zip)

2. Cut the data into thematic pieces

I would suggest keeping IDs and core variables (age, sex, ethnicity, nssec8, sic1d2007 and coordinates) as the core block, and propose the following as add-ons:

This is entirely open for discussion and should be reviewed with more concrete information on actual file sizes.

3. Create a mechanism to draw a new diary each day

The attached file E09000002_Diaries.json.zip contains all compatible diaries (referenced by their unique ID), for both working days and weekend days, for each individual within the sample LAD. There are several possibilities depending on feasibility: draw a fixed WD and WED for the entire simulation, draw one for each day from the entire sample, or draw one for each day from a more restricted sample. This means that the protobuf should record a list of n diaries for each individual. It would be great to also record the lockdown in a similar way, and eventually the specific dates of particular events. The bottom line is that the output should now track the changing days of the simulation.

The diaries themselves will be added later (soon™) in a separate file. Note that they will contain a 'type of day' field that can come in handy should we want to run specific simulations (e.g. for a Christmas-period high-street maximum load, set no one to work on a particular day).

E09000002_Diaries.json.zip
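As a rough sketch of what this per-person structure could look like (hypothetical field names, not the actual schema):

```rust
// Hypothetical sketch, not the actual SPC schema: per-person diary references,
// with one pool of compatible weekday diaries and one of weekend diaries.
struct Person {
    id: String,
    diaries_weekday: Vec<u64>, // IDs into a shared diary table
    diaries_weekend: Vec<u64>,
}

// Draw a diary ID for one simulated day. A fixed draw corresponds to always
// passing the same `day`; drawing per day corresponds to varying it.
fn diary_for_day(person: &Person, is_weekday: bool, day: usize) -> Option<u64> {
    let pool = if is_weekday {
        &person.diaries_weekday
    } else {
        &person.diaries_weekend
    };
    pool.get(day % pool.len().max(1)).copied()
}
```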

4. Create a time projection module

Soon, several years (instead of one) will be available for each LAD. This will require coding an array of parameters that modify the variables in order to project them into the future. The reference material for this module will be provided in January.

HSalat commented 1 year ago

The diaries and the definitions for their fields can be found below. Note that some field names have been slightly altered, which may have consequences for ASPICS. (Attachments: diariesRef.csv.zip, DefinitionsDiaries.txt)

As a precaution, I'm attaching the uncommented WIP R code. (Attachment: code_save_30.11.22.zip)

dabreegster commented 1 year ago

lng/lat now refer to the centroid of the OA, no longer the centroid of the MSOA.

I see MSOA11CD and OA11CD are both present now. So we know people at a finer granularity. We need to make a decision here -- should SPC still base everything off the MSOA level? Is the problem just that these per-person coordinates from SPENSER are now at the OA level? We don't use those anyway -- people belong to a household, households belong to an MSOA, we have the MSOA polygon, and so we can calculate its centroid.

The rest of the changes in part 1 make sense and don't look hard, thanks for the precision and sample data.

  2. Cut the data into thematic pieces

So the idea is per study area, we publish several protobuf files. Users load the ones they care about, and the join should be trivial by the ID listed. Is every join bijective -- exactly 1 person matches up with exactly 1 entry in the socioecon file?

Also not opinionated about the exact split; what you proposed seems fine. We organize like this today: https://github.com/alan-turing-institute/uatk-spc/blob/2cf10eb184a47baa4b224df07f8223fe077dd4fc/synthpop.proto#L57
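To illustrate what I mean by a trivial join, a minimal consumer-side sketch (the types are made up, not the real synthpop.proto messages):

```rust
use std::collections::HashMap;

// Hypothetical types, not the real synthpop.proto messages.
struct CorePerson { id: u64, age: u32 }
struct SocioEcon { person_id: u64, nssec8: u8 }

// Join an add-on block onto the core block by person ID. If the join is
// bijective, the lookup succeeds exactly once per person.
fn join<'a>(
    core: &'a [CorePerson],
    addon: &'a [SocioEcon],
) -> Vec<(&'a CorePerson, Option<&'a SocioEcon>)> {
    let lookup: HashMap<u64, &SocioEcon> =
        addon.iter().map(|s| (s.person_id, s)).collect();
    core.iter().map(|p| (p, lookup.get(&p.id).copied())).collect()
}
```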

  3. Create a mechanism to draw a new diary each day

Checking my understanding here... every single person will have at least one diaryWD and diaryWE entry?

draw one for each day from the entire sample, draw one for each day from a more restricted sample

The changes in SPC seem straightforward here. Less so in ASPICS, of course. Maybe as a first step in the transition, we arbitrarily use the first weekday entry for ASPICS. How would we decide the "more restricted sample"? Or if we wanted to choose a weekday diary nonuniformly at random, how would we come up with sensible weights?
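For instance, if we did have per-diary weights from somewhere, the draw itself would be easy (a sketch using the rand crate; where the weights come from is the open question):

```rust
use rand::distributions::WeightedIndex;
use rand::prelude::*;

// Sketch only: draw one diary ID given per-diary weights, whatever their source.
fn draw_weighted(diary_ids: &[u64], weights: &[f64]) -> Option<u64> {
    let dist = WeightedIndex::new(weights).ok()?; // rejects empty or negative weights
    let mut rng = thread_rng();
    Some(diary_ids[dist.sample(&mut rng)])
}
```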

  4. Create a time projection module

Can we relate one person over many years? If not, it sounds like there are no changes to SPC itself. We would run SPC on the 2020 input, on the 2030 input, etc. The outputs are totally unrelated. It's up to the user to pick the relevant one, and to... know not to try to correlate anything across years?

As a precaution, I'm attaching the uncommented WIP R code.

This is what git is for! Feel free to start a scripts/data_prep/v2 folder and commit there as frequently as you'd like.

HSalat commented 1 year ago

I see MSOA11CD and OA11CD are both present now. So we know people at a finer granularity. We need to make a decision here -- should SPC still base everything off the MSOA level? Is the problem just that these per-person coordinates from SPENSER are now at the OA level? We don't use those anyway -- people belong to a household, households belong to an MSOA, we have the MSOA polygon, and so we can calculate its centroid.

The coordinates are something I've added myself so that it doesn't need to be done again within SPC. Households now belong to both an OA and an MSOA. The problem is that running the commuting modelling at OA level will greatly increase the size of the distance matrix (a runtime increase of potentially > 100 times, since it scales with the square of the number of zones). The only way to know for sure would be to test at OA, LSOA and MSOA level and compare. If, for example, OA is very slow but doable, we could default to MSOA and leave the level as an extra parameter (commuting precision) of the model.
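As a rough order-of-magnitude check (taking ~175k OAs for GB and roughly 8.5k MSOA-level zones, both approximate figures), the matrix would grow by about

$$\left(\frac{N_{\text{OA}}}{N_{\text{MSOA}}}\right)^{2} \approx \left(\frac{175\,000}{8\,500}\right)^{2} \approx 420$$

times, consistent with the > 100 estimate above.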

So the idea is per study area, we publish several protobuf files. Users load the ones they care about, and the join should be trivial by the ID listed. Is every join bijective -- exactly 1 person matches up with exactly 1 entry in the socioecon file?

My idea would be to let the user decide which themes they want before running, and then have the model output a single protobuf with what was asked for (+ maybe the rest in other .pb files, just in case?). I'm not sure what you mean by the socioecon file. There are currently two base files for the population: all static characteristics, and all compatible diaries, both with 1 row/list item per individual.

As for how to cut, it would be best to test before deciding. Realistically, a reasonable single file should maybe never exceed 2 GB?

Checking my understanding here... every single person will have at least one diaryWD and diaryWE entry?

Yes. If that's not the case, it's a mistake and I'll fix it in January (I had so many things to check that it's quite possible I let a few slide).

The changes in SPC seem straightforward here. Less so in ASPICS, of course. Maybe as a first step in the transition, we arbitrarily use the first weekday entry for ASPICS. How would we decide the "more restricted sample"? Or if we wanted to choose a weekday diary nonuniformly at random, how would we come up with sensible weights?

We can start this way, but it's an important step towards including events (which should be compatible with the diary of the day). Assuming the ONS do their sampling properly, we don't need weights. I was thinking about performance, because the E09000002_Diaries.json.zip file is cumbersome and irregular, so it might be easier to have a neat pop x 10 regular matrix (repeating a few entries for those individuals that have < 10 options). You're the expert on efficient typing, so your call!
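Something like this for the padding, purely illustrative:

```rust
// Purely illustrative padding: give every person exactly 10 diary options,
// cycling through their real options when they have fewer than 10.
fn pad_to_ten(options: &[u64]) -> [u64; 10] {
    assert!(!options.is_empty(), "every person should have at least one diary");
    let mut row = [0u64; 10];
    for (i, slot) in row.iter_mut().enumerate() {
        *slot = options[i % options.len()];
    }
    row
}
```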

Can we relate one person over many years? If not, it sounds like there are no changes to SPC itself. We would run SPC on the 2020 input, on the 2030 input, etc. The outputs are totally unrelated. It's up to the user to pick the relevant one, and to... know not to try to correlate anything across years?

Detailed trend projections exist for a lot of characteristics, so I intend to build something that morphs the current probability distribution for, say, age into the projected distribution (it's not enough to age everyone by 10 years, because that would not take into account migration, changes in natality, etc.).
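A very rough sketch of the morphing idea, with decade bands and field names as placeholders (the real module will be based on the detailed trend projections):

```rust
use std::collections::HashMap;

// Rough sketch: weight each individual by the ratio of the projected to the
// current share of their age band, then resample with those weights. Decade
// bands and the resampling step are placeholders, not the actual module.
fn projection_weights(
    ages: &[u32],
    current_share: &HashMap<u32, f64>, // age band -> share today
    target_share: &HashMap<u32, f64>,  // age band -> projected share
) -> Vec<f64> {
    ages.iter()
        .map(|age| {
            let band = age / 10;
            let cur = current_share.get(&band).copied().unwrap_or(0.0);
            let tgt = target_share.get(&band).copied().unwrap_or(0.0);
            if cur > 0.0 { tgt / cur } else { 0.0 }
        })
        .collect()
}
```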

ciupava commented 1 year ago

@HSalat, is the code behind this process located just on your laptop? If so, are you planning to share it somehow? Thanks!

ciupava commented 1 year ago

here? https://github.com/alan-turing-institute/uatk-aspics

HSalat commented 1 year ago

As a precaution, I'm attaching the uncommented WIP R code. (Attachment: code_save_30.11.22.zip)

I will make a clearer version later; I ran out of time.

dabreegster commented 1 year ago

I noticed pid and hid changed from integers to strings including the MSOA. Is this intentional and necessary? Repeating the full string may have file size or perf implications downstream. Previously I grouped by (MSOA, HID) on the SPC side. If HID becomes globally unique, I'll change that.

Also, I'm assuming all of the household-level attributes in E09000002.csv are the same for an (MSOA, HID) pair. Any thoughts on deduplicating that kind of information, perhaps by having a separate households.csv file? It's not a big deal since it's an intermediate file format; the final output proto exists exactly to smooth out issues like this.
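Something like this is what I have in mind (names illustrative, not a proposal for the actual format):

```rust
// Illustrative only: household-level attributes stored once, person rows
// referencing them by hid instead of repeating the values.
struct Household {
    hid: String,      // e.g. MSOA11CD + within-MSOA number, per Definitions.txt
    oa: String,
    nssec8_head: u8,  // head-of-household value, stored once per household
}

struct PersonRow {
    pid: String,
    hid: String,      // foreign key into households.csv
    age: u32,
    nssec8: u8,       // individual value, drawn per person
}
```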

dabreegster commented 1 year ago

Also, there've been some renamed fields, like pid_tus to id_TUS_hh. You did say

Note that some field names have been slightly altered, which may have consequences for ASPICS.

But any other major changes I should keep in mind? And could you upload the R scripts or whatever else is generating these CSV files, so I can have better insight into the fields?

dabreegster commented 1 year ago

And another important note: please be careful to not overwrite https://ramp0storage.blob.core.windows.net/countydata/pop_greater-london.gz and similar at any point. The current version of SPC relies on these inputs to work, and we want to preserve reproducibility. Any new files can go in a new subdirectory, perhaps versioned like we do with the output, https://ramp0storage.blob.core.windows.net/spc-output/v1.2/durham.pb.gz

dabreegster commented 1 year ago

The new definitions say

hid: Unique household identifier at GB level
        (MSOA11CD + number within MSOA between 00001 and 99999)

Previously, -1 indicated people not matched to a household, which we filtered out. What's the case now?
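For reference, here's how I'd expect to split the new hid back apart, assuming the last 5 characters are the within-MSOA number (the exact layout should follow Definitions.txt):

```rust
// Sketch: recover (MSOA11CD, within-MSOA number) from the new hid, assuming
// the final 5 characters are the zero-padded number.
fn split_hid(hid: &str) -> Option<(&str, u32)> {
    if hid.len() <= 5 {
        return None;
    }
    let (msoa, num) = hid.split_at(hid.len() - 5);
    num.parse::<u32>().ok().map(|n| (msoa, n))
}
```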

HSalat commented 1 year ago

I noticed pid and hid changed from integers to strings including the MSOA. Is this intentional and necessary? Repeating the full string may have file size or perf implications downstream. Previously I grouped by (MSOA, HID) on the SPC side. If HID becomes globally unique, I'll change that.

For now I want to keep it open: the idea is that we would use OA11CD from now on and the MSOA11CD field would become the redundant one, but I'm wary that it might be ambitious to run the modelling of the different flows at the OA scale. In that case, do we keep pre-computed MSOA11CD and LSOA11CD aggregates within the data, or do we re-compute them from a look-up when necessary (remember I was considering leaving an option for the user to choose their desired level of precision)? So my stance is to see runtimes for the test area and choose what to keep in the final data depending on the results.

Also, I'm assuming all of the household-level attributes in E09000002.csv are the same for an (MSOA, HID) pair. Any thoughts on deduplicating that kind of information, perhaps by having a separate households.csv file? It's not a big deal since it's an intermediate file format; the final output proto exists exactly to smooth out issues like this.

Previously, -1 indicated people not matched to a household, which we filtered out. What's the case now?

The ones tagged as house should be the same, yes (note there are two NSSEC8 fields: one is for the 'head of household' and is duplicated across the household; the other is individual, drawn from distributions for each individual who is not the head of household). The reason I'm merging them is so that I can trim the individuals not matched to a household (there should no longer be any -1s). I can re-dissociate them if needed.

dabreegster commented 1 year ago

A proposal: why don't you have a go at rewriting synthpop.proto to reflect the new Definitions.txt, being more familiar with the changes?

HSalat commented 1 year ago

Because I don't know the syntax; I wrote a .txt instead containing all the needed information :)

HSalat commented 1 year ago

But any other major changes I should keep in mind? And could you upload the R scripts or whatever else is generating these CSV files, so I can have better insight into the fields?

The R files are above in the thread.

dabreegster commented 1 year ago

Because I don't know the syntax; I wrote a .txt instead containing all the needed information :)

It's not terribly obscure to learn from example (https://github.com/alan-turing-institute/uatk-spc/blob/new_schema/synthpop.proto) and is well-documented (https://developers.google.com/protocol-buffers/docs/proto). I have my hands a bit full rewriting the Rust bits and making the web app and non-SPC projects, and you're the most familiar with the changes to the schema, so it'd be a massive help.

The R files are above in the thread.

Thanks, added to git here: https://github.com/alan-turing-institute/uatk-spc/tree/main/scripts/data_prep/new. Being able to track changes to the scripts over time is massively helpful.

HSalat commented 1 year ago

I've made some changes.

Note that sic1d should be a letter in the output, but converted to the equivalent number when used inside the model.

No diaries yet.

dabreegster commented 1 year ago

Thank you, working on adapting the code now!

pwork and the other per-person time use data is gone. Before we work on diaries, there's one dependency on old time use data... in the commuting logic, we filter for people who spend some of their daily time working. Should we look at PwkStat instead now? Which categories should be used for commuting -- EMPLOYEE_FT, EMPLOYEE_PT, EMPLOYEE_UNSPEC? Also SELF_EMPLOYED?

dabreegster commented 1 year ago

And actually, the schema still has ActivityDuration. If the source data lacks simplified time use now, do we also remove this?

dabreegster commented 1 year ago

I've got things running successfully on the sample data. https://github.com/alan-turing-institute/uatk-spc/compare/main...new_schema is all the code changes so far.

Questions:

About the OA-level commuting: currently we use distance between the point listed in the businesses CSV file (which my notes say is an LSOA centroid?) and whatever's listed in the per-person file (previously MSOA, now OA). There is no caching or batching or perf gains related to the MSOAs of people; I had tried that in #4 previously and didn't get any benefit. So, we will automatically start using OAs for this. If we have a shapefile or similar with all the OAs, we ought to include this in the proto output for convenience, or link to it clearly somewhere.

There's a typo in the E_Rubgy column

I haven't touched diaries yet.
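For context on the OA-level commuting point above, the distance in question is just a great-circle computation between two points (a hypothetical helper, not the actual SPC code):

```rust
// Hypothetical helper: haversine distance between a business location and an
// OA centroid, both given as (lat, lng) in degrees.
fn haversine_km(lat1: f64, lng1: f64, lat2: f64, lng2: f64) -> f64 {
    const EARTH_RADIUS_KM: f64 = 6371.0;
    let (phi1, phi2) = (lat1.to_radians(), lat2.to_radians());
    let d_phi = (lat2 - lat1).to_radians();
    let d_lambda = (lng2 - lng1).to_radians();
    let a = (d_phi / 2.0).sin().powi(2)
        + phi1.cos() * phi2.cos() * (d_lambda / 2.0).sin().powi(2);
    2.0 * EARTH_RADIUS_KM * a.sqrt().atan2((1.0 - a).sqrt())
}
```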

HSalat commented 1 year ago

At present, I would use pwkstat between 1 and 3. I will check later how much it changes the accuracy of the methods.

ActivityDuration should be replaced, but we need to figure out an entire data structure for the daily activities. Same for home_total: ideally, this would have to be recalculated for each day. Any ideas?

The OA shapefile exists, but it has ~175k polygons for GB. Also, it might be overkill for showing histograms in the web app for example. I'll add preparing shapefiles to the TODO list.

dabreegster commented 1 year ago

Same for home_total: ideally, this would have to be recalculated for each day.

Each person references diary IDs, and the diaries contain all the stuff. home_total can be calculated per diary. The question is, the lockdown stuff is calculated once per population file right now, and it uses home_total. Do we want to average over all possible diary entries or something?

I'll add preparing shapefiles to the TODO list.

You're right, maybe we should check file size and load time impact of including OAs. Fernando helped me find https://geoportal.statistics.gov.uk/search?collection=Dataset&sort=name&tags=all(BDY_OA%2CDEC_2021). Clipping them to each area in SPC wouldn't be hard.

HSalat commented 1 year ago

What I'm suggesting is that we apply the lockdown logic for each day instead of once for the entire file: draw the diaries first, then set mean_pr_home_tot for that specific day to the value calculated from the drawn diaries. It's more work for SPC to do, but also more reason to use it instead of collecting the pre-processed data directly from Azure!
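Schematically (the field name is a placeholder for whichever diary field holds the at-home fraction):

```rust
// Schematic only: once each person's diary for the day is drawn, the day's
// at-home proportion is just the mean over those diaries.
struct DrawnDiary { pr_home: f64 }

fn mean_pr_home_for_day(drawn: &[DrawnDiary]) -> f64 {
    if drawn.is_empty() {
        return 0.0;
    }
    drawn.iter().map(|d| d.pr_home).sum::<f64>() / drawn.len() as f64
}
```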

HSalat commented 1 year ago

In a way, SPC should become more about simulating "days in the life of a population" than about simulating "a population". I'd say: match children with schools and employees with office placements permanently, then, for each day, draw the diaries, draw new retail destinations for people with non-zero shopping time, and calculate any optional activity reduction. I will soon add the month of the year each diary was recorded; if a diary of that month is available, it should be drawn in priority over diaries from other months.
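Schematically, the month-preferring draw could look like this (the month field is the one I'll add; the rest is illustrative, using the rand crate):

```rust
use rand::prelude::*;

// Illustrative: prefer diaries recorded in the simulated month, falling back
// to the full pool when none match.
struct DiaryRef { id: u64, month: u8 }

fn draw_for_month(options: &[DiaryRef], month: u8, rng: &mut impl Rng) -> Option<u64> {
    let same_month: Vec<&DiaryRef> =
        options.iter().filter(|d| d.month == month).collect();
    if same_month.is_empty() {
        options.choose(rng).map(|d| d.id)
    } else {
        same_month.choose(rng).map(|d| d.id)
    }
}
```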

dabreegster commented 1 year ago

I love the idea of dynamically generating a day's activity+travel behavior for everyone. But this is substantial scope creep, and maybe worth a decision about where to implement this logic. With the ASPICS/SPC split, the idea has been to make SPC emit all of the data in a compact format, and put the burden of using it on the consumer. So ASPICS can simulate one person going to multiple schools with different weights, and let another consumer pick a single specific school if they prefer.

I was imagining we just plumb through all the raw data about the diaries. Something downstream can do this daily destination drawing. Maybe it starts life as a module in ASPICS, but is generally useful for other hypothetical users of this.

What lockdown data would a consumer need? We could just emit total_change_per_day instead, letting something downstream decide how to re-scale.

HSalat commented 1 year ago

New reference file for diaries with months: https://ramp0storage.blob.core.windows.net/nationaldata-v2/diariesRef.csv.zip And definitions: https://ramp0storage.blob.core.windows.net/nationaldata-v2/diariesRef_Definitions.txt

dabreegster commented 1 year ago

Those links are broken, and I don't see any files in other directories on Azure. Why not check them into this git repo, so we have them under version control and it's easier to see changes over time?

HSalat commented 1 year ago

That's because I didn't enable public access on Azure. It's fixed now.

HSalat commented 1 year ago

Data extension checklist:

dabreegster commented 1 year ago

To clarify, replacing QUANT is somewhere on my backlog, but not happening anytime soon.

As new input data becomes ready, it'd be great to get it in Azure (under a new v2 directory), so we can start pointing the Rust code at real files somewhere

HSalat commented 1 year ago

Ok, can you add the one currently in use to nationaldata-v2? There are two of them in the other folder and I can't quite remember the history of the one tagged "_spc"

dabreegster commented 1 year ago

Things are coming together! Aside from the countydata files from SPENSER, we also need

HSalat commented 1 year ago

OSM data is downloaded inside the model according to the links in the OSM field of the lookup. The new lookup includes Wales and Scotland, but there are no finer sub-regions available for those.

Not sure who made those files; I can extract them from the lookup if you want.

dabreegster commented 1 year ago

Ah you're right, the OSM part is already sorted -- the Geofabrik URL is in lookUp-GB.csv.

Generating the config files from the lookup isn't hard. I guess we just need to decide what we provide at https://alan-turing-institute.github.io/uatk-spc/outputs.html -- England counties and some special areas so far.

HSalat commented 1 year ago

1-to-1 with Azure countydata-v2 (under AzureRef inside the lookup), which are:

For special areas, I would add Wales and Scotland to the existing ones.

HSalat commented 1 year ago

Status of missing LADs for England

Known issues (no fix):

SPENSER currently failing:

Will be added next week:

dabreegster commented 1 year ago

I ran all 390 combos of years/regions: 293 successes, 97 failures. The failures are mostly Scotland and config/special areas; the cause is the missing data mentioned above.

HSalat commented 1 year ago

Status of missing LADs for England

Known issues (no fix):

SPENSER currently failing:

mfbenitezp commented 1 year ago

Is it because of inconsistencies with the area codes?

HSalat commented 1 year ago

I'm not sure. I recognise some of the codes from the list of LADs that were scrapped between 2011 and 2020, but there were other reasons why SPENSER was failing for some other LADs. REG will investigate!

sgreenbury commented 1 year ago

The following are now included, following fixes to our SPENSER pipeline (further detail in spc-hpc-pipeline #31) and SPC (#58):

The updates can be included in a v2.1 release - I'll open a PR for this to capture the changes.