HSalat opened this issue 1 year ago
In order to adapt the code to the new standard, 4 tasks must be performed:
1. Ensure the modified fields do not create conflicts
The following fields have been changed:
- sex is now {1,2}; please keep it that way and refrain from changing it to a {0,1} variable, as it is coherent with all ONS data ;)
- sic1d2007 is now a letter, as the official value is. In order for the commuting modelling to keep working, this needs to be transformed into the corresponding number in alphabetical order (but the output in the protobuffer should remain the letter).
- lng/lat now refer to the centroid of the OA, no longer the centroid of the MSOA. This can potentially make the computation of commuting trips much longer. If too slow, aggregation at LSOA or MSOA level might be necessary. Maybe keep the option open as a new parameter?
- NA values are now -1.

See Definitions.txt attached for a full list of variables.
Definitions.txt
E09000002.csv.zip
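A minimal sketch of the letter-to-number conversion mentioned above for sic1d2007 (assuming single section letters A-U, as in the SIC 2007 sections; the protobuffer keeps the letter, only the commuting model uses the number):

```python
def sic_letter_to_number(letter: str) -> int:
    """Map a SIC 2007 section letter to its position in alphabetical
    order, so 'A' -> 1, 'B' -> 2, ..., 'U' -> 21."""
    letter = letter.strip().upper()
    if len(letter) != 1 or not "A" <= letter <= "U":
        raise ValueError(f"unexpected SIC section: {letter!r}")
    return ord(letter) - ord("A") + 1
```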
2. Cut the data into thematic pieces
I would suggest keeping IDs and core variables (age, sex, ethnicity, nssec8, sic1d2007 and coordinates) as the core block, and propose as add-ons:
- HOUSE_
- id_HS
- id_TUS_
- EXxxxxxxx

This is entirely open for discussion and should be reviewed with more concrete information on actual file sizes.
3. Create a mechanic to draw a new diary each day
The file E09000002_Diaries.json.zip attached contains all compatible diaries (referenced by their unique ID) for both working days and weekend days for each individual within the sample LAD. There are several possibilities depending on feasibility: draw a fixed WD and WED for the entire simulation, draw one for each day from the entire sample, or draw one for each day from a more restricted sample. This means that the protobuffer should record a list of n diaries for each individual. It would be great to also record the lockdown in a similar way, and then also specific dates of particular events. The bottom line is that the output should now track the changing days of the simulation.
The diaries themselves will be added later (soon™) inside a separate file. Note that they will contain a "type of day" field that can come in handy should we want to do specific simulations (e.g. Christmas period high street max load -> set no one working on a particular day).
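A tiny sketch of the per-day drawing described above, assuming a hypothetical data layout with one weekday list and one weekend list of diary IDs per person:

```python
import random

def draw_diaries(diary_wd, diary_we, n_days, start_weekday=0, rng=None):
    """For each simulated day, draw one compatible diary ID: from the
    weekday pool Mon-Fri, from the weekend pool Sat-Sun. Field names
    and structure are assumptions, not the actual schema."""
    rng = rng or random.Random(0)
    drawn = []
    for day in range(n_days):
        weekday = (start_weekday + day) % 7
        pool = diary_we if weekday >= 5 else diary_wd
        drawn.append(rng.choice(pool))
    return drawn
```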
4. Create a time projection module
Soon, several years instead of one will be available for each LAD. This will require coding an array of parameters to modify the variables, updating their projection into the future. The reference material for this module will be provided in January.
The diaries and the definitions for their fields can be found below. Note that some field names have been slightly altered, which can have consequences on ASPICS. diariesRef.csv.zip DefinitionsDiaries.txt
As a measure of precaution, I'm attaching the uncommented WIP R codes. code_save_30.11.22.zip
lng/lat now refer to the centroid of the OA, no longer the centroid of the MSOA.
I see MSOA11CD and OA11CD are both present now. So we know people at a finer granularity. We need to make a decision here -- should SPC still base everything off the MSOA level? Is the problem just that these per-person coordinates from SPENSER are now at the OA level? We don't use those anyway -- people belong to a household, households belong to an MSOA, we have the MSOA polygon, and so we can calculate its centroid.
The rest of the changes in part 1 make sense and don't look hard; thanks for the details and sample data.
- Cut the data into thematic pieces
So the idea is per study area, we publish several protobuf files. Users load the ones they care about, and the join should be trivial by the ID listed. Is every join bijective -- exactly 1 person matches up with exactly 1 entry in the socioecon file?
Also not opinionated about the exact split; what you proposed seems fine. We organize like this today: https://github.com/alan-turing-institute/uatk-spc/blob/2cf10eb184a47baa4b224df07f8223fe077dd4fc/synthpop.proto#L57
- Create a mechanic to draw a new diary each day
Checking my understanding here... every single person will have at least one diaryWD and diaryWE entry?
draw one for each day from the entire sample, draw one for each day from a more restricted sample
The changes in SPC seem straightforward here. Less so in ASPICS, of course. Maybe as a first step in the transition, we arbitrarily use the first weekday entry for ASPICS. How would we decide the "more restricted sample"? Or if we wanted to choose a weekday diary nonuniformly at random, how would we come up with sensible weights?
- Create a time projection module
Can we relate one person over many years? If not, it sounds like there are no changes to SPC itself. We would run SPC on the 2020 input, on the 2030 input, etc. The outputs are totally unrelated. It's up to the user to pick the relevant one, and to... know not to try to correlate anything across years?
As a measure of precaution, I'm attaching the uncommented WIP R codes.
This is what git is for! Feel free to start a scripts/data_prep/v2 folder and commit there as frequently as you'd like.
I see MSOA11CD and OA11CD are both present now. So we know people at a finer granularity. We need to make a decision here -- should SPC still base everything off the MSOA level? Is the problem just that these per-person coordinates from SPENSER are now at the OA level? We don't use those anyway -- people belong to a household, households belong to an MSOA, we have the MSOA polygon, and so we can calculate its centroid.
The coordinates are something I've added myself so that it doesn't need to be done again within SPC. Households belong to both an OA and an MSOA now. The problem is that it's going to greatly increase the size of the distance matrix when computing the commuting modelling at OA level (runtime increase potentially > 100 times, since it's squared). The only way to know for sure would be to test at OA, LSOA and MSOA level and compare. If OA is very slow but doable, for example, then we could default to MSOA and leave the rest as an extra parameter (commuting precision) of the model.
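Rough back-of-envelope for why the zone level matters so much: the dense zone-to-zone matrix grows with the square of the zone count. The ~175k OA figure appears later in this thread; the MSOA and LSOA counts below are ballpark assumptions, not exact:

```python
# Approximate GB zone counts (assumptions for illustration only).
ZONES = {"MSOA": 9_000, "LSOA": 42_000, "OA": 175_000}

def matrix_cells(n_zones: int) -> int:
    """Cells in a dense n x n zone-to-zone distance matrix."""
    return n_zones * n_zones

# OA vs MSOA: roughly (175000 / 9000)^2, i.e. a few hundred times larger.
ratio = matrix_cells(ZONES["OA"]) / matrix_cells(ZONES["MSOA"])
```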
So the idea is per study area, we publish several protobuf files. Users load the ones they care about, and the join should be trivial by the ID listed. Is every join bijective -- exactly 1 person matches up with exactly 1 entry in the socioecon file?
My idea would be to let the user decide which themes they want before running, and then the model outputs a single protobuf with what's been asked (+ maybe the rest in other .pb just in case?). Not sure what you mean by the socioecon file. There are two base files currently for the population: all static characteristics and all compatible diaries, both with 1 row/list item per individual.
As for how to cut, it would be best to test before deciding. Realistically, a reasonable single file should never exceed 2 GB, maybe?
Checking my understanding here... every single person will have at least one diaryWD and diaryWE entry?
Yes. If that's not the case (I had so many things to check it's quite possible I let a few mistakes slide), it's a mistake and I'll fix it in January.
The changes in SPC seem straightforward here. Less so in ASPICS, of course. Maybe as a first step in the transition, we arbitrarily use the first weekday entry for ASPICS. How would we decide the "more restricted sample"? Or if we wanted to choose a weekday diary nonuniformly at random, how would we come up with sensible weights?
We can start this way, but it's an important step towards including events (which should be compatible with the diary of the day). Assuming ONS do their sampling properly, we don't need weights. I was thinking about performance, because the E09000002_Diaries.json.zip file is cumbersome and irregular, so it might be easier to have a neat pop x 10 regular matrix (repeating a few for those individuals that have < 10 options). You're the expert on efficient typing, so it's your call!
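The pop x 10 regular matrix could be produced like this (a sketch only, with the width of 10 suggested above; entries are repeated cyclically for people with fewer options):

```python
def pad_diary_options(options, width=10):
    """Pad one person's list of compatible diary IDs to a fixed width
    by repeating entries, turning the irregular JSON into a regular
    pop x width matrix row."""
    if not options:
        raise ValueError("every person should have at least one diary")
    reps = -(-width // len(options))  # ceiling division
    return (options * reps)[:width]
```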
Can we relate one person over many years? If not, it sounds like there are no changes to SPC itself. We would run SPC on the 2020 input, on the 2030 input, etc. The outputs are totally unrelated. It's up to the user to pick the relevant one, and to... know not to try to correlate anything across years?
There exist detailed trend projections for a lot of characteristics, so I intend to build something that morphs the current probability distribution for, say, age into the projected probability distribution (it's not enough to age everyone by 10 years, because that would not take into account migration, the evolution of natality, etc.)
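One simple way to morph a current distribution into a projected one is per-category importance weights (target share divided by current share). A sketch of the idea only; the real module will be driven by the ONS trend material mentioned above:

```python
from collections import Counter

def morph_weights(values, target_dist):
    """Per-record weights that reweight the empirical distribution of
    `values` (e.g. age bands) towards `target_dist`, which maps each
    category to its projected probability."""
    counts = Counter(values)
    n = len(values)
    return [target_dist[v] / (counts[v] / n) for v in values]
```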
@HSalat is the code behind this process located just on your laptop? If so are you planning to share it somehow? Thanks!
As a measure of precaution, I'm attaching the uncommented WIP R codes. code_save_30.11.22.zip
I will make a clearer version later; I ran out of time.
I noticed pid and hid changed from integers to including the MSOA. Is this intentional and necessary? Repeating the full string may have file size or perf implications downstream. Previously I grouped by (MSOA, HID) on the SPC side. If HID becomes globally unique, I'll change that.
Also, I'm assuming all of the household-level attributes in E09000002.csv are the same for an (MSOA, HID) pair. Any thoughts on deduplicating that kind of information, perhaps by having a separate households.csv file? It's not a big deal since it's an intermediate file format; the final output proto exists exactly to smooth out issues like this.
Also, there've been some renamed fields, like pid_tus to id_TUS_hh. You did say
Note that some field names have been slightly altered, which can have consequences on ASPICS.
But any other major changes I should keep in mind? And could you upload the R scripts or whatever else is generating these CSV files, so I can have better insight into the fields?
And another important note: please be careful to not overwrite https://ramp0storage.blob.core.windows.net/countydata/pop_greater-london.gz and similar at any point. The current version of SPC relies on these inputs to work, and we want to preserve reproducibility. Any new files can go in a new subdirectory, perhaps versioned like we do with the output, https://ramp0storage.blob.core.windows.net/spc-output/v1.2/durham.pb.gz
The new definitions say
hid: Unique household identifier at GB level
(MSOA11CD + number within MSOA between 00001 and 99999)
Previously, -1 indicated people not matched to a household, which we filtered out. What's the case now?
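Given the hid definition quoted above (MSOA11CD plus a number between 00001 and 99999), the identifier can be split back into its components. A sketch assuming the standard 9-character GSS code for MSOA11CD:

```python
def split_hid(hid: str):
    """Split a globally unique hid like 'E02000001' + '00042' back into
    (MSOA11CD, household number within the MSOA)."""
    msoa, num = hid[:9], hid[9:]
    return msoa, int(num)
```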
I noticed pid and hid changed from integers to including the MSOA. Is this intentional and necessary? Repeating the full string may have file size or perf implications downstream. Previously I grouped by (MSOA, HID) on the SPC side. If HID becomes globally unique, I'll change that.
For now I want to keep it open: the idea is that we would be using OA11CD from now on and the field MSOA11CD would become the redundant one, but I'm wary that it might be ambitious to run the modelling of the different flows at the OA scale. In that case, do we keep pre-computed MSOA11CD and LSOA11CD aggregates within the data or do we re-compute it from a look-up when necessary (remember I was considering leaving an option for the user to choose their desired level of precision)? So my stance is to see runtimes for the test area and choose what to keep depending on the results in the final data.
Also, I'm assuming all of the household-level attributes in E09000002.csv are the same for an (MSOA, HID) pair. Any thoughts on deduplicating that kind of information, perhaps by having a separate households.csv file? It's not a big deal since it's an intermediate file format; the final output proto exists exactly to smooth out issues like this.
Previously, -1 indicated people not matched to a household, which we filtered out. What's the case now?
The ones tagged as house should be the same, yes (note there are two NSSEC8 fields: one is for the 'head of household' and is duplicated, the other is individual and drawn from distributions for each individual that is not the head of household). The reason I'm merging it is so that I can trim the individuals not matched (there should no longer be any -1). I can re-dissociate it if needed.
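If re-dissociating ever becomes useful, a sketch of pulling the duplicated house-tagged columns into one record per hid, with a consistency check (column names here are hypothetical):

```python
def dedup_households(rows, house_cols):
    """Collapse per-person rows (list of dicts) into one record per hid
    for the household-level columns, asserting the duplicated values
    really do agree across members of the same household."""
    households = {}
    for row in rows:
        record = {c: row[c] for c in house_cols}
        prev = households.setdefault(row["hid"], record)
        if prev != record:
            raise ValueError(f"inconsistent household data for {row['hid']}")
    return households
```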
A proposal: why don't you have a go at rewriting synthpop.proto to reflect the new Definitions.txt, being more familiar with the changes?
because I don't know the syntax, so I wrote a .txt instead containing all the needed information :)
But any other major changes I should keep in mind? And could you upload the R scripts or whatever else is generating these CSV files, so I can have better insight into the fields?
The R files are above in the thread.
because I don't know the syntax, so I wrote a .txt instead containing all the needed information :)
It's not terribly obscure to learn from example (https://github.com/alan-turing-institute/uatk-spc/blob/new_schema/synthpop.proto) and is well-documented (https://developers.google.com/protocol-buffers/docs/proto). I have my hands a bit full rewriting the Rust bits and making the web app and non-SPC projects, and you're the most familiar with the changes to the schema, so it'd be a massive help.
The R files are above in the thread.
Thanks, added to git here: https://github.com/alan-turing-institute/uatk-spc/tree/main/scripts/data_prep/new. Being able to track changes to the scripts over time is massively helpful.
I've made some changes.
Note that sic1d should be a letter in the output, but converted to the equivalent number when used inside the model.
No diaries yet.
Thank you, working on adapting the code now!
pwork and the other per-person time use data are gone. Before we work on diaries, there's one dependency on old time use data... in the commuting logic, we filter for people who spend some of their daily time working. Should we look at PwkStat instead now? Which categories should be used for commuting -- EMPLOYEE_FT, EMPLOYEE_PT, EMPLOYEE_UNSPEC? Also SELF_EMPLOYED?
And actually, the schema still has ActivityDuration. If the source data lacks simplified time use now, do we also remove this?
I've got things running successfully on the sample data. https://github.com/alan-turing-institute/uatk-spc/compare/main...new_schema is all the code changes so far.
Questions:
- The lockdown logic uses home_total. What now?
- About the OA-level commuting: currently we use the distance between the point listed in the businesses CSV file (which my notes say is an LSOA centroid?) and whatever's listed in the per-person file (previously MSOA, now OA). There is no caching or batching or perf gains related to the MSOAs of people; I had tried that in #4 previously and didn't get any benefit. So, we will automatically start using OAs for this. If we have a shapefile or similar with all the OAs, we ought to include this in the proto output for convenience, or link to it clearly somewhere.
- There's a typo in the E_Rubgy column.
I haven't touched diaries yet.
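For reference, the centroid-to-centroid distance in the commuting step is just a great-circle computation over lng/lat pairs; a self-contained sketch (not the actual SPC implementation):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lng1, lat1, lng2, lat2):
    """Great-circle distance in km between two lng/lat points, e.g. an
    OA centroid and a business location."""
    lng1, lat1, lng2, lat2 = map(radians, (lng1, lat1, lng2, lat2))
    dlng, dlat = lng2 - lng1, lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlng / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))
```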
At present, I would use pwkstat between 1 and 3. I will check later how much it changes the accuracy of the methods.
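As a sketch of that filter, treating pwkstat codes 1-3 as "working" (the exact category meanings are assumptions here and should be checked against the definitions file):

```python
# Hypothetical commuting filter: keep people whose pwkstat is 1-3.
COMMUTING_PWKSTAT = {1, 2, 3}

def commuters(people):
    """Filter a list of per-person dicts down to likely commuters."""
    return [p for p in people if p.get("pwkstat") in COMMUTING_PWKSTAT]
```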
ActivityDuration should be replaced, but we need to figure out an entire data structure for the daily activities. Same for home_total: ideally, this would have to be recalculated for each day. Any ideas?
The OA shapefile exists, but it has ~175k polygons for GB. Also, it might be overkill for showing histograms in the web app for example. I'll add preparing shapefiles to the TODO list.
Same for home_total: ideally, this would have to be recalculated for each day.
Each person references diary IDs, and the diaries contain all the stuff. home_total can be calculated per diary. The question is, the lockdown stuff is calculated once per population file right now, and it uses home_total. Do we want to average over all possible diary entries or something?
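One option for the "average over all possible diary entries" idea, assuming each diary carries a home_total field (an assumption about the eventual schema):

```python
def mean_home_total(diary_ids, diaries):
    """Average home_total over one person's candidate diaries, giving a
    single value usable by the once-per-file lockdown calculation."""
    vals = [diaries[d]["home_total"] for d in diary_ids]
    return sum(vals) / len(vals)
```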
I'll add preparing shapefiles to the TODO list.
You're right, maybe we should check file size and load time impact of including OAs. Fernando helped me find https://geoportal.statistics.gov.uk/search?collection=Dataset&sort=name&tags=all(BDY_OA%2CDEC_2021). Clipping them to each area in SPC wouldn't be hard.
What I'm suggesting is that we would apply the lockdown logic for each day instead of for the entire file. Draw the diaries first, then set mean_pr_home_tot to the value for the specific day, calculated from the diaries drawn. It's more work for SPC to do, but also more reason to use it instead of collecting the pre-processed data directly from Azure!
In a way, SPC should become more about simulating "days in the life of a population" than about simulating "a population". I'd say: match children with schools and employees with office placements permanently; then, for each day, draw the diaries, draw new retail destinations for people with non-zero shopping time, and calculate optional activity reduction. I will soon add the month of the year the diary was recorded; if a diary of that month is available, it should be drawn in priority over diaries from other months.
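The month-priority rule could look like this (a sketch; field names and data layout are assumptions):

```python
import random

def draw_diary_for_day(options, month, diary_month, rng=None):
    """Draw one diary for a given day, preferring diaries recorded in
    the same month when any exist, otherwise falling back to the full
    set of compatible diaries."""
    rng = rng or random.Random(0)
    same_month = [d for d in options if diary_month.get(d) == month]
    return rng.choice(same_month or options)
```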
I love the idea of dynamically generating a day's activity+travel behavior for everyone. But this is substantial scope creep, and maybe worth a decision about where to implement this logic. With the ASPICS/SPC split, the idea has been to make SPC emit all of the data in a compact format, and put the burden of using it on the consumer. So ASPICS can simulate one person going to multiple schools with different weights, and let another consumer pick a single specific school if they prefer.
I was imagining we just plumb through all the raw data about the diaries. Something downstream can do this daily destination drawing. Maybe it starts life as a module in ASPICS, but is generally useful for other hypothetical users of this.
What lockdown data would a consumer need? We could just emit total_change_per_day instead, letting something downstream decide how to re-scale.
New reference file for diaries with months: https://ramp0storage.blob.core.windows.net/nationaldata-v2/diariesRef.csv.zip And definitions: https://ramp0storage.blob.core.windows.net/nationaldata-v2/diariesRef_Definitions.txt
Those links are broken, and I don't see any files in other directories on Azure. Why not check them into this git repo, so we have them under version control and it's easier to see changes over time?
That's because I didn't enable public access on Azure. It's fixed now.
Data extension checklist:
- NewTU changed to AzureRef [I am aware that the data structure is inefficient due to repeated information; will probably split in two later but want everything in one table for now]
- nationaldata-v2 when finished
- nationaldata-v2 when finished
- countydata-v2 when finished

To clarify, replacing QUANT is somewhere on my backlog, but not happening anytime soon.
As new input data becomes ready, it'd be great to get it in Azure (under a new v2 directory), so we can start pointing the Rust code at real files somewhere
Ok, can you add the one currently in use to nationaldata-v2? There are two of them in the other folder and I can't quite remember the history of the one tagged "_spc"
Things are coming together! Aside from the countydata files from SPENSER, we also need
OSM is downloaded inside the model according to the links from the OSM field of the lookup. The new look-up includes Wales and Scotland, but there are no finer sub-regions available for those.
Not sure who made those files; I can extract them from the lookup if you want.
Ah, you're right, the OSM part is already sorted -- the Geofabrik URL is in lookUp-GB.csv.
Generating the config files from the lookup isn't hard. I guess we just need to decide what we provide at https://alan-turing-institute.github.io/uatk-spc/outputs.html -- England counties and some special areas so far.
1-to-1 with Azure countydata-v2 (under AzureRef inside the lookup), which are:
For special areas, I would add Wales and Scotland to the existing ones.
Status of missing LADs for England
Known issues (no fix):
SPENSER currently failing:
Will be added next week:
I ran all 390 combos of years/regions: 293 successes, 97 failures. The failures are mostly Scotland and config/special areas; the cause is the missing data mentioned above.
Status of missing LADs for England
Known issues (no fix):
SPENSER currently failing:
Is it because of inconsistencies with the area code?
I'm not sure. I recognise some of the codes from the list of LADs that were scrapped between 2011 and 2020, but there were other reasons why SPENSER was failing for some other LADs. REG will investigate!
The following are now included following fixes to our SPENSER pipeline (further detail spc-hpc-pipeline #31) and SPC (#58):
The updates can be included in a v2.1 release - I'll open a PR for this to capture the changes.
Tbc.
A sample LAD of the new raw population file, a field dictionary and some comments will be added soon™.