Leeds-MRG / Minos

SIPHER Microsimulation for estimating the effect of income policy on mental health.
MIT License

Generating replenishing populations beyond the length of US data #27

Closed: ld-archer closed this issue 2 years ago

ld-archer commented 2 years ago

At present, replenishing populations are generated from data files derived from Understanding Society (US) data. This means the microsimulation is currently limited to 2019 (wave 11 of US). We need a way to generate replenishing populations well into the future, for as long as the project plans to simulate.

Error when running beyond 2019:

2022-03-17 10:16:50
In year:  2020
alive 58900
dead 14819
2022-03-17 10:16:52.035 | DEBUG    | vivarium.framework.engine:step:169 - 2020-09-30 06:00:00
Traceback (most recent call last):
  File "scripts/run_in_console.py", line 12, in <module>
    simulation = run_pipeline(configuration_file, input_data_dir, persistent_data_dir, output_dir)
  File "/home/luke/Documents/MINOS/Minos/scripts/run.py", line 97, in run_pipeline
    simulation = RunPipeline(config, start_population_size)
  File "/home/luke/Documents/MINOS/Minos/minos/VphSpenserPipeline/RunPipeline.py", line 106, in RunPipeline
    simulation.run_for(duration=pd.Timedelta(days=365.25))
  File "/home/luke/anaconda3/envs/Minos/lib/python3.8/site-packages/vivarium/interface/interactive.py", line 105, in run_for
    return self.run_until(self._clock.time + duration, with_logging=with_logging)
  File "/home/luke/anaconda3/envs/Minos/lib/python3.8/site-packages/vivarium/interface/interactive.py", line 129, in run_until
    self.take_steps(number_of_steps=iterations, with_logging=with_logging)
  File "/home/luke/anaconda3/envs/Minos/lib/python3.8/site-packages/vivarium/interface/interactive.py", line 158, in take_steps
    self.step(step_size)
  File "/home/luke/anaconda3/envs/Minos/lib/python3.8/site-packages/vivarium/interface/interactive.py", line 66, in step
    super().step()
  File "/home/luke/anaconda3/envs/Minos/lib/python3.8/site-packages/vivarium/framework/engine.py", line 172, in step
    self.time_step_emitters[event](self._population.get_population(True).index)
  File "/home/luke/anaconda3/envs/Minos/lib/python3.8/site-packages/vivarium/framework/lifecycle.py", line 380, in _wrapped
    return method.__func__(*args, **kwargs)
  File "/home/luke/anaconda3/envs/Minos/lib/python3.8/site-packages/vivarium/framework/event.py", line 122, in emit
    listener(e)
  File "/home/luke/Documents/MINOS/Minos/minos/modules/replenishment.py", line 169, in on_time_step
    new_wave = pd.read_csv(f"data/corrected_US/{self.current_year}_US_cohort.csv")
  File "/home/luke/anaconda3/envs/Minos/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/luke/anaconda3/envs/Minos/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/luke/anaconda3/envs/Minos/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/luke/anaconda3/envs/Minos/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/luke/anaconda3/envs/Minos/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1217, in _make_engine
    self.handles = get_handle(  # type: ignore[call-overload]
  File "/home/luke/anaconda3/envs/Minos/lib/python3.8/site-packages/pandas/io/common.py", line 789, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: 'data/corrected_US/2020_US_cohort.csv'
Makefile:44: recipe for target 'testRun' failed
make: *** [testRun] Error 1

One way of generating these populations is to use the same method as in the FEM: reweight an initial replenishing population by demographic characteristics. We would need to attach a cross-sectional analysis weight variable from Understanding Society, which may not be trivial according to the weighting FAQs.
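A minimal sketch of what attaching that weight could look like, assuming the cohort files carry the Understanding Society person identifier (`pidp`); the raw-file path and the weight column name are placeholders, not the real US variable names:

```python
import pandas as pd

# Illustrative sketch only: attach a cross-sectional analysis weight from a
# raw US wave file to our cohort data. "pidp" is the US person identifier;
# "xw_weight" and the file paths are placeholders.
cohort = pd.read_csv("data/corrected_US/2019_US_cohort.csv")
raw_wave = pd.read_csv("data/raw_US/wave_11_indresp.csv",
                       usecols=["pidp", "xw_weight"])

cohort = cohort.merge(raw_wave, on="pidp", how="left")
# Respondents without a weight need a decision: drop them, or fall back to a
# default such as the mean weight.
cohort["weight"] = cohort["xw_weight"].fillna(cohort["xw_weight"].mean())
```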

Steps:

RobertClay commented 2 years ago

It may just be simpler to stop adding cohorts at 2020, at least for now.
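For reference, that stop-gap could be as simple as a guard in the replenishment module's `on_time_step` (the method in the traceback above). Only `self.current_year` and the file path come from the traceback; the rest is an illustrative sketch, not the actual module code:

```python
import os
import pandas as pd

def on_time_step(self, event):
    # Skip replenishment once there is no US cohort file for the current year.
    cohort_file = f"data/corrected_US/{self.current_year}_US_cohort.csv"
    if not os.path.exists(cohort_file):
        return  # no US data for this year; keep simulating without replenishing
    new_wave = pd.read_csv(cohort_file)
    ...  # existing logic that adds new_wave to the simulation
```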

ld-archer commented 2 years ago

The cross-sectional analysis weight has now been added (weight); see #28 for more info on that.

We now need to modify the replenishment module so it can re-weight and add new populations into the future based on key statistics. For a first attempt, we'll re-weight by age, sex, and ethnicity using data from the ONS principal population projections.
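A rough sketch of what that re-weighting step could look like, assuming the cohort carries a `weight` column and the ONS projections are available as counts per age/sex/ethnicity cell (all names here are placeholders, not the module as implemented):

```python
import pandas as pd

def reweight(cohort, projections, strata=("age", "sex", "ethnicity")):
    """Scale analysis weights so that weighted counts match external
    projection counts within each age/sex/ethnicity cell.

    `projections` is assumed to hold the strata columns plus a `count`
    column taken from the ONS principal projections."""
    strata = list(strata)
    weighted = (cohort.groupby(strata)["weight"].sum()
                      .rename("weighted_count")
                      .reset_index())
    factors = projections.merge(weighted, on=strata)
    factors["factor"] = factors["count"] / factors["weighted_count"]
    cohort = cohort.merge(factors[strata + ["factor"]], on=strata, how="left")
    cohort["weight"] = cohort["weight"] * cohort["factor"].fillna(1.0)
    return cohort.drop(columns="factor")
```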

Current replenishment behaviour is to add people from the current wave of Understanding Society to the model. This is fine whilst we are nowcasting (simulating from 2009-2019) but doesn't work into the future. I think we need to adjust how replenishment works, to instead add a group of 16-year-olds (the youngest age) into the model at each wave, rather than a group of all ages. We can then re-weight the 16-year-old cohort by the key statistics to get a representative population into the future. I think this would mean that we have to start with a larger group, i.e. everyone present halfway through the survey (wave 5/6).
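One way the adjusted replenishment could look, reusing the `reweight` sketch above and assuming a projections file restricted to 16-year-olds (file paths, function and column names are all illustrative):

```python
import pandas as pd

def build_replenishing_cohort(year):
    # Take only the 16-year-olds from the incoming cohort file and scale
    # their weights to the projections for that year, rather than adding a
    # whole wave of all ages.
    wave = pd.read_csv(f"data/corrected_US/{year}_US_cohort.csv")
    sixteens = wave[wave["age"] == 16].copy()
    projections = pd.read_csv("data/ons/projected_16_year_olds.csv")
    return reweight(sixteens, projections[projections["year"] == year])
```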

ld-archer commented 2 years ago

The current plan is to change the replenishment functionality so that, at each wave, we add in new 16-year-olds to replace those who have aged out of that bracket. This adds the complication that the highest level of education will change for these groups, probably between ages 16 and 30. This means we will need an education module to update the highest educational qualification attained for people in these groups.

One idea to handle this is to run a linear model over a number of factors to predict the highest level of educational attainment a simulant will obtain, and to make this prediction when they are added to the model. That is, at age 16 we predict the highest educational attainment, and apply the change later on when they reach the required age. This would probably mean creating a temporary variable for the predicted highest attainment, and changing the actual highest level of education at a specified time. Two alternatives are to make the change immediately, or to run the module only at a single age (say 25) and change the education level once for everyone, but both of these ideas have downsides.
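A sketch of the "predict at 16, apply later" idea, with a stand-in fitted model and an illustrative qualification-to-age mapping (none of these names are the real module's):

```python
# Illustrative mapping from predicted highest qualification to the age at
# which it would be awarded; labels and ages are stand-ins.
QUALIFICATION_AGE = {"gcse": 16, "a_level": 18, "degree": 21, "higher_degree": 24}

def on_entry(new_simulants, model):
    """When a 16-year-old cohort joins, store the *predicted* final
    qualification in a temporary column, leaving current education untouched."""
    new_simulants["max_educ"] = model.predict(new_simulants)
    return new_simulants

def on_time_step(population):
    """Each year, promote anyone who has reached the age at which their
    predicted qualification should be awarded."""
    due = population["age"] >= population["max_educ"].map(QUALIFICATION_AGE)
    population.loc[due, "education"] = population.loc[due, "max_educ"]
    return population
```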

ld-archer commented 2 years ago

Replenishing populations are now generated during the data generation pipeline for 2019-2070, and the analysis weights are adjusted based on counts by age, sex, and year (only 16-year-olds are included in this population). The starting population has also been reweighted, as the adjusted weights for the replenishing population differed from the originals, and I wanted to keep the starting and replenishing population weights in a similar range.
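For illustration, the data-generation step could look something like this, with placeholder paths, column names, and base year (it only shows the shape of the step, not the pipeline as written):

```python
import pandas as pd

base = pd.read_csv("data/corrected_US/2018_US_cohort.csv")  # last observed cohort (year illustrative)
base = base[base["age"] == 16].copy()
projections = pd.read_csv("data/ons/principal_projections.csv")  # columns: year, age, sex, count

for year in range(2019, 2071):
    cohort = base.copy()
    target = projections[(projections["year"] == year) & (projections["age"] == 16)]
    # Scale weights so the weighted number of 16-year-olds of each sex matches
    # the projected count for this year.
    for sex, count in target.set_index("sex")["count"].items():
        mask = cohort["sex"] == sex
        cohort.loc[mask, "weight"] *= count / cohort.loc[mask, "weight"].sum()
    cohort.to_csv(f"data/corrected_US/{year}_US_cohort.csv", index=False)
```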

Plan for handling education:

The first iteration of this will change education at the same ages for everyone, so the education trajectory for people with the same highest qualification will be exactly the same. This is not fully representative of real life, but it is a reasonable decision for a first attempt. It means full-time students will go through all the stages of education from 16-30 (up to PhD) and will join the workforce with their highest qualification, so the labour state will be able to use that information. It does mean we miss some students who work part-time jobs alongside studying, but that can be picked up in a later iteration.
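A sketch of what a fixed trajectory could look like; the ages and level labels are illustrative, not the values used in Minos:

```python
# Everyone who ends up with a given highest qualification attains the same
# levels at the same ages.
EDUCATION_TRAJECTORY = {
    "a_level":       [(18, "a_level")],
    "degree":        [(18, "a_level"), (21, "degree")],
    "higher_degree": [(18, "a_level"), (21, "degree"), (23, "masters")],
    "phd":           [(18, "a_level"), (21, "degree"), (23, "masters"), (30, "phd")],
}

def education_at(highest, age):
    """Level held at `age` by a simulant whose predicted highest level is `highest`."""
    level = "gcse"  # assumed level on entry at 16
    for attained_age, attained_level in EDUCATION_TRAJECTORY.get(highest, []):
        if age >= attained_age:
            level = attained_level
    return level
```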