Add more years to fertility from NewEthPop

paddy-r commented 1 year ago

Issues to discuss/clarify with Rob and Luke during development, to be converted to jobs if agreed:

(1) Clarify difference b/t key_columns and parameter_columns in add_new_birth_cohorts.setup -> DONE (see Rob's comment below)
(2) Which is current, FertilityAgeSpecificRates or nkidsFertilityAgeSpecificRates? (Presumably the latter.) -> DONE (the latter)
(3) Currently data_generation.convert_rate_data, which generates rate table file, not called anywhere. How about calling during installation/setup to ensure regional output files present? Or during fertility module initialisation? -> DONE as moved to job below
(4) ~~Should mean be weighted (by population? NewEthPop data exist) in collapse_LAD_to_region (also, collapse_location)?~~ -> DONE, as converted to job below
(5) Are LA and region definitions current? If not, can use code from Inclusive Economy to generate lookups -> DONE as moved to new issue (#219), could be useful but not a priority for now as currently aggregated into region anyway
(6) Why LAs used in BaseHandler.compute_migration_rates (presumably for migration modules?) but regions used in fertility module? -> DONE, as answered by Rob below
(7) Think about how to generalise output/logging functionality in RunPipeline (and already a comment about it there) as very useful for me during fertility module development (cf. #167) -> already partly addressed in job below -> create new issue if good idea to add more detailed functionality -> marking as DONE as vague and not priority; also at least partly covered by job below
(8) ~~How/where to generate/view specific variables during simulation, in the first instance fertility rate and year for which data is sought and year for which data are available?~~ -> marking as DONE as (1) vague, (2) will become clearer over time and (3) at least partly covered by jobs below
(9) How to visualise effects on SF-12? Will only be tiny numerical differences for now (as only changing range of NewEthPop data used here), but would be good to understand how to do it for later in fertility development process. E.g. need new make target somewhere (outcomes/Makefile) -> marking as DONE because vague and covered by jobs below
(99) Once everything here done, discuss duplicating functionality to mortality module, as very similar (e.g. rate table generation, as format of NewEthPop fertility and mortality input data is almost identical) -> DONE as podded off into another issue (#213)

Rough to-do list:

[x] Generate all-years (i.e. all in NewEthPop, as currently only done for 2011) fertility file in data_generation.convert_rate_data, as similar functionality already there
[x] FutureWarning in data_generation.convert_rate_data; also podded off into #212 as called elsewhere (i.e. outside fertility module) as well
[x] Ensure path to all-years fertility input file passed to FertilityRateTable.__init__ when called from add_new_birth_cohorts -> changed file to that containing all years, format is identical except year column added; only grabbing 2011-2012 though
[x] Ensure correct year range used in FertilityRateTable._build, currently hard-coded to pass year_start = 2011, year_end = 2012 to transform_rate_table
[x] Add get_nearest_year functionality if nothing present, but where? In utils? Purpose is to get nearest year of data for a particular simulation year, in case that particular year isn't present in rate table. Put to new issue later if useful for other modules/general use
[x] Re. two points above, add functionality to select greatest number of possible years specified in config from available intermediate cache; NewEthPop available for 2011-2061, but what if simulation is for 2009-2013 => would want [2012, 2013]; simplest way is to get nearest years in cached file from years specified in config file
[x] Ensure data_generation.convert_rate_data called during fertility/mortality module initialisation if necessary (i.e. if cached file not present); also added to #213
[x] Calculate weighted mean (rather than unweighted, as currently) in collapse_LAD_to_region (also, collapse_location), e.g. with NewEthPop population data -> marking as done here as moved to issue #218
[x] Verify that asfr ultimately has all years of NewEthPop fertility data in memory -> just checked with print statement, not necessary to do anything more than that; also done for mortality, see #213
[x] In add_new_birth_cohorts.nkidsFertilityAgeSpecificRates.setup add year to requires_columns (argument to register_rate_producer), but see (1) above -> marking as done as not necessary for year but is for parity? Added to #167 for now
[x] Also add year to view_columns (argument to get_view)? Also see (1) above -> don't need to add year, but do need to add parity? -> marking as done as not necessary for year but is for parity? Added to #167 for now
[x] Verify/understand how variables are generated and passed through entire pipeline, i.e. from config file, through Vivarium, to output, for my understanding, e.g. REGION.name and ETH.GROUP in BaseHandler; exact process not clear ATM, so add detail/new issue later; cf. (7) above -> marking as done as grouped into #220
[x] Add year ~~(and parity, which should just be a dummy for now; to be addressed in another issue, cf. #167)~~ to N-nested for loops in BaseHandler to account for all-year rate table
[x] Consider automating the creation of N-nested for loops, e.g. combinatorially via itertools.combinations (v. easy) -> moving to new issue (#217) but probably not necessary and not a priority
[x] Add try-except block to BaseHandler.cache as rate_table_path not defined by default -> marking as done as moved to #221
[x] In RunPipeline, move components map(s) outside of method/class in case useful elsewhere
[x] Create functionality ~~(priority_sort)~~ to sort components by priority in RunPipeline, re. Rob's warning in config files -> validate_and_sort_components, called from `RunPipeline`
[x] As above, but revised following discussion of point (10)
[x] Add some logging/print statements at bottom of RunPipeline.RunPipeline for nkidsFertilityAgeSpecificRates and fertility by year (partly addresses one point in #167) -> marking as done as grouped into #220
[x] How to visualise effect on SF-12? -> marking as done as grouped into #220
[x] Create new fertility config file(s) for testing -> currently fertility_default.yaml
[x] ~~Configure config file to take either (a) new, compiled all-years fertility file, or (b) folder (of NewEthPop fertility data) rather than single file (depending on how development goes/discussion)~~ -> marking as done as duplicate of another job above, and specifying folder rather than single file is unnecessary
[x] Remove current text on order of components in config file once priority_sort functionality done, and add some alternative notes there -> only in fertility_default.yaml for now
[x] Create new fertility baseline target in scripts/Makefile, for comparison with old fertility baseline -> currently fertility_testing in scripts/Makefile
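
A minimal sketch of the get_nearest_year idea from the list above, assuming the available years are simply the years present in the cached rate table. The function body and the helper `nearest_years_for_simulation` are illustrative and not the code that ended up in Minos; only the 2011-2061 range comes from the issue text.

```python
# Illustrative only: the real get_nearest_year in Minos may differ.
NEWETHPOP_YEARS = range(2011, 2062)  # 2011-2061 inclusive, per the issue text


def get_nearest_year(target_year, available_years):
    """Return the available year closest to target_year.

    Ties are resolved in favour of the earlier year, because min() keeps
    the first minimal element and the years are sorted ascending.
    """
    years = sorted(available_years)
    if not years:
        raise ValueError("no years available in rate table")
    return min(years, key=lambda y: abs(y - target_year))


def nearest_years_for_simulation(sim_years, available_years):
    """Map each simulation year to the nearest year that has data (hypothetical helper)."""
    return {year: get_nearest_year(year, available_years) for year in sim_years}


if __name__ == "__main__":
    # A simulation starting before the data does falls back to the first data year.
    print(nearest_years_for_simulation(range(2009, 2014), NEWETHPOP_YEARS))
```

Simulation years inside the data range map to themselves, so the fallback only matters at the edges of the NewEthPop range.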

RobertClay commented 1 year ago

will come up with some answers for thursday.

paddy-r commented 1 year ago

> will come up with some answers for thursday.

Thanks, I'll try and get a load of the jobs done in the meantime.

RobertClay commented 1 year ago

(1) Clarify difference b/t key_columns and parameter_columns in add_new_birth_cohorts.setup

Another very undocumented part of vivarium. It's an interpolated lookup table. Make sure you understand lookup tables and linear interpolation before you read this. key_columns are the lookup variables. E.g. for key_columns = [region, sex, ethnicity] it will find the rows in the lookup table with those values, like [East Midlands, F, BAN]. There can be more than one row here.

parameter_columns = [age, time] is more complicated and uses linear interpolated lookup (order 0 I think?). For an observation in the population you can have continuous age and year timestamp e.g. [age, year] = [51.1245, 2012.12412]. The problem is how to estimate fertility rate given we have discrete values in the lookup table at age 51/52 and years 2012/2013. In the lookup table age_specific_fertility_rate we provide vivarium 4 columns age_start, age_end, year_start, year_end. Specifying parameter_columns age and time tells vivarium that observations on these values will be continuous data and which columns to use for start and end points of linear interpolation. This is probably better demonstrated with a diagram. Happy to discuss more.
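
To make the distinction concrete, here is a minimal sketch in plain Python. It is not vivarium's implementation: the table rows, the rate values and the `lookup_rate` helper are made up for illustration, and the time parameter is called `year` here to match the year_start/year_end columns. Key columns are matched exactly, while each continuous parameter value is placed in the half-open bin [start, end) that contains it, which is what order-0 interpolation amounts to.

```python
# Hypothetical rate table: one row per key combination x age bin x year bin.
RATE_TABLE = [
    {"region": "East Midlands", "sex": "F", "ethnicity": "BAN",
     "age_start": 51, "age_end": 52, "year_start": 2012, "year_end": 2013,
     "rate": 0.002},
    {"region": "East Midlands", "sex": "F", "ethnicity": "BAN",
     "age_start": 52, "age_end": 53, "year_start": 2012, "year_end": 2013,
     "rate": 0.001},
    {"region": "East Midlands", "sex": "F", "ethnicity": "BAN",
     "age_start": 51, "age_end": 52, "year_start": 2013, "year_end": 2014,
     "rate": 0.003},
]

KEY_COLUMNS = ("region", "sex", "ethnicity")  # matched exactly
PARAMETER_COLUMNS = ("age", "year")           # continuous, matched by bin


def lookup_rate(table, keys, params):
    """Return the rate of the row whose keys match and whose bins contain params."""
    for row in table:
        keys_match = all(row[k] == keys[k] for k in KEY_COLUMNS)
        in_bins = all(
            row[f"{p}_start"] <= params[p] < row[f"{p}_end"]
            for p in PARAMETER_COLUMNS
        )
        if keys_match and in_bins:
            return row["rate"]
    raise KeyError(f"no row for keys={keys}, params={params}")


if __name__ == "__main__":
    person_keys = {"region": "East Midlands", "sex": "F", "ethnicity": "BAN"}
    # Continuous age and timestamp, as in the example above.
    print(lookup_rate(RATE_TABLE, person_keys, {"age": 51.1245, "year": 2012.12412}))
```

The exact key match narrows the table to several rows (one per age/year bin), and the parameter bins then pick a single row out of those.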

(2) Which is current, FertilityAgeSpecificRates or nkidsFertilityAgeSpecificRates? (Presumably the latter.)

The latter.

(3) Currently data_generation.convert_rate_data, which generates rate table file, not called anywhere. How about calling during installation/setup to ensure regional output files present? Or during fertility module initialisation?

It should be called somewhere, yes. Are you sure it's not in the fertility pre_setup function .set_rate_table()? I believe they're cached, as they can be quite expensive to generate, particularly if you're adding more data in.
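
A minimal sketch of the generate-only-if-missing pattern discussed here. The `ensure_rate_table` helper and its `generate` argument are hypothetical; `generate` stands in for data_generation.convert_rate_data, whose real signature is not shown in this thread.

```python
from pathlib import Path


def ensure_rate_table(cache_path, generate):
    """Build the cached rate table only if it is not already on disk.

    `generate` is any callable taking the target path and writing the
    file there (an assumption made for this sketch).
    """
    cache_path = Path(cache_path)
    if not cache_path.exists():
        # Create the cache folder if needed, then run the expensive step once.
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        generate(cache_path)
    return cache_path
```

Calling something like this during module initialisation keeps the expensive conversion off the normal run path once the cache exists.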

(4) Should mean be weighted (by population? NewEthPop data exist) in collapse_LAD_to_region (also, collapse_location)?

Not sure. I did this very roughly and not sure if there are suitable weights available. One to discuss on video I think.
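
For reference, a small sketch of the difference between the two means when collapsing LAD rates to a region. The rates and populations are invented, and `weighted_mean` is a hypothetical helper, not the code in collapse_LAD_to_region.

```python
def weighted_mean(rates, weights):
    """Population-weighted mean of per-LAD rates."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights sum to zero")
    return sum(r * w for r, w in zip(rates, weights)) / total


if __name__ == "__main__":
    lad_rates = [0.10, 0.20]       # made-up fertility rates for two LADs
    lad_populations = [3000, 1000]  # made-up populations used as weights
    unweighted = sum(lad_rates) / len(lad_rates)
    weighted = weighted_mean(lad_rates, lad_populations)
    # The larger LAD pulls the weighted mean towards its own rate.
    print(unweighted, weighted)
```

With these numbers the unweighted mean sits halfway between the two rates, while the weighted mean is closer to the rate of the more populous LAD.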

(5) Are LA and region definitions current? If not, can use code from Inclusive Economy to generate lookups

I believe they're 2019? I had to manually adjust some areas (Northamptonshire/Gloucestershire?) that changed their boundaries recently. Your IE code will be better.

(6) Why LAs used in BaseHandler.compute_migration_rates (presumably for migration modules?) but regions used in fertility module?

We don't use migration in MINOS. It's from the old model Daedalus, which does use LA-level data. I'd say ignore it for now, but Nik would probably love you if you did some maintenance on Daedalus too.

(7) Think about how to generalise output/logging functionality in RunPipeline (and already a comment about it there) as very useful for me during fertility module development (cf. https://github.com/Leeds-MRG/Minos/issues/167) -> already partly addressed in job below -> create new issue if good idea to add more detailed functionality

Luke's done a lot of logging. I'd suggest talking to him, but the Python logging module is usually pretty clear and easy to add to. The more the merrier.
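
A minimal sketch of what this could look like with the standard logging module, covering the requested-year versus data-year case from point (8). The logger name `minos.fertility` and the `log_rate_lookup` helper are assumptions for illustration, not existing Minos code.

```python
import logging

# Hypothetical logger name; Minos may organise its loggers differently.
logger = logging.getLogger("minos.fertility")


def log_rate_lookup(requested_year, used_year, rate):
    """Record which year was asked for and which year of data was used."""
    logger.info(
        "fertility rate lookup: requested year=%d, data year=%d, rate=%.5f",
        requested_year, used_year, rate,
    )


if __name__ == "__main__":
    # basicConfig sends INFO messages to stderr; a file handler would work too.
    logging.basicConfig(level=logging.INFO)
    log_rate_lookup(2009, 2011, 0.0123)
```

Passing the values as arguments rather than pre-formatting the string means the message is only built when INFO logging is enabled.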

(8) How/where to generate/view specific variables during simulation, in the first instance fertility rate and year for which data is sought and year for which data are available?

PyCharm debug flags may be useful here? Or some kind of verbose mode.

(9) How to visualise effects on SF-12? Will only be tiny numerical differences for now (as only changing range of NewEthPop data used here), but would be good to understand how to do it for later in fertility development process. E.g. need new make target somewhere (outcomes/Makefile)

Are the current lineplots we have sufficient? This is a larger problem we're having at the moment for how to visualise the csv outputs. Discuss.

(99) Once everything here done, discuss duplicating functionality to mortality module, as very similar (e.g. rate table generation, as format of NewEthPop fertility and mortality input data is almost identical)

100% do this next. They're very similar with slight differences (e.g. men can die but not give birth).

RobertClay commented 1 year ago

Interpolated lookup diagram.

paddy-r commented 1 year ago

Another question, very trivial...

(10) Which have higher priority, interventions or mortality/fertility modules, based on text in default.yaml and RunPipeline?

paddy-r commented 1 year ago

> Another question, very trivial...
>
> (10) Which have higher priority, interventions or mortality/fertility modules, based on text in default.yaml and RunPipeline?

From discussion, 20/04/23, priority is:

1. Replenishment
2. Fertility, then mortality
3. Intervention (if present)
4. All pathways
5. SF-12

Added to list of jobs.
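
A rough sketch of sorting components by the priority order above. This is not the actual validate_and_sort_components: the `PRIORITY` mapping, the kind labels and the (name, kind) tuples are assumptions for illustration.

```python
# Lower number runs first; order taken from the discussion above.
PRIORITY = {
    "replenishment": 0,
    "fertility": 1,
    "mortality": 2,
    "intervention": 3,
    "pathway": 4,
    "sf12": 5,
}


def sort_components(components):
    """Sort (name, kind) pairs by priority, rejecting unknown kinds.

    sorted() is stable, so components of the same kind (e.g. several
    pathways) keep the order they had in the config file.
    """
    unknown = [name for name, kind in components if kind not in PRIORITY]
    if unknown:
        raise ValueError(f"components with unknown kind: {unknown}")
    return sorted(components, key=lambda comp: PRIORITY[comp[1]])


if __name__ == "__main__":
    # Hypothetical component names in an arbitrary config order.
    config_order = [
        ("SF12", "sf12"),
        ("Housing", "pathway"),
        ("Mortality", "mortality"),
        ("Income", "pathway"),
        ("Fertility", "fertility"),
        ("Replenishment", "replenishment"),
    ]
    for name, kind in sort_components(config_order):
        print(name, kind)
```

Validating before sorting means a misspelt or missing kind fails loudly at setup rather than silently running in the wrong place.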

paddy-r commented 1 year ago

Closed with #259.

Leeds-MRG / Minos

Add more years to fertility from NewEthPop #211