Update initial condition specifications and use of geodata file

Currently both geodata and initial conditions define the population and mobility. We should reduce this confusion/redundancy.

I don't understand what this means. The initial conditions file does not define the population and mobility. However, if you specify the initial size of each compartment in the initial conditions file, the sizes should add up to the total population size specified in geodata. I THINK an error is thrown right now if they don't. ALSO, there is an option in the code where you don't have to give the ICs for every compartment: you can give them for some and write "rest" for another compartment, and it will use the total population size in geodata, minus the compartments already specified in the ICs, to populate the "rest" compartment. BUT I believe this "rest" option is not currently working for some reason: I tried it and it failed, despite being in the code.

Related: Perhaps consider renaming "geodata" to something more general like "popstruct"

geodata and initial conditions are doing the same thing. Ideally subpopulations should be defined by a single file (like the location.csv in the hubverse notation).

The population will be allowed to vary, e.g. for birth and death processes. Currently geodata's population is used as the node denominator, but problems arise when the initial_condition population is not equal to it. These are currently handled by failing when the totals don't match. However, it is hard to get matching totals because of floating-point errors.

Discussion on issue #91

Now the issue of 82 is that: geodata specifies to gempyor:

the list of subpops and their name (this is fine)
the population
- this is ambiguous. Where does this population go if initial_conditions are not set?? By default it goes to the first meta_compartment (here S_unvaccSuscH1_H1N1_age0to4_14to15), which is foot-shooting behavior. The goal is to force the user to use initial conditions.
- But if we force the user to set initial_conditions, what happens to geodata's population? Well, for now there is a check that the initial_conditions' population and geodata's population match, however:
  - This is hard to do (software), so there is often a difference greater than the threshold (1 person), which causes the model to fail.
  - In response to this failure, the user (as I did here) just adds the flag ignore_population_check: True
  - which in turn, if the difference is large (and it's easy to miss a large difference), gives us the bug from this run: the denominator of the homogeneous mixing (from geodata) is different from the actual population being mixed (from initial conditions)
So the goal is to force initial_conditions and to get population from these. Moreover, with birth-death processes that might be added, the population is probably not the concept we want to keep fixed per subpop.
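To make this concrete, here is a minimal Python sketch (not gempyor's actual code) of taking each subpop's population from the initial conditions and flagging disagreements with geodata using a tolerance instead of an exact match. The function names, the row layout, and the tolerance values are all assumptions for illustration.

```python
import math
from collections import defaultdict


def population_from_initial_conditions(ic_rows):
    """Sum compartment amounts per subpop.

    ic_rows: iterable of (subpop, mc_name, amount) tuples (hypothetical layout).
    """
    totals = defaultdict(float)
    for subpop, _mc_name, amount in ic_rows:
        totals[subpop] += amount
    return dict(totals)


def find_population_mismatches(geodata_pop, ic_rows, rel_tol=1e-9, abs_tol=1.0):
    """Return {subpop: (geodata_value, ic_total)} where the two differ beyond tolerance."""
    ic_pop = population_from_initial_conditions(ic_rows)
    mismatches = {}
    for subpop, expected in geodata_pop.items():
        actual = ic_pop.get(subpop, 0.0)
        # math.isclose absorbs floating-point noise; abs_tol mirrors the 1-person threshold
        if not math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches[subpop] = (expected, actual)
    return mismatches


# Made-up data: subpop "A" matches up to rounding, subpop "B" is off by 500 people.
ic_rows = [
    ("A", "S", 999.9999999),
    ("A", "I", 0.0000001),
    ("B", "S", 1500.0),
]
geodata_pop = {"A": 1000.0, "B": 2000.0}
mismatches = find_population_mismatches(geodata_pop, ic_rows)
print(mismatches)
```

If the population is always derived from the initial conditions, as in the first function, the comparison becomes a warning at most and the mixing denominator can no longer drift away from the population actually being mixed.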

I thought a lot about this last few days and agree with the issues with geodata file vs initial_conditions, and have detailed a proposal for dealing with it all:

get rid of config: setup_subpop option and geodata type file
require the initial_conditions section of the config. By definition, any dynamical system must have initial conditions so this makes sense
Allow the following initial condition methods (I just made up names for now, can be changed, but would love something clearer than what we have now)
- method:FromFileInput
  - This should be the default and simplest option for users making simple config files
  - initial_conditions:initial_conditions_file is a csv or parquet file with the columns subpop, mc_name, amount, where mc_name is something like S_child_unvaxxed. amount is the # of individuals in the compartment at the model start time. The user does not need to give initial conditions for every compartment; those not listed will be assumed to be zero (see other config options below), and the total initial population size for each subpopulation will be taken as the sum of the specified initial conditions for all compartments in that subpopulation
  - Other config options:
    - allow_missing_compartments: TRUE or FALSE. Defaults to FALSE, which throws an error if initial conditions are not specified for every single compartment in each subpopulation. If TRUE, will assume missing compartments have zero initial condition.
    - REMOVE the former "allow_missing_nodes" option, as now if we don't list an initial_condition for a node the population size will remain zero forever
    - REMOVE the former "proportional" option, as we now need initial conditions file to have information on the total population size
  - when the model runs, it creates an init file as an output directory type; that init file has initial conditions for ALL compartments in the same format as model_output, one file for each slot+iteration
- method: FromFileOutput
  - This option is for when users want to use the output of a previous simulation for the initial conditions of the current simulation
  - initial_conditions:initial_conditions_file is a csv or parquet file with the columns something like mc_value_type, mc_strata1, mc_strata2, ..., mc_strataN, mc_name, subpop_name_1, ..., subpop_name_n, date. All compartments for all subpopulations must be listed. The total initial population size for each subpopulation will be taken as the sum of the specified initial conditions for all compartments in that subpopulation at the time of the simulation start
  - No other config options
- method: FromFileInputProportional
  - This method is similar to FromFileInput except that the user can give the initial conditions as fractions of the total population size, and must specify a separate file with the initial total population sizes for each subpopulation
  - initial_conditions:subpop_file is a csv or parquet file with the columns subpop, population, where "population" is the # of individuals, just like the existing geodata file.
  - initial_conditions:initial_conditions_file is a csv or parquet file with the columns subpop, mc_name, proportion, where mc_name is something like S_child_unvaxxed. "proportion" is the fraction of the population initially in that compartment, or the term "rest", which means the fraction not specified will be allocated to this compartment. The user does not need to give initial conditions for every compartment; those not listed will be assumed to be zero (see other config options below). Each subpop must have at least one entry, and the sum of those entries must be less than 1. If "proportion" is "rest" for the only compartment specified, the entire population size is assigned there
- method: FromFolderInput
  - This method is identical to FromFileInput, except that now the user specifies a directory instead of a file. Inside that directory there should be a list of files, and their names should be numbers that correspond to the independent simulation numbers (slots for inference or just simulations for non-inference runs). (Note: before we had to specify the entire output file name with runID etc but I don't see why this is needed)
- method: FromFolderOutput
  - This method is identical to FromFileOutput, except that now the user specifies a directory instead of a file. Inside that directory there should be a list of files, and their names should be numbers that correspond to the independent simulation numbers (slots for inference or just simulations for non-inference runs). (Note: before we had to specify the entire output file name with runID etc but I don't see why this is needed)
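As a rough illustration of the FromFileInput behaviour proposed above, the following Python sketch fills in missing compartments and derives subpop totals from the rows of a long-format file. It is only a sketch under assumptions: the function name, the in-memory row format (dicts instead of a csv/parquet reader), and the compartment names other than S_child_unvaxxed are made up.

```python
def build_initial_conditions(ic_rows, subpops, compartments,
                             allow_missing_compartments=False):
    """Return (y0, population) from long-format rows.

    ic_rows: iterable of dicts with keys "subpop", "mc_name", "amount".
    y0: {subpop: {mc_name: amount}}, population: {subpop: total}.
    """
    y0 = {sp: {} for sp in subpops}
    for row in ic_rows:
        sp, mc = row["subpop"], row["mc_name"]
        if sp not in y0:
            raise ValueError(f"unknown subpop {sp!r}")
        if mc not in compartments:
            raise ValueError(f"unknown compartment {mc!r}")
        y0[sp][mc] = float(row["amount"])

    for sp in subpops:
        missing = [mc for mc in compartments if mc not in y0[sp]]
        if missing and not allow_missing_compartments:
            raise ValueError(f"subpop {sp!r} is missing compartments: {missing}")
        for mc in missing:
            y0[sp][mc] = 0.0  # unlisted compartments start empty

    # The subpop population is defined by the initial conditions themselves.
    population = {sp: sum(y0[sp].values()) for sp in subpops}
    return y0, population


compartments = ["S_child_unvaxxed", "I_child_unvaxxed", "R_child_unvaxxed"]
rows = [
    {"subpop": "A", "mc_name": "S_child_unvaxxed", "amount": 990},
    {"subpop": "A", "mc_name": "I_child_unvaxxed", "amount": 10},
]
y0, population = build_initial_conditions(
    rows, ["A"], compartments, allow_missing_compartments=True
)
print(population)
```

With the default `allow_missing_compartments=False`, the same call raises an error because R_child_unvaxxed is not listed.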
For doing inference on initial conditions
- I am not 100% sure how this was being handled before, so maybe there's an easier way of thinking about this that's already implemented. This is all I could come up with
- if the "perturbation" option exists in the initial_conditions section, the inference can be performed on initial conditions. Each initial conditions file, regardless of type, should now have a perturb column with values 0 or 1 indicating whether that initial condition should be perturbed.
- add extra config option constrain_subpop_total which is TRUE or FALSE (default FALSE) and describes whether the perturbations must preserve the total subpopulation size or not. If constrain_subpop_total: FALSE then each initial condition is perturbed independently and the total subpopulation size may change between runs.
- If constrain_subpop_total: TRUE things are more complicated because it is not clear the best way to simultaneously vary one or more initial conditions while making sure the total initial population size of that subpopulation doesn't change. I think the only way to do this is to take all compartments that have "perturb" by them and draw a new initial condition from a Dirichlet/Logit Normal distribution that allows for the sum of the values to be the same, with a specified mean and variance-ish-value for each:
  - https://stats.stackexchange.com/questions/220543/generate-a-random-set-of-numbers-with-fixed-sum-and-desired-means-and-variances
  - https://en.wikipedia.org/wiki/Logit-normal_distribution
- If the method is FromFileInputProportional and there is only one initial condition in a subpopulation being perturbed, then the initial condition of the compartment with amount "rest" will also be perturbed to keep the sum constant (regardless of the value in the perturb column for this entry; we might want to throw a warning if the value is 0 here).
- We need to decide if perturbations will only consider integer values of "amount" when perturbing it (but allow any real value between 0 and 1 for the FromFileInputProportional method)
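One possible shape for the constrain_subpop_total: TRUE case is sketched below in Python. It redraws the perturbed compartments from a Dirichlet distribution centred on their current fractions, built from normalized gamma draws, so the sum is preserved by construction. The function name and the `concentration` parameter (the "variance-ish" knob) are assumptions, and this is not how gempyor currently perturbs anything.

```python
import random


def perturb_preserving_total(amounts, concentration=200.0, rng=None):
    """Redraw compartment amounts so that their sum is unchanged.

    amounts: {mc_name: amount} for the compartments flagged perturb=1 in one subpop.
    The draw is Dirichlet(concentration * current_fractions), so its mean equals
    the current fractions; a larger concentration gives smaller perturbations.
    """
    rng = rng or random.Random()
    total = sum(amounts.values())
    if total <= 0:
        return dict(amounts)

    # Compartments that are currently zero stay zero (gamma needs a positive shape).
    names = [mc for mc, a in amounts.items() if a > 0]
    gammas = [rng.gammavariate(concentration * amounts[mc] / total, 1.0) for mc in names]
    norm = sum(gammas)

    perturbed = {mc: 0.0 for mc in amounts}
    for mc, g in zip(names, gammas):
        perturbed[mc] = total * g / norm  # normalized gammas form a Dirichlet draw
    return perturbed


rng = random.Random(0)
new_ic = perturb_preserving_total({"S": 990.0, "I": 10.0, "R": 0.0}, rng=rng)
print(new_ic, sum(new_ic.values()))
```

This gives real-valued amounts; whether they should be rounded to integers is the open question in the last bullet above.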
Related changes
- The model should keep track of total population size over time to use as the denominator in force-of-infection rates. Total population size could potentially also be recorded in the SEIR file.
- These changes will help make it more logical to adapt gempyor to be able to have 0th order input rates - like births - that increase population size, and 1st order output rates that have no destination, like deaths - both of which change total population size over time
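To illustrate the point about the denominator, here is a toy Python sketch (plain Euler steps on a single-subpop SIR model, nothing to do with gempyor's real solver) in which the total population is recomputed at every step and used in the force of infection, with a 0th order birth inflow and a 1st order death rate. All names and parameter values are hypothetical.

```python
def simulate_sir_with_demography(s, i, r, beta, gamma, births, death_rate, dt, n_steps):
    """Return a list of (s, i, r, n) tuples, one per time point.

    births: constant inflow into S per unit time (0th order).
    death_rate: per-capita outflow from every compartment (1st order, no destination).
    """
    history = []
    for _ in range(n_steps):
        n = s + i + r  # total population, tracked over time rather than fixed
        history.append((s, i, r, n))
        infections = beta * s * i / n if n > 0 else 0.0
        ds = births - infections - death_rate * s
        di = infections - gamma * i - death_rate * i
        dr = gamma * i - death_rate * r
        s, i, r = s + dt * ds, i + dt * di, r + dt * dr
    history.append((s, i, r, s + i + r))
    return history


history = simulate_sir_with_demography(
    990.0, 10.0, 0.0, beta=0.3, gamma=0.1, births=5.0, death_rate=0.0, dt=1.0, n_steps=10
)
print(history[-1][3])  # total population at the final time point
```

The fourth element of each tuple is the quantity that could be recorded alongside the SEIR output.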

Thanks @alsnhll, that's some really great and consistent choices. Some comments

Non proportional methods

get rid of config: setup_subpop option and geodata type file

I think we'd still need a list of subpop names somewhere: with the FromFolder methods in particular, the plotting scripts and checks rely on it. The subpop setup also contains mobility.

require the initial_conditions section of the config. By definition, any dynamical system must have initial conditions so this makes sense

Agree.

Allow the following initial condition methods (I just made up names for now, can be changed, but would love something clearer than what we have now)

I agree, our names are bad and need changes. I could not find any better names than what you are proposing, but I'm not a fan of FromFileOutput.

method:FromFileInput

This should be the default and simplest option for users making simple config files

initial_conditions:initial_conditions_file is a csv or parquet file with the columns subpop, mc_name, amount, where mc_name is something like S_child_unvaxxed. amount is the # of individuals in the compartment at the model start time. The user does not need to give initial conditions for every compartment; those not listed will be assumed to be zero (see other config options below), and the total initial population size for each subpopulation will be taken as the sum of the specified initial conditions for all compartments in that subpopulation

Yes. So that's very close to what is implemented. Let's just keep in mind that while gempyor allows mc_name, I prefer that users (i.e. the documentation) instead submit the columns mc_vaccination_stage, mc_infection_stage, as these are clearer from the config (mc_name is a unique compartment name created by gempyor, which may change, and is ordered in the same order as the config, but still).

* Other config options:

  * `allow_missing_compartments`: `TRUE` or `FALSE`. Default to `FALSE`, will throw error if initial conditions are not specified for every single compartment in each subpopulation. If `TRUE`, will assume missing compartments have zero initial condition.
  * REMOVE the former "`allow_missing_nodes`" option, as now if we don't list an initial_condition for a node the population size will remain zero forever

Agree, it's a good change, but allow_missing_subpops is good because it raises an error, which is sometimes convenient? I don't know

  * REMOVE the former "`proportional`" option, as we now need initial conditions file to have information on the total population size

agree

* when model runs, creates init file as output directory type, that `init` file has initial conditions for ALL compartments in the same format as `model_output`, one file for each slot+iteration

Soo... I think it's useful not to have that every time. I was thinking about an alternative config option to "broadcast" seeding/initial_conditions to the file system structure. If activated, it would move the seeding/ic_file to model_output/seed or init so that e.g. inference or other scripts can be run on these (using 1 starting value). But it would be deactivated by default (instead of what you are proposing) because with that config:

initial_conditions:
  method: FromFileInput
  initial_conditions_file: data/my_ic.csv

if you run gempyor two times with the same run_id, then the second time it takes the initial conditions from the filesystem even if you had modified your data/my_ic.csv. I feel like this is confusing behaviour and better as an opt-in (moreover, fewer files to upload/read/write is good).

method: FromFileOutput

This option is for when users want to use the output of a previous simulation for the initial conditions of the current simulation

initial_conditions:initial_conditions_file is a csv or parquet file with the columns something like mc_value_type, mc_strata1, mc_strata2, ..., mc_strataN, mc_name, subpop_name_1, ..., subpop_name_n, date. All compartments for all subpopulations must be listed. The total initial population size for each subpopulation will be taken as the sum of the specified initial conditions for all compartments in that subpopulation at the time of the simulation start

No other config options

agree

Proportional methods

method:FromFileInputProportional

This method is similar to FromFileInput except that the user can give the initial conditions as fractions of the total population size, and must specify a separate file with the initial total population sizes for each subpopulation

initial_conditions:subpop_file is a csv or parquet file with the columns subpop, population, where "population" is the # of individuals, just like the existing geodata file.

subpop_population_file perhaps ?

* `initial_conditions:initial_conditions_file` is a csv or parquet file with the columns `subpop, mc_name, proportion`, where `mc_name` is something like `S_child_unvaxxed`. "`proportion`" is a fraction for % of population initially in that compartment, or, the term "`rest`", which means the fraction not specified will be allocated to this compartment. The user does not need to give initial conditions for every compartment; those not listed will be assumed to be zero (see other config options below). Each subpop must have at least one entry, and the sum of those entries must be less than 1.  If "`proportion`" is "`rest`" for the only compartment specified, the entire population size is assigned there

True, this is very close to what is currently implemented (save for the column name).
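For reference, the "rest" rule quoted above could be resolved along these lines. This is a hedged Python sketch and not the current gempyor implementation; the function name is made up, and allowing at most one "rest" entry per subpop is my assumption since the proposal does not say.

```python
def resolve_proportions(entries, population):
    """Turn proportions (or "rest") for one subpop into absolute amounts.

    entries: {mc_name: fraction or the string "rest"}.
    population: total size of the subpop, e.g. from the subpop file.
    """
    rest_names = [mc for mc, p in entries.items() if p == "rest"]
    if len(rest_names) > 1:
        raise ValueError("at most one compartment may be 'rest' (assumed rule)")

    specified = sum(float(p) for p in entries.values() if p != "rest")
    if specified > 1.0 + 1e-12:
        raise ValueError(f"proportions sum to {specified}, which exceeds 1")

    amounts = {mc: float(p) * population for mc, p in entries.items() if p != "rest"}
    if rest_names:
        # "rest" receives whatever has not been allocated explicitly
        amounts[rest_names[0]] = population - sum(amounts.values())
    return amounts


amounts = resolve_proportions(
    {"S_child_unvaxxed": "rest", "I_child_unvaxxed": 0.01}, population=1000.0
)
print(amounts)
```

When "rest" is the only entry, the explicit sum is zero and the whole population lands in that compartment, which matches the last sentence of the quoted description.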

* `method: FromFolderInput`
  * This method is identical to `FromFileInput`, except that now the user specifies a directory instead of a file. Inside that directory there should be list of files, and their names should be numbers that will correspond to the independent simulation numbers (slots for inference or just simulations for non-inference runs). (Note: before we had to specify the entire output file name with runID etc but I don't see why this is needed)

I think we currently impose that this is factored in flepimop filesystem, which is good and in line with current effort to make the config more rigid.

* `method: FromFolderOutput`
  * This method is identical to `FromFileOutput`, except that now the user specifies a directory instead of a file. Inside that directory there should be list of files, and their names should be numbers that will correspond to the independent simulation numbers (slots for inference or just simulations for non-inference runs). (Note: before we had to specify the entire output file name with runID etc but I don't see why this is needed)
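As a small illustration of the "file names are slot numbers" idea in the two folder methods, a lookup could be as simple as the Python sketch below. The function name and the accepted extensions are assumptions, and this ignores the flepimop filesystem naming mentioned above.

```python
import tempfile
from pathlib import Path


def find_slot_file(folder, slot, extensions=(".csv", ".parquet")):
    """Return the file in `folder` whose name (without extension) is the number `slot`."""
    matches = [
        p
        for p in sorted(Path(folder).iterdir())
        if p.is_file()
        and p.suffix in extensions
        and p.stem.isdigit()
        and int(p.stem) == slot  # int() lets zero-padded names like 000000002 match
    ]
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one file for slot {slot} in {folder}, found {len(matches)}"
        )
    return matches[0]


# Demo in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1.csv", "2.csv", "notes.txt"):
        (Path(tmp) / name).write_text("subpop,mc_name,amount\n")
    picked = find_slot_file(tmp, slot=2).name
print(picked)
```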

About inference

For doing inference on initial conditions

I am not 100% sure how this was being handled before, so maybe there's an easier way of thinking about this that's already implemented. This is all I could come up with

if the "perturbation" option exists in the initial_conditions section, the inference can be performed on initial conditions. Each initial conditions file, regardless of type, should now have a perturb column with values 0 or 1 indicating whether that initial condition should be perturbed.

yes, I believe that is what is done at the moment but I'll check. However right now we can only fit proportional initial conditions files (to keep the population total while being simpler)

add extra config option constrain_subpop_total which is TRUE or FALSE (default FALSE) and describes whether the perturbations must preserve the total subpopulation size or not. If constrain_subpop_total: FALSE then each initial condition is perturbed independently and the total subpopulation size may change between runs.

that's a good idea

If constrain_subpop_total: TRUE things are more complicated because it is not clear the best way to simultaneously vary one or more initial conditions while making sure the total initial population size of that subpopulation doesn't change. I think the only way to do this is to take all compartments that have "perturb" by them and draw a new initial condition from a Dirichlet/Logit Normal distribution that allows for the sum of the values to be the same, with a specified mean and variance-ish-value for each:

https://stats.stackexchange.com/questions/220543/generate-a-random-set-of-numbers-with-fixed-sum-and-desired-means-and-variances

https://en.wikipedia.org/wiki/Logit-normal_distribution

that's a very cool idea. Note that there are functions to sample from these (https://www.pymc.io/projects/docs/en/stable/api/distributions/generated/pymc.StickBreakingWeights.html often called stick breaking weights: preserve sum). I like that.

If the method is FromFileInputProportional and there is only one initial condition in a subpopulation being perturbed, then the initial condition of the compartment with amount "rest" will also be perturbed to keep the sum constant (regardless of the value in the perturb column for this entry; we might want to throw a warning if the value is 0 here).

We need to decide if perturbations will only consider integer values of "amount" when perturbing it (but allow any real value between 0 and 1 for the FromFileInputProportional method)

yes, I think that should be an option. In fact, it should be specific to inference only.

Related changes

The model should keep track of total population size over time to use as the denominator in force-of-infection rates. Total population size could potentially also be recorded in the SEIR file.

agree, and I think we should record it (so it can also be an outcome for postprocessing scripts).

These changes will help make it more logical to adapt gempyor to be able to have 0th order input rates - like births - that increase population size, and 1st order output rates that have no destination, like deaths - both of which change total population size over time

🥳

These methods should also be thought through so that the interface to seeding is consistent (which I think is the case here)

The init file is only broadcast and saved to init_files if it's being perturbed. If it is a resume run, the initial_conditions_file must be removed from the config and the method set to "FromFolderInput"

Will convene meeting to do this as a group.

Just as a summary of the current initial conditions methods (following from some slack back and forth)

Methods:

SetInitialConditions
- default, simplest
- long format
- can have missing compartments
- initial_conditions:initial_conditions_file is a csv or parquet file with the columns subpop, mc_name, amount, where mc_name is something like S_child_unvaxxed. amount is the # of individuals in the compartment at the model start time. The user does not need to give initial conditions for every compartment; those not listed will be assumed to be zero (see other config options below), and the total initial population size for each subpopulation will be taken as the sum of the specified initial conditions for all compartments in that subpopulation
- TO DO: remove former allow_missing_nodes and proportional (if we want to remove the use of geodata, as we need this to define the population sizes)
- during a multiple-slot run, file is copied for each slot
SetInitialConditionsFolderDraw
- like SetInitialConditions but with a file per slot (pre-set by user)
FromFile
- columns are output from previous simulation all compartments/strata mc_*
- this is a single file in wide format like SEIR files
- since this is from a previous simulation, the total population size for subpops is given
InitialConditionsFolderDraw
- like FromFile but a bunch of slots - have to be in the right format, like in resumes
- How do we define what folder to look at?

I'm not a huge fan of these method names as I don't personally find them very intuitive - i.e. FolderDraw sounds to me like it's a random draw from a set of options but it's really just assigning the equivalent slot to the equivalent file in the folder. But also I can't think of anything better than FolderDraw, so probably fine as long as documented correctly and clearly. If we keep SetInitialConditions and SetInitialConditionsFolderDraw, then FromFile and InitialConditionsFolderDraw should match in naming structure (since they follow the same idea: 1 file vs a folder of files. Note: see screenshot). FromFile I find counterintuitive because technically they are all from a file... 🤔 Also SetInitialConditions not super intuitive to me because in all methods we are kind of 'setting' initial conditions...

I don't love the following suggestions, but just spit balling here...

How about some derivative of...

SetInitialConditions -> FromFile
SetInitialConditionsFolderDraw -> FromFileFolder
FromFile -> FromSimulation (i guess technically it doesn't have to be from a simulation, i.e. you could create this file yourself... but points to the file structure being the same)
InitialConditionsFolderDraw -> FromSimulationFolder

Alternatively could make the long/wide distinction obvious?

SetInitialConditions -> FileLong
SetInitialConditionsFolderDraw -> FolderLong
FromFile -> FileWide
InitialConditionsFolderDraw -> FolderWide

I prefer the first option here though.

Some random other notes relating to the previous comments on this issue:

If we want to phase out geodata: I'm not sure if we should anymore, as the runs I've been doing are using all the same initial conditions, seeding, ground truth data files (with multiple subpopulations) and using the geodata file to define just which subpops to look at in a given config. I think this is a good flow of the pipeline?

Regarding inference on initial conditions -> I'm not sure exactly what is done at the moment but we have not properly stress tested the proportional method (perturbation of initial conditions is currently broken).

I'm just adding another note about the issue of the geodata file. When looking through our code for hierarchical likelihoods, I realized that the geodata file can also be used to specify groups of subpopulations that should be considered to have similar parameter values. An extra column can be added, and this column will be used to calculate an additional term in the likelihood (a sort of post-hoc group-level modeling approach) that penalizes parameter proposals whenever grouped subpopulations have values that are further apart from each other (or something like that, the method is a bit unclear). This is a reason we need to keep the geodata file.

I am proposing a new way of specifying initial conditions and their options, getting rid of confusing method names like SetInitialConditionsFolderDraw. It includes ideas for how perturbations (inference) and file saving work. Check out this file for the proposed config options and their meanings. Share any feedback here or in comments on the spreadsheet! https://docs.google.com/spreadsheets/d/1ITgNAFuGKRhrwX_pvLUqaWq0OWmpCPvhjjdldvnYIA0/edit?usp=sharing (you should have access to this if you have access to the flepimop google drive folder). Note that we are no longer proposing to remove the geodata file. The file still exists and is used for the subpop list, but the subpop sizes in there are not necessarily used (this column is not even required); by default they are taken from initial conditions. We can make very clear warnings to the user about this.
