Open shauntruelove opened 11 months ago
I don't understand what this means. The initial conditions file does not define the population and mobility. However, in the initial conditions file, if you specify the initial size of each compartment in a population, it should add up to the total population size specified in geodata. I THINK right now an error will be thrown if it doesn't. ALSO, there is an option in the code that you don't have to give the ICs for each compartment, but can give it for some and then write "rest" for another compartment, and it will use the total population size in geodata, minus the other compartments specified already in IC, to populate the ICs for this "rest" compartment. BUT, I believe this "rest" option is not currently working for some reason - I tried it and it failed, despite being in code.
Related: Perhaps consider renaming "geodata" to something more general like "popstruct"
geodata and initial conditions are doing the same things. Ideally subpopulations should be defined by a single file (like the location.csv in the hubverse notation).
The population will be allowed to vary, e.g. for birth and death processes. Currently geodata's population is used as the node denominator, but when the initial_conditions population is not equal to it, there are issues. These are currently handled by failing when the totals don't match. However, it is hard to get matching totals with floating-point errors.
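To illustrate the floating-point problem mentioned above, here is a minimal sketch of a tolerance-based total check. The function name `totals_match` and the tolerance values are assumptions made for illustration; this is not gempyor's actual check.

```python
import math

def totals_match(ic_amounts, geodata_population, rel_tol=1e-9, abs_tol=1e-6):
    """Return True if the summed initial conditions equal the geodata
    population up to a floating-point tolerance (hypothetical helper)."""
    total = math.fsum(ic_amounts)  # fsum limits accumulated rounding error
    return math.isclose(total, geodata_population, rel_tol=rel_tol, abs_tol=abs_tol)

# An exact comparison is fragile: 0.1 + 0.2 is not exactly 0.3 in binary floats.
exact = (0.1 + 0.2) == 0.3
tolerant = totals_match([0.1, 0.2], 0.3)
print(exact, tolerant)
```

A tolerance check like this still rejects real mismatches (for example 300 individuals specified against a population of 1000) while accepting totals that differ only by rounding.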
Discussion on issue #91
Now the issue of 82 is that: geodata specifies to gempyor:
I thought a lot about this over the last few days and agree with the issues with the geodata file vs. `initial_conditions`, and have detailed a proposal for dealing with it all:

- get rid of the `config: setup_subpop` option and the `geodata` type file
- require the `initial_conditions` section of the config. By definition, any dynamical system must have initial conditions, so this makes sense
- Allow the following initial condition methods (I just made up names for now, can be changed, but would love something clearer than what we have now)
- `method: FromFileInput`
  - This should be the default and simplest option for users making simple config files
  - `initial_conditions:initial_conditions_file` is a csv or parquet file with the columns `subpop, mc_name, amount`, where `mc_name` is something like `S_child_unvaxxed`. `amount` is the # of individuals in the compartment at the model start time. The user does not need to give initial conditions for every compartment; those not listed will be assumed to be zero (see other config options below), and the total initial population size for each subpopulation will be taken as the sum of the specified initial conditions for all compartments in that subpopulation
  - Other config options:
    - `allow_missing_compartments`: `TRUE` or `FALSE`. Defaults to `FALSE`, which will throw an error if initial conditions are not specified for every single compartment in each subpopulation. If `TRUE`, will assume missing compartments have zero initial condition.
    - REMOVE the former "`allow_missing_nodes`" option, as now if we don't list an initial_condition for a node the population size will remain zero forever
    - REMOVE the former "`proportional`" option, as we now need the initial conditions file to have information on the total population size
  - when model runs, creates init file as output directory type, that `init` file has initial conditions for ALL compartments in the same format as `model_output`, one file for each slot+iteration
- `method: FromFileOutput`
  - This option is for when users want to use the output of a previous simulation for the initial conditions of the current simulation
  - `initial_conditions:initial_conditions_file` is a csv or parquet file with the columns something like `mc_value_type, mc_strata1, mc_strata2, ....., mc_strataN, mc_name, subpop_name_1, .... subpop_name_n, date`. All compartments for all subpopulations must be listed. The total initial population size for each subpopulation will be taken as the sum of the specified initial conditions for all compartments in that subpopulation at the time of the simulation start
  - No other config options
- `method: FromFileInputProportional`
  - This method is similar to `FromFileInput` except that the user can give the initial conditions as fractions of the total population size, and must specify a separate file with the initial total population sizes for each subpopulation
  - `initial_conditions: subpop_file` is a csv or parquet file with the columns `subpop, population`, where "`population`" is the # of individuals, just like the existing geodata file.
  - `initial_conditions:initial_conditions_file` is a csv or parquet file with the columns `subpop, mc_name, proportion`, where `mc_name` is something like `S_child_unvaxxed`. "`proportion`" is a fraction for % of population initially in that compartment, or, the term "`rest`", which means the fraction not specified will be allocated to this compartment. The user does not need to give initial conditions for every compartment; those not listed will be assumed to be zero (see other config options below). Each subpop must have at least one entry, and the sum of those entries must be less than 1. If "`proportion`" is "`rest`" for the only compartment specified, the entire population size is assigned there
- `method: FromFolderInput`
  - This method is identical to `FromFileInput`, except that now the user specifies a directory instead of a file. Inside that directory there should be a list of files, and their names should be numbers that will correspond to the independent simulation numbers (slots for inference or just simulations for non-inference runs). (Note: before we had to specify the entire output file name with runID etc but I don't see why this is needed)
- `method: FromFolderOutput`
  - This method is identical to `FromFileOutput`, except that now the user specifies a directory instead of a file. Inside that directory there should be a list of files, and their names should be numbers that will correspond to the independent simulation numbers (slots for inference or just simulations for non-inference runs). (Note: before we had to specify the entire output file name with runID etc but I don't see why this is needed)

For doing inference on initial conditions

- I am not 100% sure how this was being handled before, so maybe there's an easier way of thinking about this that's already implemented. This is all I could come up with
- if the "`perturbation`" option exists in the `initial_conditions` section, the inference can be performed on initial conditions. Each initial conditions file, regardless of type, should now have a `perturb` column with values 0 or 1 if that initial condition should be perturbed.
- add extra config option `constrain_subpop_total`, which is TRUE or FALSE (default FALSE) and describes whether the perturbations must preserve the total subpopulation size or not. If `constrain_subpop_total: FALSE` then each initial condition is perturbed independently and the total subpopulation size may change between runs.
- If `constrain_subpop_total: TRUE`, things are more complicated because it is not clear what the best way is to simultaneously vary one or more initial conditions while making sure the total initial population size of that subpopulation doesn't change. I think the only way to do this is to take all compartments that have "perturb" by them and draw a new initial condition from a Dirichlet/Logit Normal distribution that allows for the sum of the values to be the same, with a specified mean and variance-ish-value for each:
- If the method is `FromFileInputProportional` and if there is only one initial condition in a subpopulation being perturbed, then the initial condition of the compartment with amount "`rest`" will also be perturbed to keep the sum constant (regardless of the value in the perturb column for this entry; might want to throw a warning if the value is 0 here).
- We need to decide if perturbations will only consider integer values of the "`amount`" value when perturbing it (but allow any real value between 0 and 1 for the `FromFileInputProportional` method)

Thanks @alsnhll, those are some really great and consistent choices. Some comments:
- get rid of the `config: setup_subpop` option and the `geodata` type file
I think we'd still need a list of subpop names somewhere: it matters for the FromFolder methods, and the plotting scripts and checks rely on it. The subpop setup also contains mobility.
- require the `initial_conditions` section of the config. By definition, any dynamical system must have initial conditions, so this makes sense
Agree.
- Allow the following initial condition methods (I just made up names for now, can be changed, but would love something clearer than what we have now)
I agree, our names are bad and need changes. I could not find any better names than what you are proposing, but I'm not a fan of FromFileOutput.
`method: FromFileInput`
- This should be the default and simplest option for users making simple config files
- `initial_conditions:initial_conditions_file` is a csv or parquet file with the columns `subpop, mc_name, amount`, where `mc_name` is something like `S_child_unvaxxed`. `amount` is the # of individuals in the compartment at the model start time. The user does not need to give initial conditions for every compartment; those not listed will be assumed to be zero (see other config options below), and the total initial population size for each subpopulation will be taken as the sum of the specified initial conditions for all compartments in that subpopulation
Yes. So that's very close to what is implemented. Let's just keep in mind that while gempyor allows `mc_name`, I'd prefer users (i.e. the documentation) to instead supply columns such as `mc_vaccination_stage` and `mc_infection_stage`, as these are clearer from the config (`mc_name` is a unique compartment name created by gempyor, which may change, and is ordered in the same order as the config, but still).
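As a rough illustration of the `FromFileInput` format discussed above, the sketch below reads a long-format table with columns `subpop, mc_name, amount` and derives each subpopulation's total as the sum of its listed compartments. The helper names and the example subpop names are hypothetical, and only the standard library is used; gempyor's own reader may differ.

```python
import csv
import io
from collections import defaultdict

def load_ic_amounts(csv_text):
    """Parse rows of (subpop, mc_name, amount) into {subpop: {mc_name: amount}}."""
    ic = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        ic[row["subpop"]][row["mc_name"]] = float(row["amount"])
    return dict(ic)

def subpop_totals(ic):
    """Total initial population per subpop = sum over its listed compartments."""
    return {subpop: sum(comps.values()) for subpop, comps in ic.items()}

# Hypothetical example file contents (subpop names are made up).
EXAMPLE = """subpop,mc_name,amount
large_province,S_child_unvaxxed,8990
large_province,I_child_unvaxxed,10
small_province,S_child_unvaxxed,1000
"""

ic = load_ic_amounts(EXAMPLE)
totals = subpop_totals(ic)
print(totals)
```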
* Other config options:
  * `allow_missing_compartments`: `TRUE` or `FALSE`. Defaults to `FALSE`, which will throw an error if initial conditions are not specified for every single compartment in each subpopulation. If `TRUE`, will assume missing compartments have zero initial condition.
  * REMOVE the former "`allow_missing_nodes`" option, as now if we don't list an initial_condition for a node the population size will remain zero forever
Agree, it's a good change, but allow_missing_subpops is good because it raises an error, which is sometimes convenient? I don't know.
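The proposed `allow_missing_compartments` behaviour could be sketched as follows. The function name and data layout (`{subpop: {mc_name: amount}}`) are assumptions for illustration, not gempyor's API.

```python
def fill_missing_compartments(ic, all_compartments, allow_missing_compartments=False):
    """Return a copy of `ic` with every compartment present for every subpop.

    With allow_missing_compartments=False (the proposed default) a missing
    compartment raises an error; with True it is filled with zero.
    """
    filled = {}
    for subpop, comps in ic.items():
        missing = [c for c in all_compartments if c not in comps]
        if missing and not allow_missing_compartments:
            raise ValueError(
                f"subpop {subpop!r} has no initial condition for: {missing}"
            )
        filled[subpop] = {c: comps.get(c, 0.0) for c in all_compartments}
    return filled

compartments = ["S_child_unvaxxed", "I_child_unvaxxed", "R_child_unvaxxed"]
partial = {"small_province": {"S_child_unvaxxed": 1000.0}}
filled = fill_missing_compartments(partial, compartments, allow_missing_compartments=True)
print(filled)
```

Raising by default keeps typos in compartment names from silently turning into zero-sized compartments, which is the convenience mentioned in the reply above.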
* REMOVE the former "`proportional`" option, as we now need initial conditions file to have information on the total population size
agree
* when model runs, creates init file as output directory type, that `init` file has initial conditions for ALL compartments in the same format as `model_output`, one file for each slot+iteration
So... I think it's useful to not have that every time. I was thinking about an alternative config option to "broadcast" seeding/initial_conditions to the file system structure. If activated, it would move the seeding/ic_file to model_output/seed or init so that e.g. inference or other scripts can be run on these (using 1 starting value). But it would be deactivated by default (instead of what you are proposing) because with this config:
initial_conditions:
  method: FromFileInput
  initial_conditions_file: data/my_ic.csv
if you run gempyor twice with the same run_id, then the second time it reads from the filesystem even if you had modified your data/my_ic.csv. I feel like this is confusing behaviour and better as an opt-in (moreover, fewer files to upload/read/write is good).
`method: FromFileOutput`
- This option is for when users want to use the output of a previous simulation for the initial conditions of the current simulation
- `initial_conditions:initial_conditions_file` is a csv or parquet file with the columns something like `mc_value_type, mc_strata1, mc_strata2, ....., mc_strataN, mc_name, subpop_name_1, .... subpop_name_n, date`. All compartments for all subpopulations must be listed. The total initial population size for each subpopulation will be taken as the sum of the specified initial conditions for all compartments in that subpopulation at the time of the simulation start
- No other config options
agree
`method: FromFileInputProportional`
- This method is similar to `FromFileInput` except that the user can give the initial conditions as fractions of the total population size, and must specify a separate file with the initial total population sizes for each subpopulation
- `initial_conditions: subpop_file` is a csv or parquet file with the columns `subpop, population`, where "`population`" is the # of individuals, just like the existing geodata file.
subpop_population_file perhaps ?
* `initial_conditions:initial_conditions_file` is a csv or parquet file with the columns `subpop, mc_name, proportion`, where `mc_name` is something like `S_child_unvaxxed`. "`proportion`" is a fraction for % of population initially in that compartment, or, the term "`rest`", which means the fraction not specified will be allocated to this compartment. The user does not need to give initial conditions for every compartment; those not listed will be assumed to be zero (see other config options below). Each subpop must have at least one entry, and the sum of those entries must be less than 1. If "`proportion`" is "`rest`" for the only compartment specified, the entire population size is assigned there
True, this is very close to what is currently implemented (save for the column name).
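The "`rest`" rule described above can be sketched like this: fractions are converted to amounts using the subpop's population, and the single `rest` compartment receives whatever is left. The function name is hypothetical, and allowing the specified fractions to sum to exactly 1 (rather than strictly less than 1) is an assumption of this sketch.

```python
def resolve_proportions(entries, population):
    """Convert {mc_name: fraction or "rest"} for one subpop into amounts.

    At most one compartment may be "rest"; it receives the population not
    claimed by the other listed compartments.
    """
    rest = [name for name, value in entries.items() if value == "rest"]
    if len(rest) > 1:
        raise ValueError("at most one compartment may use 'rest'")
    fractions = {n: float(v) for n, v in entries.items() if v != "rest"}
    if sum(fractions.values()) > 1.0 + 1e-12:
        raise ValueError("specified proportions sum to more than 1")
    amounts = {n: f * population for n, f in fractions.items()}
    if rest:
        amounts[rest[0]] = population - sum(amounts.values())
    return amounts

mixed = resolve_proportions({"S_child_unvaxxed": "rest", "I_child_unvaxxed": 0.01}, 1000)
only_rest = resolve_proportions({"S_child_unvaxxed": "rest"}, 1000)
print(mixed, only_rest)
```

Computing `rest` by subtraction guarantees that the compartment amounts add up to the subpop population, which sidesteps the total-mismatch failure discussed earlier in the thread.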
- `method: FromFolderInput`
  * This method is identical to `FromFileInput`, except that now the user specifies a directory instead of a file. Inside that directory there should be a list of files, and their names should be numbers that will correspond to the independent simulation numbers (slots for inference or just simulations for non-inference runs). (Note: before we had to specify the entire output file name with runID etc but I don't see why this is needed)
I think we currently impose that this is factored in flepimop filesystem, which is good and in line with current effort to make the config more rigid.
- `method: FromFolderOutput`
  * This method is identical to `FromFileOutput`, except that now the user specifies a directory instead of a file. Inside that directory there should be a list of files, and their names should be numbers that will correspond to the independent simulation numbers (slots for inference or just simulations for non-inference runs). (Note: before we had to specify the entire output file name with runID etc but I don't see why this is needed)
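For the two folder-based methods, the slot-to-file mapping could look something like the sketch below. It assumes each file is named with a bare number plus a single extension (for example `2.csv`); that naming, and the function name `ic_file_for_slot`, are illustrative assumptions rather than the current flepimop filesystem layout.

```python
import tempfile
from pathlib import Path

def ic_file_for_slot(folder, slot):
    """Return the file in `folder` whose name (without extension) is `slot`."""
    matches = [
        p for p in Path(folder).iterdir()
        if p.is_file() and p.stem.isdigit() and int(p.stem) == slot
    ]
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one file for slot {slot} in {folder}, found {len(matches)}"
        )
    return matches[0]

# Demo with a throwaway directory holding one file per slot.
with tempfile.TemporaryDirectory() as tmp:
    for slot in (1, 2):
        (Path(tmp) / f"{slot}.csv").write_text("subpop,mc_name,amount\n")
    chosen = ic_file_for_slot(tmp, 2).name
print(chosen)
```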
For doing inference on initial conditions
- I am not 100% sure how this was being handled before, so maybe there's an easier way of thinking about this that's already implemented. This is all I could come up with
- if the "`perturbation`" option exists in the `initial_conditions` section, the inference can be performed on initial conditions. Each initial conditions file, regardless of type, should now have a `perturb` column with values 0 or 1 if that initial condition should be perturbed.
yes, I believe that is what is done at the moment but I'll check. However right now we can only fit proportional initial conditions files (to keep the population total while being simpler)
- add extra config option `constrain_subpop_total`, which is TRUE or FALSE (default FALSE) and describes whether the perturbations must preserve the total subpopulation size or not. If `constrain_subpop_total: FALSE` then each initial condition is perturbed independently and the total subpopulation size may change between runs.
that's a good idea
If `constrain_subpop_total: TRUE`, things are more complicated because it is not clear what the best way is to simultaneously vary one or more initial conditions while making sure the total initial population size of that subpopulation doesn't change. I think the only way to do this is to take all compartments that have "perturb" by them and draw a new initial condition from a Dirichlet/Logit Normal distribution that allows for the sum of the values to be the same, with a specified mean and variance-ish-value for each:
that's a very cool idea. Note that there are functions to sample from these (https://www.pymc.io/projects/docs/en/stable/api/distributions/generated/pymc.StickBreakingWeights.html often called stick breaking weights: preserve sum). I like that.
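A minimal sketch of the sum-preserving perturbation idea, using a plain Dirichlet draw built from normalized gamma variates in the standard library (not the PyMC stick-breaking function linked above). The function name, the `concentration` parameter (standing in for the "variance-ish" value), and the choice to centre the draw on the current proportions are all assumptions.

```python
import random

def perturb_preserving_total(amounts, concentration=100.0, rng=None):
    """Redraw the flagged compartments so that their sum is unchanged.

    A Dirichlet(alpha) sample is obtained by normalizing independent
    Gamma(alpha_i, 1) draws; alpha_i is proportional to the current share,
    so the mean of the draw is the current initial condition. Larger
    `concentration` means smaller perturbations.
    """
    rng = rng or random.Random()
    total = sum(amounts.values())
    if total <= 0:
        raise ValueError("need a positive total to perturb")
    draws = {
        name: rng.gammavariate(concentration * value / total, 1.0)
        for name, value in amounts.items()
        if value > 0  # compartments at zero stay at zero
    }
    norm = sum(draws.values())
    return {name: total * draws.get(name, 0.0) / norm for name in amounts}

flagged = {"S_child_unvaxxed": 990.0, "I_child_unvaxxed": 10.0}
perturbed = perturb_preserving_total(flagged, rng=random.Random(0))
print(perturbed, sum(perturbed.values()))
```

Only the compartments flagged for perturbation would be passed in; because their combined amount is preserved, the subpopulation total is preserved too.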
If the method is `FromFileInputProportional` and if there is only one initial condition in a subpopulation being perturbed, then the initial condition of the compartment with amount "`rest`" will also be perturbed to keep the sum constant (regardless of the value in the perturb column for this entry; might want to throw a warning if the value is 0 here).

We need to decide if perturbations will only consider integer values of the "`amount`" value when perturbing it (but allow any real value between 0 and 1 for the `FromFileInputProportional` method)
yes, I think that should be an option. In fact, it should be specific to inference only.
Related changes
- The model should keep track of total population size over time to use as the denominator in force-of-infection rates. Total population size could potentially also be recorded in the SEIR file.
agree, and I think we should record it (so it can also be an outcome for postprocessing scripts).
- These changes will help make it more logical to adapt gempyor to be able to have 0th order input rates - like births - that increase population size, and 1st order output rates that have no destination, like deaths - both of which change total population size over time
🥳
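To make the "related changes" concrete, here is a toy single-subpopulation SIR sketch (not gempyor code) in which the total population is recomputed at every step, used as the force-of-infection denominator, and recorded. Births enter as a 0th-order input rate and deaths leave as a 1st-order rate with no destination. All names, parameter values, and the simple Euler update are illustrative assumptions.

```python
def simulate(S, I, R, beta, gamma, birth_rate, death_rate, dt, n_steps):
    """Euler-step a toy SIR model and return the recorded total population."""
    history = []
    for _ in range(n_steps):
        N = S + I + R          # current total, not the initial geodata value
        history.append(N)      # recorded so it can be saved with the outputs
        infections = beta * S * I / N * dt
        recoveries = gamma * I * dt
        births = birth_rate * dt  # 0th order: independent of compartment sizes
        S, I, R = (
            S + births - infections - death_rate * S * dt,
            I + infections - recoveries - death_rate * I * dt,
            R + recoveries - death_rate * R * dt,
        )
    return history

closed = simulate(990.0, 10.0, 0.0, beta=0.3, gamma=0.1,
                  birth_rate=0.0, death_rate=0.0, dt=1.0, n_steps=10)
growing = simulate(990.0, 10.0, 0.0, beta=0.3, gamma=0.1,
                   birth_rate=5.0, death_rate=0.0, dt=1.0, n_steps=10)
print(closed[-1], growing[-1])
```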
These methods should also be thought out so that the interface to seeding is consistent (which I think is the case here)
init file only broadcast and save to init_files if it's being perturbed.
If it is a resume run, the initial_conditions_file must be removed from config and method = "FromFolderInput"
Will convene meeting to do this as a group.
Just as a summary of the current initial conditions methods (following from some slack back and forth)
`initial_conditions:initial_conditions_file` is a csv or parquet file with the columns `subpop, mc_name, amount`, where `mc_name` is something like `S_child_unvaxxed`. `amount` is the # of individuals in the compartment at the model start time. The user does not need to give initial conditions for every compartment; those not listed will be assumed to be zero (see other config options below), and the total initial population size for each subpopulation will be taken as the sum of the specified initial conditions for all compartments in that subpopulation

`allow_missing_nodes` and `proportional` (if we want to remove the use of geodata, as we need this to define the population sizes)

I'm not a huge fan of these method names as I don't personally find them very intuitive - i.e. FolderDraw sounds to me like it's a random draw from a set of options but it's really just assigning the equivalent slot to the equivalent file in the folder. But also I can't think of anything better than FolderDraw, so probably fine as long as documented correctly and clearly.
If we keep `SetInitialConditions` and `SetInitialConditionsFolderDraw`, then `FromFile` and `InitialConditionsFolderDraw` should match in naming structure (since they follow the same idea: 1 file vs a folder of files. Note: see screenshot). FromFile I find counterintuitive because technically they are all from a file... 🤔 Also `SetInitialConditions` is not super intuitive to me because in all methods we are kind of 'setting' initial conditions...
I don't love the following suggestions, but just spit balling here...
How about some derivative of...
Alternatively could make the long/wide distinction obvious?
I prefer the first option here though.
If we want to phase out geodata: I'm not sure if we should anymore, as the runs I've been doing are using all the same initial conditions, seeding, ground truth data files (with multiple subpopulations) and using the geodata file to define just which subpops to look at in a given config. I think this is a good flow of the pipeline?
Regarding inference on initial conditions -> I'm not sure exactly what is done at the moment but we have not properly stress tested the proportional method (perturbation of initial conditions is currently broken).
I'm just adding another note about the issue of the geodata file. When looking through our code for hierarchical likelihoods, I realized that the geodata file can also be used to specify groups of subpopulations that should be considered to have similar parameter values. An extra column can be added, and this column will be used to calculate an additional term in the likelihood - a sort of post-hoc group-level modeling approach - that penalizes parameter proposals whenever grouped subpopulations have values that are further apart from each other (or something like that, the method is a bit unclear). This is a reason we need to keep the geodata file.
I am proposing a new way of specifying initial conditions and their options, getting rid of these confusing method names like `SetInitialConditionsFolderDraw`. It includes ideas for how perturbations (inference) and file saving work. Check out this file for the proposed config options and their meanings. Share any feedback here or in comments on the spreadsheet! https://docs.google.com/spreadsheets/d/1ITgNAFuGKRhrwX_pvLUqaWq0OWmpCPvhjjdldvnYIA0/edit?usp=sharing (you should have access to this if you have access to the flepimop google drive folder)
Note that we are no longer proposing to remove the `geodata` file. The file still exists and is used for the subpop list, but the subpop sizes in there are not necessarily used - this column is not even required - they are by default taken from initial conditions. We can make very clear warnings to the user about this.
Currently both geodata and initial conditions define the population and mobility. We should reduce this confusion/redundancy.