HopkinsIDD / flepiMoP

The Flexible Epidemic Modeling Pipeline
https://flepimop.org
GNU General Public License v3.0

Update initial condition specifications and use of geodata file #82

Open shauntruelove opened 11 months ago

shauntruelove commented 11 months ago

Currently both geodata and initial conditions define the population and mobility. We should reduce this confusion/redundancy.

alsnhll commented 11 months ago

I don't understand what this means. The initial conditions file does not define the population and mobility. However, in the initial conditions file, if you specify the initial size of each compartment in a population, it should add up to the total population size specified in geodata. I THINK right now an error will be thrown if it doesn't. ALSO, there is an option in the code that you don't have to give the ICs for each compartment, but can give it for some and then write "rest" for another compartment, and it will use the total population size in geodata, minus the other compartments specified already in IC, to populate the ICs for this "rest" compartment. BUT, I believe this "rest" option is not currently working for some reason - I tried it and it failed, despite being in code.

alsnhll commented 11 months ago

Related: Perhaps consider renaming "geodata" to something more general like "popstruct"

jcblemai commented 11 months ago

geodata and initial conditions are doing the same things. Ideally subpopulations should be defined by a single file (like the location.csv in the hubverse notation).

The population will be allowed to vary, e.g. for birth and death processes. Currently geodata's population is used as the node denominator, but when the initial_conditions population is not equal to it, this causes problems. These are currently solved by failing when the totals don't match. However, it is hard to get matching totals with floating-point errors.
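One way to soften the exact-match check described above is to compare the totals with a tolerance. This is a minimal sketch only: `totals_match` and its tolerance defaults are hypothetical and are not part of gempyor.

```python
import math


def totals_match(geodata_population, ic_amounts, rel_tol=1e-9, abs_tol=1e-6):
    """Return True if the summed initial conditions equal the geodata
    population up to floating-point tolerance.

    Hypothetical helper for illustration; the tolerances are assumptions.
    """
    # fsum limits the rounding error accumulated while adding many floats.
    ic_total = math.fsum(ic_amounts)
    return math.isclose(ic_total, geodata_population,
                        rel_tol=rel_tol, abs_tol=abs_tol)


# Amounts built from proportions are a typical source of tiny mismatches.
population = 1000
amounts = [p * population for p in [0.1] * 10]
print(totals_match(population, amounts))
```

A tolerance keeps the safety check (a subpop that is off by whole individuals still fails) without rejecting inputs that differ only by rounding.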

Discussion on issue #91

jcblemai commented 11 months ago

Now the issue of 82 is that: geodata specifies to gempyor:

alsnhll commented 11 months ago

I thought a lot about this over the last few days and agree with the issues with the geodata file vs initial_conditions, and have detailed a proposal for dealing with it all:

jcblemai commented 11 months ago

Thanks @alsnhll, those are some really great and consistent choices. Some comments:

Non proportional methods

  • get rid of config: setup_subpop option and geodata type file

I think we'd still need a list of subpop names somewhere: in the case of the FromFolder method, the plotting scripts and checks would need it. The subpop setup also contains mobility.

  • require the initial_conditions section of the config. By definition, any dynamical system must have initial conditions so this makes sense

Agree.

  • Allow the following initial condition methods (I just made up names for now, can be changed, but would love something clearer than what we have now)

I agree, our names are bad and need changes. I could not find any better names than what you are proposing, but I'm not a fan of FromFileOutput.

  • method: FromFileInput
    • This should be the default and simplest option for users making simple config files
    • initial_conditions:initial_conditions_file is a csv or parquet file with the columns subpop, mc_name, amount, where mc_name is something like S_child_unvaxxed. amount is the # of individuals in the compartment at the model start time. The user does not need to give initial conditions for every compartment; those not listed will be assumed to be zero (see other config options below), and the total initial population size for each subpopulation will be taken as the sum of the specified initial conditions for all compartments in that subpopulation

Yes. So that's very close to what is implemented. Let's just keep in mind that while gempyor allows mc_name, I prefer users (i.e. the documentation) to instead submit columns mc_vaccination_stage, mc_infection_stage, as these are clearer from the config (mc_name is a unique compartment name created by gempyor -- which may change, and is ordered in the same order as the config, but still).
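To make the proposed FromFileInput behaviour concrete, here is a rough sketch of reading the long-format file and deriving each subpop's total from it. The file contents, subpop names, and function names are made up for illustration, and this is not gempyor's actual reader.

```python
import csv
import io
from collections import defaultdict

# Hypothetical file contents in the proposed long format.
IC_CSV = """subpop,mc_name,amount
large_province,S_child_unvaxxed,8990
large_province,I_child_unvaxxed,10
small_province,S_child_unvaxxed,1000
"""


def read_initial_conditions(handle):
    """Return {subpop: {mc_name: amount}} from a long-format CSV handle."""
    ic = defaultdict(dict)
    for row in csv.DictReader(handle):
        ic[row["subpop"]][row["mc_name"]] = float(row["amount"])
    return dict(ic)


def subpop_totals(ic):
    """Total initial population per subpop: the sum over listed compartments."""
    return {subpop: sum(amounts.values()) for subpop, amounts in ic.items()}


ic = read_initial_conditions(io.StringIO(IC_CSV))
totals = subpop_totals(ic)
print(totals)
```

Under this scheme the initial conditions file is the only source of population sizes, so there is nothing left to reconcile against geodata.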

  • Other config options:

    • allow_missing_compartments: TRUE or FALSE. Defaults to FALSE, which will throw an error if initial conditions are not specified for every single compartment in each subpopulation. If TRUE, will assume missing compartments have zero initial condition.
    • REMOVE the former "allow_missing_nodes" option, as now if we don't list an initial_condition for a node the population size will remain zero forever

Agree, it's a good change, but allow_missing_subpops is good because it raises an error, which is sometimes convenient? I don't know.
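The proposed allow_missing_compartments semantics could look roughly like the following. `fill_missing_compartments` is a hypothetical helper written for this discussion, not existing gempyor code.

```python
def fill_missing_compartments(ic, all_compartments,
                              allow_missing_compartments=False):
    """Return a complete {subpop: {compartment: amount}} mapping.

    With the default (False), any unspecified compartment raises an error.
    With True, unspecified compartments are set to zero.
    """
    complete = {}
    for subpop, amounts in ic.items():
        missing = [c for c in all_compartments if c not in amounts]
        if missing and not allow_missing_compartments:
            raise ValueError(
                f"subpop {subpop!r} has no initial condition for: {missing}"
            )
        # Keep the compartment order of the model definition.
        complete[subpop] = {c: amounts.get(c, 0.0) for c in all_compartments}
    return complete


compartments = ["S", "I", "R"]
partial = {"small_province": {"S": 990.0, "I": 10.0}}
print(fill_missing_compartments(partial, compartments,
                                allow_missing_compartments=True))
```

Note that a subpop absent from the file never appears in `ic` at all, so a check like this cannot catch it. That is one argument for keeping a separate list of subpop names, or an option like allow_missing_subpops.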

    • REMOVE the former "proportional" option, as we now need the initial conditions file to have information on the total population size

agree

  • When the model runs, it creates an init file as an output directory type; that init file has initial conditions for ALL compartments in the same format as model_output, one file for each slot+iteration

Soo... I think it's useful to not have that every time. I was thinking about an alternative config option to "broadcast" seeding/initial_conditions to the file system structure. If activated, it would move the seeding/ic_file to model_output/seed or init so that e.g. inference or other scripts can be run on these (using 1 starting value). But it would be deactivated by default (instead of what you are proposing) because with that config:

initial_conditions:
  method: FromFileInput
  initial_conditions_file: data/my_ic.csv

if you run gempyor two times with the same run_id, then the second time it takes from the filesystem even if you had modified your data/my_ic.csv. I feel like this is confusing behaviour and better as an opt-in (moreover, fewer files to upload/read/write is good).

  • method: FromFileOutput

    • This option is for when users want to use the output of a previous simulation for the initial conditions of the current simulation
    • initial_conditions:initial_conditions_file is a csv or parquet file with the columns something like mc_value_type, mc_strata1, mc_strata2, .....,mc_strataN, mc_name, subpop_name_1, .... subpop_name_n, date. All compartments for all subpopulations must be listed. The total initial population size for each subpopulation will be taken as the sum of the specified initial conditions for all compartments in that subpopulation at the time of the simulation start
    • No other config options

agree

Proportional methods

  • method: FromFileInputProportional
    • This method is similar to FromFileInput except that the user can give the initial conditions as fractions of the total population size, and must specify a separate file with the initial total population sizes for each subpopulation
    • initial_conditions: subpop_file is a csv or parquet file with the columns subpop, population, where "population" is the # of individuals, just like the existing geodata file.

subpop_population_file perhaps?

    • initial_conditions:initial_conditions_file is a csv or parquet file with the columns subpop, mc_name, proportion, where mc_name is something like S_child_unvaxxed. "proportion" is either the fraction of the population initially in that compartment or the term "rest", which means the fraction not specified elsewhere will be allocated to this compartment. The user does not need to give initial conditions for every compartment; those not listed will be assumed to be zero (see other config options below). Each subpop must have at least one entry, and the sum of those entries must be less than 1. If "proportion" is "rest" for the only compartment specified, the entire population size is assigned there

True, this is very close to what is currently implemented (save for the column name).
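The "rest" rule can be sketched as follows for a single subpop. The function name is hypothetical, and allowing at most one "rest" entry per subpop is an assumption here rather than something the proposal states.

```python
def proportions_to_amounts(population, proportions):
    """Convert {mc_name: fraction or "rest"} to {mc_name: amount}.

    Illustrative only: the single-"rest" restriction is an assumption.
    """
    rest_keys = [name for name, value in proportions.items() if value == "rest"]
    if len(rest_keys) > 1:
        raise ValueError("at most one compartment may be marked 'rest'")

    specified = sum(v for v in proportions.values() if v != "rest")
    if specified > 1:
        raise ValueError(f"specified proportions sum to {specified}, above 1")

    amounts = {name: value * population
               for name, value in proportions.items() if value != "rest"}
    if rest_keys:
        # Whatever is not explicitly allocated goes to the "rest" compartment.
        amounts[rest_keys[0]] = population - sum(amounts.values())
    return amounts


print(proportions_to_amounts(1000, {"S_child_unvaxxed": "rest",
                                    "I_child_unvaxxed": 0.25}))
```

Computing the "rest" amount by subtraction rather than as a proportion means the subpop total equals the population file value up to floating-point rounding.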

  • method: FromFolderInput
    • This method is identical to FromFileInput, except that now the user specifies a directory instead of a file. Inside that directory there should be a list of files, and their names should be numbers that will correspond to the independent simulation numbers (slots for inference or just simulations for non-inference runs). (Note: before we had to specify the entire output file name with runID etc. but I don't see why this is needed)

I think we currently impose that this is factored in flepimop filesystem, which is good and in line with current effort to make the config more rigid.

  • method: FromFolderOutput
    • This method is identical to FromFileOutput, except that now the user specifies a directory instead of a file. Inside that directory there should be a list of files, and their names should be numbers that will correspond to the independent simulation numbers (slots for inference or just simulations for non-inference runs). (Note: before we had to specify the entire output file name with runID etc. but I don't see why this is needed)
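For the folder-based methods, the slot-to-file lookup under the proposed "file names are numbers" convention might be sketched like this. `file_for_slot` is hypothetical, and accepting zero-padded names is an assumption rather than part of the proposal.

```python
from pathlib import Path


def file_for_slot(folder, slot):
    """Return the file in `folder` whose name (without extension), read as an
    integer, equals `slot`. Raises if there is not exactly one such file."""
    matches = [
        path for path in Path(folder).iterdir()
        if path.is_file() and path.stem.isdecimal() and int(path.stem) == slot
    ]
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one file for slot {slot} in {folder}, "
            f"found {len(matches)}"
        )
    return matches[0]
```

For example, with a folder containing `1.csv` and `2.csv`, `file_for_slot(folder, 2)` returns the path to `2.csv`. The mapping is deterministic: slot n always reads file n, with no random draw involved.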

About inference

  • For doing inference on initial conditions

    • I am not 100% sure how this was being handled before, so maybe there's an easier way of thinking about this that's already implemented. This is all I could come up with
    • if the "perturbation" option exists in the initial_conditions section, the inference can be performed on initial conditions. Each initial conditions file, regardless of type, should now have a perturb column with values 0 or 1 indicating whether that initial condition should be perturbed.

yes, I believe that is what is done at the moment but I'll check. However right now we can only fit proportional initial conditions files (to keep the population total while being simpler)

  • add extra config option constrain_subpop_total which is TRUE or FALSE (default FALSE) and describes whether the perturbations must preserve the total subpopulation size or not. If constrain_subpop_total: FALSE then each initial condition is perturbed independently and the total subpopulation size may change between runs.

that's a good idea

that's a very cool idea. Note that there are functions to sample from these, often called stick-breaking weights, which preserve the sum (https://www.pymc.io/projects/docs/en/stable/api/distributions/generated/pymc.StickBreakingWeights.html). I like that.
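For readers unfamiliar with stick breaking, the idea can be sketched in plain Python. This is a generic standard-library illustration, not the PyMC distribution linked above and not anything implemented in flepiMoP. The function name and the `alpha` default are assumptions.

```python
import random


def stick_breaking_weights(n, alpha=1.0, rng=random):
    """Draw n non-negative weights that sum to 1 by repeatedly breaking off
    a Beta(1, alpha)-distributed share of the remaining unit stick."""
    weights = []
    remaining = 1.0
    for _ in range(n - 1):
        fraction = rng.betavariate(1.0, alpha)  # share of what is left
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # the last piece takes whatever is left
    return weights


# Turn the weights into compartment amounts for a subpop of fixed size.
rng = random.Random(0)
subpop_total = 1000
amounts = [w * subpop_total for w in stick_breaking_weights(4, rng=rng)]
print(amounts, sum(amounts))
```

Because every draw only redistributes a fixed stick, the compartment amounts always add up to the subpop total (up to rounding), which is the property a constrained perturbation would need.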

  • If the method is FromFileInputProportional and there is only one initial condition in a subpopulation being perturbed, then the initial condition of the compartment with amount "rest" will also be perturbed to keep the sum constant, regardless of the value in the perturb column for this entry. (Might want to throw a warning if the value is 0 here.)

  • We need to decide if perturbations will only consider integer values of the "amount" value when perturbing it (but allow any real value between 0 and 1 for the FromFileInputProportional method)

yes, I think that should be an option. In fact, it should be specified for inference only.

  • Related changes

    • The model should keep track of total population size over time to use as the denominator in force-of-infection rates. Total population size could potentially also be recorded in the SEIR file.

agree, and I think we should record it (so it can also be an outcome for postprocessing scripts).

  • These changes will help make it more logical to adapt gempyor to be able to have 0th order input rates - like births - that increase population size, and 1st order output rates that have no destination, like deaths - both of which change total population size over time

🥳
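As a rough illustration of the related changes above, here is a toy SIR step in which the total population is recomputed each step and used as the force-of-infection denominator, with births as a zeroth-order inflow and deaths as a first-order outflow with no destination. The compartment names, parameters, and plain Euler scheme are all assumptions for the sketch, and this is not how gempyor integrates.

```python
def step(state, beta, gamma, birth_rate, death_rate, dt):
    """One Euler step of an SIR model with births and deaths."""
    s, i, r = state["S"], state["I"], state["R"]
    n = s + i + r  # current total population, assumed to be positive
    infections = beta * s * i / n * dt
    recoveries = gamma * i * dt
    return {
        "S": s + birth_rate * dt - infections - death_rate * s * dt,
        "I": i + infections - recoveries - death_rate * i * dt,
        "R": r + recoveries - death_rate * r * dt,
    }


def simulate(state, n_steps, **rates):
    """Run n_steps and record the total population after every step."""
    totals = [sum(state.values())]
    for _ in range(n_steps):
        state = step(state, **rates)
        totals.append(sum(state.values()))
    return state, totals


initial = {"S": 990.0, "I": 10.0, "R": 0.0}
final, totals = simulate(initial, 10, beta=0.3, gamma=0.1,
                         birth_rate=5.0, death_rate=0.001, dt=1.0)
print(totals)
```

Recording `totals` alongside the compartments is what would let postprocessing scripts treat population size as an outcome.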

These methods should also be thought through so that the interface to seeding is consistent (which I think is the case here)

shauntruelove commented 10 months ago

The init file is only broadcast and saved to init_files if it's being perturbed. If it is a resume run, the initial_conditions_file must be removed from the config and method = "FromFolderInput"

shauntruelove commented 10 months ago

Will convene meeting to do this as a group.

saraloo commented 6 months ago

Just as a summary of the current initial conditions methods (following from some slack back and forth)

Methods:

I'm not a huge fan of these method names as I don't personally find them very intuitive - i.e. FolderDraw sounds to me like it's a random draw from a set of options but it's really just assigning the equivalent slot to the equivalent file in the folder. But also I can't think of anything better than FolderDraw, so probably fine as long as documented correctly and clearly. If we keep SetInitialConditions and SetInitialConditionsFolderDraw, then FromFile and InitialConditionsFolderDraw should match in naming structure (since they follow the same idea: 1 file vs a folder of files. Note: see screenshot). FromFile I find counterintuitive because technically they are all from a file... 🤔 Also SetInitialConditions not super intuitive to me because in all methods we are kind of 'setting' initial conditions...

I don't love the following suggestions, but just spit balling here...

How about some derivative of...

Alternatively could make the long/wide distinction obvious?

I prefer the first option here though.

Some random other notes relating to the previous comments on this issue:

If we want to phase out geodata: I'm not sure if we should anymore, as the runs I've been doing are using all the same initial conditions, seeding, ground truth data files (with multiple subpopulations) and using the geodata file to define just which subpops to look at in a given config. I think this is a good flow of the pipeline?

Regarding inference on initial conditions -> I'm not sure exactly what is done at the moment but we have not properly stress tested the proportional method (perturbation of initial conditions is currently broken).

alsnhll commented 6 months ago

I'm just adding another note about the issue of the geodata file. When looking through our code for hierarchical likelihoods, I realized that the geodata file can also be used to specify groups of subpopulations that should be considered to have similar parameter values. An extra column can be added, and this column will be used to calculate an additional term in the likelihood - a sort of post-hoc group-level modeling approach - that penalizes parameter proposals whenever grouped subpopulations have values that are further apart from each other (or something like that, method a bit unclear). This is a reason we need to keep the geodata file.

alsnhll commented 5 months ago

I am proposing a new way of specifying initial conditions and their options, getting rid of these confusing method names like SetInitialConditionsFolderDraw. It includes ideas for how perturbations (inference) and file saving work. Check out this file for the proposed config options and their meanings. Share any feedback here or in comments on the spreadsheet! https://docs.google.com/spreadsheets/d/1ITgNAFuGKRhrwX_pvLUqaWq0OWmpCPvhjjdldvnYIA0/edit?usp=sharing (you should have access to this if you have access to the flepimop google drive folder) Note that we are no longer proposing to remove the geodata file. The file still exists and is used for the subpop list, but the subpop sizes in there are not necessarily used - this column is not even required - they are by default taken from initial conditions. We can make very clear warnings to the user about this.