Vector inputs for ODE models

pratikunterwegs commented 6 months ago

This is a substantial user-facing and internal update to {epidemics}. Please treat any review as a full-package review.

Context

The {epidemics} use case was identified as multiple (100s to 1000s) runs of each model; this is because:
Users are expected to want to incorporate parameter uncertainty directly rather than manually looping over parameters;
Users are expected to want to run multiple scenarios with parameter uncertainty, and compare across them without the comparison being biased by differences in the random number draws.
See also
- This discussion on the use case;
- Issue #160 for discussions on the possible new interface;
- This GitHub Gist for initial rough implementations of the new interface.
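
To illustrate the second point above, here is a minimal base-R sketch of reusing one set of parameter draws across scenarios, so that differences between scenarios are not driven by differences in the random draws. `toy_outcome()` and all values are hypothetical stand-ins, not {epidemics} code.

```r
# Draw the uncertain parameter once and reuse the same draws in each scenario
set.seed(1)
beta_draws <- rnorm(100, mean = 1.3 / 7, sd = 0.01)

# Hypothetical toy "model", NOT an {epidemics} function:
# returns a crude outcome that scales with the transmission rate
toy_outcome <- function(beta, reduction = 0) {
  beta * (1 - reduction) * 1000
}

baseline <- toy_outcome(beta_draws)
intervention <- toy_outcome(beta_draws, reduction = 0.2)

# Paired differences: each element compares scenarios under the same draw
averted <- baseline - intervention
summary(averted)
```

Because both scenarios share `beta_draws`, the comparison is paired draw by draw rather than between two independent samples.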

Changes in this PR

All models have only a single exported version, named model_*(); this is WIP #162
- This is a breaking change with downstream effects on e.g. training materials, cc-ing @avallecam to notify of incoming changes;
- All ODE models' exported versions call the Rcpp internal functions which use Boost solvers; these are exposed to R as .model_*_cpp()
- An open question is whether and how best to test the remaining R-only ODE system code
All ODE models accept infection parameters (and time_end) as numeric vectors; this fixes #166
- Model function bodies have been changed to check input infection parameters, check that they are recyclable following Tidyverse rules, and to create a parameter table;
- Model function bodies now use {data.table} extensively;
- Passing scalar values to infection parameters returns a simple <data.table> rather than a nested one with a single row. This is to avoid breaking changes for existing users. It is also probably more appropriate, as a single model run is unlikely to require returning the parameters. The return is a <data.table> rather than a <data.frame> to avoid differences in return type. This fixes #177
All ODE models accept lists of intervention sets, and lists of <vaccination> as inputs to intervention and vaccination respectively; this fixes #167
- An intervention set is a list of <intervention>s; the new input supports lists of lists of <intervention>s
- It is not yet possible to pass a list to population, time_dependence, or population_change. Whether this is necessary for the latter two is an open question; the need for multiple populations is noted in #181 but not tackled here;
- {data.table} is used extensively to create intervention combinations within functions
- Cross-checking inputs such as interventions has been simplified with the introduction of general functions .cross_check_*() which are used in model-specific argument checker/preparation functions; this fixes #175
An updated and renamed version of the vignette on "Modelling parameter uncertainty" now shows how to pass infection parameters as vectors, and how to pass intervention sets and vaccinations as lists; this fixes #183
All ODE models are now tested more extensively for scalar and vector inputs, as well as error messages; this provisionally fixes #178
The Vacamole model has been restructured as well to match the internal ODE structure; this provisionally fixes #143:
- Susceptibility reduction due to vaccination (susc_reduction_vax) is now transmissibility for vaccinated individuals (transmissibility_vax);
- Hospitalisation reduction due to vaccination (hosp_reduction_vax) is now the hospitalisation rate for vaccinated individuals (hospitalisation_rate_vax);
- Mortality reduction due to vaccination (mort_reduction_vax) is now the mortality rate for vaccinated individuals (mortality_rate_vax);
- All arguments have the same default value as earlier, i.e., 80% of the value for unvaccinated individuals;
- These parameters can now be targeted by <rate_intervention>s
Adds @TimTaylor and @adamkucharski as authors
Standardises file naming for the R code, the C++ source and header code, vignettes, and test files
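
As an illustration of the vector-input change, a minimal base-R sketch of checking that parameters are recyclable (length 1 or one common length) and building a parameter table. `make_param_table()` is a hypothetical helper; the actual implementation in this PR uses {data.table} and may differ.

```r
# Hypothetical helper: check that all parameters have length 1 or a single
# common length, then recycle them into a parameter table
make_param_table <- function(...) {
  params <- list(...)
  lens <- lengths(params)
  n <- max(lens)
  if (!all(lens == 1L | lens == n)) {
    stop("All parameters must be of length 1 or of the same length")
  }
  # rep_len() recycles scalars to the common length
  as.data.frame(lapply(params, rep_len, length.out = n))
}

# Illustrative parameter names and values
param_table <- make_param_table(
  transmissibility = c(0.18, 0.19, 0.20),
  infectiousness_rate = 0.5,
  recovery_rate = 1 / 7
)
param_table
```

Each row of the table corresponds to one model run; vectors of mismatched lengths (e.g. 2 and 3) raise an error rather than being silently recycled.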

Planned changes not in this PR

Vector inputs to the stochastic Ebola model;
More tests for the stochastic Ebola model;
More tests for S3 classes;
Handling remaining R-only ODE model code;
General package health improvements per {goodpractice} suggestions
See this GitHub project for more and/or linked issues.

github-actions[bot] commented 5 months ago

This pull request:

Adds 0 new dependencies (direct and indirect)
Adds 0 new system dependencies
Removes 1 existing dependencies (direct and indirect)
Removes 1 existing system dependencies

(Note that results may be inaccurate if you branched from an outdated version of the target branch.)

github-actions[bot] commented 5 months ago

This pull request:

Adds 4 new dependencies (direct and indirect)
Adds 1 new system dependencies
Removes 1 existing dependencies (direct and indirect)
Removes 1 existing system dependencies

(Note that results may be inaccurate if you branched from an outdated version of the target branch.)

pratikunterwegs commented 5 months ago

Thanks @jamesmbaazam for the feedback - happy to hear more on some of the points you've raised.

> Other comments

> In .check_prepare_args_default(), a lot of subsetting is done in a pattern that could potentially be turned into helper functions. Can subsetting operations like mod_args[["vaccination"]][["time_begin"]] be combined into a subsetting function that takes an "intervention" and "element" argument? This would reduce the hardcoding of the list elements and make the code more readable and maintainable.

> I see get_parameter() has been removed, but the subsetting operations that replaced it are identical to what get_parameter() would do. Why was it removed instead of being updated to do the new list subsetting? Could you convert all the x$<some element> into a getter function like get_parameter() previously? This would make the code more readable.

Taking these two together as they're related - I felt get_parameter() wasn't offering much over inbuilt accessors. We have a preference for list-based S3 classes, so the [[ accessor will likely remain robust. get_parameter() was also exported, which meant adding input checks, which I felt slowed down internal functions that used it (as in .check_prepare_args()). Stripping those checks away for an internal-use version, say .get_parameter(), did not seem to offer much over [[. In .check_prepare_args_*(), for example, get_parameter() use would look like get_parameter(mod_args[["some element"]], "some member").
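
A small sketch of the comparison, assuming an illustrative nested list in place of the real model arguments; `.get_element()` is a hypothetical unchecked getter, not a function in the package.

```r
# Illustrative nested list standing in for the model arguments
mod_args <- list(
  vaccination = list(time_begin = c(10, 20), time_end = c(50, 60))
)

# Inbuilt accessor, as used in this PR
mod_args[["vaccination"]][["time_begin"]]

# Hypothetical unchecked getter: a thin wrapper that adds a function call
# but no functionality beyond `[[`
.get_element <- function(x, component, element) {
  x[[component]][[element]]
}

same_result <- identical(
  .get_element(mod_args, "vaccination", "time_begin"),
  mod_args[["vaccination"]][["time_begin"]]
)
same_result
```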

> Can the details section of all the check_args_*() functions be combined into a single template? They seem to be quite similar and repeated and might be easier to maintain if turned into one.

Do you mean the overall documentation for these functions named .check_prepare_args_*()? They could be combined. The Return section would be quite long, since it would repeat elements a few times, once per model in case of slight differences.

> Can the R/dummy_elements.R script be renamed to something more descriptive? The no_time_dependence() function could be moved to the time_dependence.R script and the no_population_change() function moved somewhere with related content. This will make it easier to find the functions when needed.

I can return the dummy-elements functions to files that hold the classes and do away with this one.

> output_to_df() should probably be renamed to output_to_dt() since it returns a data.table. The return value should also be changed from data.frame to data.table to reflect the code.

Thanks - I'll change the documented return type. I can rename this function to .output_to_table(), as I'm not sure that we'll stick with <data.table>.

> epidemic_size():

> The return value needs to be updated to say it returns a data.table.

Thanks, got it.

> Could you enhance it to return the epidemic size by a selection of age groups?

This is a nice feature in theory, but in practice it may be good to restrict it to the current implementation. Demography groups in the population do not have to be named, but might be, depending on whether the demography vector or contact matrix has names. It might be easier for users to filter a resulting data.frame of demography group, time, and size on the names of the demography groups as shown in the column. I can add an issue for epidemic_size() to return a table-like object rather than a vector.
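
A minimal sketch of that filtering, assuming a hypothetical table-like return with one row per demography group (epidemic_size() currently returns a vector; the column names and values here are illustrative).

```r
# Hypothetical table-like return: one row per demography group
size_table <- data.frame(
  demography_group = c("[0,20)", "[20,40)", "40+"),
  time = 100,
  size = c(1200, 3400, 2100)
)

# Users can filter on the group names shown in the column
selected <- size_table[size_table$demography_group %in% c("[0,20)", "40+"), ]
sum(selected$size)
```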

> Could you return the "stage" as part of the output? I can easily imagine that being useful in cases where a user wants to plot the epidemic size at various stages of the outbreak.

> I would suggest renaming "stage" to timepoint or something similar and filtering by time instead of stage. We often think of epidemic timelines in terms of dates and not stages. Using stages forces you to think in terms of percentages of the whole timeline. Additionally, you could make "stage" an optional argument alongside the new "time" argument.

Yes to returning the stage as well as renaming it - collecting these feature requests in #190.

The reason epidemic_size() uses a percentage term is to quickly access e.g. the last size, the size at 50% time, etc., in a way that's agnostic to the actual time the epidemic runs. I can shift to a timepoint-based implementation (with optional time, returning final size by default) if that's more useful/more understandable. It would help to consider the expected output when a simulation does not run up to the requested time - should it error, or return a size of 0, or return NA, or return the last size available? Note that the last size might not be the 'final size' as an epidemic might not be complete by the simulation end time.
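
To make the percentage-based lookup concrete, a minimal sketch assuming a toy output with `time` and cumulative `size` columns; `size_at_stage()` is hypothetical and not the actual epidemic_size() implementation.

```r
# Toy output: cumulative epidemic size over time (illustrative values)
output <- data.frame(time = 0:100, size = cumsum(c(0, rep(10, 100))))

# Hypothetical helper: size at a fraction (`stage`) of the simulated time,
# independent of how long the simulation actually ran
size_at_stage <- function(data, stage = 1.0) {
  stopifnot(stage >= 0, stage <= 1)
  target_time <- stage * max(data$time)
  # use the last time point at or before the target
  idx <- max(which(data$time <= target_time))
  data$size[idx]
}

size_at_stage(output, stage = 0.5) # size at 50% of the simulation time
size_at_stage(output, stage = 1.0) # last size in the time series
```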

> Is the term "demography_vector" chosen for any reason? I think in the contexts it's used here, it mostly refers to the population size per age group. Could you rename it to something like "group_pop_size" or something similar since demography is too broad a term?

I think this was adopted from {finalsize}, which adopted it from earlier implementations. I think it's not too difficult to understand so unless there's a good reason to change it, I'll keep it to conform with {finalsize}.

> Can population_change in model_diphtheria() be structured in a manner to take a time series of population changes? For example, I might want to model a population change that occurs at a certain time point and then another change that occurs at a later time point. This would allow for more flexibility in the model.

Yes; here's an example showing both increases and decreases in certain groups, at t = 70 and t = 100:

# the `values` list must always have as many elements as the `time` vector
# and each element of `values` must be of the same length as `demography_vector`
pop_change <- list(
  time = c(70, 100),
  values = list(
    change_1 = c(1e4, 1e5, 2e5),
    change_2 = c(-9e3, 1e3, -1e5)
  )
)

I considered adding this to the diphtheria vignette but this functionality might need more thinking through before a more detailed example is added.
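
The two constraints on the `values` list can be sketched as explicit checks. `check_pop_change()` and the `demography_vector` values are hypothetical, for illustration only; this is not the package's internal validation.

```r
# Illustrative demography vector with three groups
demography_vector <- c(1e5, 2e5, 3e5)

pop_change <- list(
  time = c(70, 100),
  values = list(
    change_1 = c(1e4, 1e5, 2e5),
    change_2 = c(-9e3, 1e3, -1e5)
  )
)

# Hypothetical checker (named stopifnot() conditions need R >= 4.0.0)
check_pop_change <- function(x, demography_vector) {
  stopifnot(
    "`values` must have one element per `time`" =
      length(x$values) == length(x$time),
    "each element of `values` must match `demography_vector` in length" =
      all(lengths(x$values) == length(demography_vector))
  )
  invisible(TRUE)
}

check_pop_change(pop_change, demography_vector)
```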

> Can model_vacamole() be renamed to model_covid19() or something descriptive so as to align with the other models named after the disease they represent? I can understand if you're trying to use the original name to conserve the credit (or maybe I'm not privy to the reuse agreement) but credit can be given in the function documentation and README as is currently done.

It could indeed be. One reason I would not do that is that the philosophy of {epidemics} is now more towards collecting published models that can target types of diseases rather than specific diseases, e.g., the default model was recently used for a range of 11 diseases. Vacamole might be suitable for flu etc. as well, and the name refers to the two-dose leaky vaccination scenario with a vaccination-mediated infection pathway it implements, so I think it would be good to retain it.

> I would suggest doing the input checks outside of the model_*() functions to make the code more focused on the model rather than being overpowered with input checks.

Do you mean that in an analysis script users should run input checking on their arguments first, and then pass the checked arguments to model functions? I am not against that - it would certainly make the model functions lighter and more readable.

My feeling has been that users would appreciate self-contained functions more, where they only really have to create a <population> and pass it to a function which will do the rest. Passing a checked argument list would probably require adding default_args_model_*() functions, which users would have to modify to pass their own values. This is the approach I've taken in {noromod} for @bolthikaru. Happy to hear input on the benefits of this approach. Also cc-ing @TimTaylor in case interested.

jamesmbaazam commented 5 months ago

> Thanks @jamesmbaazam for the feedback - happy to hear more on some of the points you've raised.

> Other comments

> In .check_prepare_args_default(), a lot of subsetting is done in a pattern that could potentially be turned into helper functions. Can subsetting operations like mod_args[["vaccination"]][["time_begin"]] be combined into a subsetting function that takes an "intervention" and "element" argument? This would reduce the hardcoding of the list elements and make the code more readable and maintainable.

> I see get_parameter() has been removed, but the subsetting operations that replaced it are identical to what get_parameter() would do. Why was it removed instead of being updated to do the new list subsetting? Could you convert all the x$<some element> into a getter function like get_parameter() previously? This would make the code more readable.

> Taking these two together as they're related - I felt get_parameter() wasn't offering much over inbuilt accessors. We have a preference for list-based S3 classes, so the [[ accessor will likely remain robust. get_parameter() was also exported, which meant adding input checks, which I felt slowed down internal functions that used it (as in .check_prepare_args()). Stripping those checks away for an internal-use version, say .get_parameter(), did not seem to offer much over [[. In .check_prepare_args_*(), for example, get_parameter() use would look like get_parameter(mod_args[["some element"]], "some member").

Thanks for the explanation. It makes sense to me.

> Can the details section of all the check_args_*() functions be combined into a single template? They seem to be quite similar and repeated and might be easier to maintain if turned into one.

> Do you mean the overall documentation for these functions named .check_prepare_args_*()? They could be combined. The Return section would be quite long, since it would repeat elements a few times, once per model in case of slight differences.

Yes, I meant the overall documentation. I suggested that after noticing the patterns, but it might not make sense to combine them.

> Can the R/dummy_elements.R script be renamed to something more descriptive? The no_time_dependence() function could be moved to the time_dependence.R script and the no_population_change() function moved somewhere with related content. This will make it easier to find the functions when needed.

> I can return the dummy-elements functions to files that hold the classes and do away with this one.

Yes, it would make it easier to navigate the code base.

> output_to_df() should probably be renamed to output_to_dt() since it returns a data.table. The return value should also be changed from data.frame to data.table to reflect the code.

> Thanks - I'll change the documented return type. I can rename this function to .output_to_table(), as I'm not sure that we'll stick with <data.table>.

Alright.

> epidemic_size():

> The return value needs to be updated to say it returns a data.table.

> Thanks, got it.

> Could you enhance it to return the epidemic size by a selection of age groups?

> This is a nice feature in theory, but in practice it may be good to restrict it to the current implementation. Demography groups in the population do not have to be named, but might be, depending on whether the demography vector or contact matrix has names. It might be easier for users to filter a resulting data.frame of demography group, time, and size on the names of the demography groups as shown in the column. I can add an issue for epidemic_size() to return a table-like object rather than a vector.

> Could you return the "stage" as part of the output? I can easily imagine that being useful in cases where a user wants to plot the epidemic size at various stages of the outbreak.

> I would suggest renaming "stage" to timepoint or something similar and filtering by time instead of stage. We often think of epidemic timelines in terms of dates and not stages. Using stages forces you to think in terms of percentages of the whole timeline. Additionally, you could make "stage" an optional argument alongside the new "time" argument.

> Yes to returning the stage as well as renaming it - collecting these feature requests in #190.

Great

> It would help to consider the expected output when a simulation does not run up to the requested time - should it error, or return a size of 0, or return NA, or return the last size available?

It is often a difficult situation but sticking to a simple and easy-to-debug solution for now might do for a first pass. It can be improved with user feedback. I'm inclined to suggest returning the last size with a warning and some text on how to interpret it.
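
A rough sketch of that suggestion, assuming a toy output with `time` and cumulative `size` columns; `size_at_time()` is a hypothetical name and not part of the package.

```r
# Toy output: cumulative epidemic size over time (illustrative values)
output <- data.frame(time = 0:100, size = seq(0, 1000, by = 10))

# Hypothetical helper: return the size at `time`; if `time` is beyond the
# simulation end, warn and fall back to the last available size
size_at_time <- function(data, time) {
  max_time <- max(data$time)
  if (time > max_time) {
    warning(
      "Requested time ", time, " exceeds the simulation end time ", max_time,
      "; returning the last available size, which may not be the final size."
    )
    time <- max_time
  }
  data$size[max(which(data$time <= time))]
}

size_at_time(output, time = 70)
# time beyond the simulation end: warns and returns the last size
suppressWarnings(size_at_time(output, time = 150))
```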

> Note that the last size might not be the 'final size' as an epidemic might not be complete by the simulation end time.

Wouldn't this be the case for stage = 1 too?

> Is the term "demography_vector" chosen for any reason? I think in the contexts it's used here, it mostly refers to the population size per age group. Could you rename it to something like "group_pop_size" or something similar since demography is too broad a term?

> I think this was adopted from {finalsize}, which adopted it from earlier implementations. I think it's not too difficult to understand so unless there's a good reason to change it, I'll keep it to conform with {finalsize}.

> Can population_change in model_diphtheria() be structured in a manner to take a time series of population changes? For example, I might want to model a population change that occurs at a certain time point and then another change that occurs at a later time point. This would allow for more flexibility in the model.

> Yes; here's an example showing both increases and decreases in certain groups, at t = 70 and t = 100:
>
> # the `values` list must always have as many elements as the `time` vector
> # and each element of `values` must be of the same length as `demography_vector`
> pop_change <- list(
>   time = c(70, 100),
>   values = list(
>     change_1 = c(1e4, 1e5, 2e5),
>     change_2 = c(-9e3, 1e3, -1e5)
>   )
> )

Ah, nice!

> I considered adding this to the diphtheria vignette but this functionality might need more thinking through before a more detailed example is added.

> Can model_vacamole() be renamed to model_covid19() or something descriptive so as to align with the other models named after the disease they represent? I can understand if you're trying to use the original name to conserve the credit (or maybe I'm not privy to the reuse agreement) but credit can be given in the function documentation and README as is currently done.

> It could indeed be. One reason I would not do that is that the philosophy of {epidemics} is now more towards collecting published models that can target types of diseases rather than specific diseases, e.g., the default model was recently used for a range of 11 diseases. Vacamole might be suitable for flu etc. as well, and the name refers to the two-dose leaky vaccination scenario with a vaccination-mediated infection pathway it implements, so I think it would be good to retain it.

That makes sense for the generic models but there are other models named after diseases here as well. No?

> I would suggest doing the input checks outside of the model_*() functions to make the code more focused on the model rather than being overpowered with input checks.

> Do you mean that in an analysis script users should run input checking on their arguments first, and then pass the checked arguments to model functions? I am not against that - it would certainly make the model functions lighter and more readable.

Oh no, I see I didn't phrase that well. I meant that, for readability, you should consider bundling up the checks into helper functions as is done in our other packages like simulist. But this might be a matter of style.

Your suggestion here is also good but could potentially lead to downstream issues if you don't put in measures to ensure the user checks their inputs first.

> My feeling has been that users would appreciate self-contained functions more, where they only really have to create a <population> and pass it to a function which will do the rest. Passing a checked argument list would probably require adding default_args_model_*() functions, which users would have to modify to pass their own values. This is the approach I've taken in {noromod} for @bolthikaru. Happy to hear input on the benefits of this approach. Also cc-ing @TimTaylor in case interested.

pratikunterwegs commented 5 months ago

> Note that the last size might not be the 'final size' as an epidemic might not be complete by the simulation end time.

> Wouldn't this be the case for stage = 1 too?

Yes, it's simply the last size in the time series. Some user care is required in determining whether it is really the 'final size', after looking at the trajectory.

> That makes sense for the generic models but there are other models named after diseases here as well. No?

There are - we don't have good naming options for these. We have not made much progress in implementing more models, so general rules for model names have not really been developed.

pratikunterwegs commented 5 months ago

> Do you mean that in an analysis script users should run input checking on their arguments first, and then pass the checked arguments to model functions? I am not against that - it would certainly make the model functions lighter and more readable.

> Oh no, I see I didn't phrase that well. I meant that, for readability, you should consider bundling up the checks into helper functions as is done in our other packages like simulist. But this might be a matter of style.

This could indeed be done. In {finalsize} some of the discussion was around how much to show this checking and preparation, and the idea was to strike a balance to prevent the codebase spreading over too many files. To some extent this is the role of the .check_prepare_args_*() functions.
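
As a rough illustration of that pattern, a sketch with a single helper that checks and prepares arguments, so the model function body stays short. `.check_prepare_args_toy()` and `model_toy()` are hypothetical names that mirror the naming pattern only; the checks shown are not the package's code.

```r
# Hypothetical helper bundling input checks and preparation
.check_prepare_args_toy <- function(mod_args) {
  stopifnot(
    is.numeric(mod_args[["transmissibility"]]),
    all(mod_args[["transmissibility"]] >= 0),
    is.numeric(mod_args[["time_end"]]),
    all(mod_args[["time_end"]] > 0)
  )
  # preparation step: derive a value a solver might need
  mod_args[["n_steps"]] <- as.integer(mod_args[["time_end"]])
  mod_args
}

# The user-facing function stays self-contained but focused on the model
model_toy <- function(transmissibility = 0.2, time_end = 100) {
  mod_args <- .check_prepare_args_toy(
    list(transmissibility = transmissibility, time_end = time_end)
  )
  # ... an ODE solver would be called here with `mod_args` ...
  mod_args
}

str(model_toy(transmissibility = 0.25, time_end = 50))
```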

pratikunterwegs commented 5 months ago

> Can the R/dummy_elements.R script be renamed to something more descriptive? The no_time_dependence() function could be moved to the time_dependence.R script and the no_population_change() function moved somewhere with related content. This will make it easier to find the functions when needed.

> I can return the dummy-elements functions to files that hold the classes and do away with this one.

These elements don't have a better place to go, as they are not classes.

pratikunterwegs commented 5 months ago

Thanks both @jamesmbaazam and @adamkucharski for your reviews. @CarmenTamayo, you mentioned you might have some feedback as well, would you like to add it? Otherwise I think this PR is ready to merge.

CarmenTamayo commented 5 months ago

> Thanks both @jamesmbaazam and @adamkucharski for your reviews. @CarmenTamayo, you mentioned you might have some feedback as well, would you like to add it? Otherwise I think this PR is ready to merge.

Hi @pratikunterwegs, I provided some feedback as well, specifically on the new vignette for modelling parameter uncertainty. Can you see my comments? I wrote them last week but hadn't submitted them; they're available now. Apologies for this.

pratikunterwegs commented 5 months ago

> Sorry Pratik, I had pending comments but hadn't actually submitted them

Thanks! I remembered you'd asked me questions about some parts of this PR so I knew there must be comments somewhere!

pratikunterwegs commented 5 months ago

Thanks all for your reviews - merging this now so we can make smaller PRs in future for related issues.

epiverse-trace / epidemics