Proposed future workflow

sbfnk commented 2 days ago

In the process of addressing #131 workflows for estimating contact matrices will change. The proposal is to move some of the functionality hidden inside the contact_matrix() function into separate functions. This means that instead of (currently)

contact_matrix(polymod, countries = "United Kingdom", age.limits = c(0, 1, 5, 15), symmetric = TRUE)

One would do something like

uk_contacts <- polymod |>
  filter(country == "United Kingdom") |>
  process_age(age_limits = c(0, 1, 5, 15))

uk_pop <- uk_contacts |>
  country_population()

cm <- uk_contacts |>
  contact_matrix(by = "age", symmetric = TRUE, survey.pop = uk_pop)

It's quite a lot more typing but hopefully more explicit and less black box processing thus less likely to lead to misinterpretation / misunderstanding.

joshwlambert commented 2 days ago

I think the current functionality is a lot cleaner and will be easier for new users coming to the package. It looks like at least one of the functions in the second code chunk are already exported from {socialmixr}, therefore I have two questions:

Could you not offer both sets of functionality, i.e. retain the original function and allow users to do the second if they wish (the contact_matrix() function in the second code chunk would need a different name)?
What are the misinterpretations/misunderstandings that you see users having with the current approach? As I'm not a regular user of this package it might help clarify how the proposed approach addresses these.

sbfnk commented 2 days ago

Thanks, looks like some motivation/context was missing from the original post. The contact_matrix() function currently has 22 arguments and passes arguments in ellipses ... to three other functions, and 800 lines of code, making it unwieldy both from a user and developer perspective. The rationale for removing the processing of demographic data, filtering of surveys etc. out of the function is to expose more of the internals such as how age is processed, what demographic data is used etc. whilst having smaller functions with fewer arguments. The downside is more function calls for the user, especially when previously using arguments defaults.

One related question is whether we want to be able to address #143 - at the moment the contact_matrix() function does a lot of manipulation of age but if it is to allow generation of general contact matrices (e.g. by sex) then it really makes sense to move all the functionality for estimating/grouping ages into other functions because users might not be interested in age, but also because if not contact_matrix() will have to grow even bigger / need more arguments.

Re (1) yes we could and perhaps it's the way forward, but from a maintenance point of view it would help to be a bit more prescriptive on the workflow.

bahadzie commented 1 day ago

 uk_contacts <- polymod |>
  as_contact_survey() |>
  filter(country == "United Kingdom") |>
  process_age(age_limits = c(0, 1, 5, 15))

uk_pop <- uk_contacts |>
  country_population()

cm <- uk_contacts |>
  contact_matrix(by = "age", symmetric = TRUE, survey.pop = uk_pop)

I would change

contact_matrix(by = "age", symmetric = TRUE, survey.pop = uk_pop) can become

to

`cm(by = "age", symmetric = TRUE, survey.pop = uk_pop)

On CRAN {o2geosocial} and {finalsize} are reverse (Suggests:) dependencies. Not sure how we feel about potential breaking changes. If long term the idea is to move to {contactmatrix} then I think we should maintain contact_matrix()'s current function signature but still refactor the code to split it out into smaller functions that will reduce its cyclomatic complexity which at 143 is almost 10X the recommended {lintr} limit of 15.

IMO treating age as a variable useful for stratification simplifies implementing #143 . Of course we still need the existing age estimation functionality but this can be a separate function that is called by contact_matrix()

Section comments (below) from contact_matrix() are a useful blueprint for potential functions.

  ## === check arguments and define variables
  ## === check and clean survey
  ## === check if specific countries are requested (if a survey contains data from multiple countries)
  ## === age processing: deal with ranges and missing data
  ## === check if any filters have been requested
  ## === adjust age.group.brakes to the lower and upper ages in the survey
  ## === if split, symmetric or age weights are requested, get demographic data (survey population)
  ## === process weights
  ## === merge participants and contacts into a single data table
  ## === process contact age ranges / missing ages
  ## === sample participants randomly (if requested)
  ## === calculate weighted contact matrix
  ## === split contact matrix
  ## === option to add matrix per capita, i.e. the contact rate of age i with one individual of age j in the population.
  ## === option to return participant weights

Potential mappings are...

as_contact_survey(){
  ## === check and clean survey
  ## === check if specific countries are requested (if a survey contains data from multiple countries)
  ## === merge participants and contacts into a single data table
}

filter() {
  ## === check if any filters have been requested
}

process_age(){
  ## === age processing: deal with ranges and missing data
  ## === adjust age.group.brakes to the lower and upper ages in the survey
  ## === process contact age ranges / missing ages
}

as_contact_survey(){
  ## === check if specific countries are requested (if a survey contains data from multiple countries)
}

joshwlambert commented 1 day ago

Thanks for the additional context. The misunderstanding/misinterpretation comes from the number of arguments in contact_data() and the misuse of argument defaults?

One related question is whether we want to be able to address https://github.com/epiforecasts/socialmixr/issues/143 - at the moment the contact_matrix() function does a lot of manipulation of age but if it is to allow generation of general contact matrices (e.g. by sex) then it really makes sense to move all the functionality for estimating/grouping ages into other functions because users might not be interested in age, but also because if not contact_matrix() will have to grow even bigger / need more arguments.

I agree that the proposed refactor is the right development direction, as the single point of user interaction (contact_matrix()) does not scale well as you generalise.

If long term the idea is to move to {contactmatrix} then I think we should maintain contact_matrix()'s current function signature

The plan is to use the <contactmatrix> class from {contactmatrix} in {socialmixr} but not move functionality currently in {socialmixr} into {contactmatrix}? Is my understanding of this correct @Bisaloo & @sbfnk?

sbfnk commented 1 day ago

Thanks for the additional context. The misunderstanding/misinterpretation comes from the number of arguments in contact_data() and the misuse of argument defaults?

What I meant was the risk is from doing lots of processing inside the function that isn't really exposed to the user (e.g. there is no way for the user currently to see the result of interpolating within age ranges, or of puling in and manipulating demographic data - they only see the end result in the matrix). In that sense I see a risk of misinterpretation due to lack of visibility of what happens inside the contact_matrix() function. That said I have no evidence / examples of this actually having ever been an issue.

epiforecasts / socialmixr

Proposed future workflow #161