Specifying columns when simulating outbreak data

CarmenTamayo commented 8 months ago

I'm using simulist to simulate linelist data to use with {cfr}; at the moment the package requires you to specify an onset to hospitalisation delay and a column with hospitalisations is always created. However for this use case I don't really need this column, which has made me wonder whether it should be included as a requirement. Further, at the moment the epiparameter database only has data for COVID onset-hospitalisation, so it's unlikely that many users will be able to use this column unless they're working specifically with COVID.

For these reasons I'm raising this issue to start a discussion on whether hospitalisations should be required, and if the benefit outweighs the nuisance of almost always having to manually delete the column afterwards unless you want to simulate COVID data.

joshwlambert commented 8 months ago

Thanks for raising this @CarmenTamayo.

If I understand correctly, you're seeing whether not specifying the onset_to_hosp argument could be an option, in which case the $date_admission column would be removed (i.e. not created) from the line list returned by sim_linelist().

If so I think the same logic and implementation could be applied to the onset_to_death argument and resulting $date_death column.

This change is slightly complicated by the interaction of onset_to_hosp & onset_to_death with the hosp_risk, hosp_death_risk and non_hosp_death_risk arguments which would need to be cross checked to make sure the user isn't mis-specifying, but I don't think that's a major blocker of this change.

joshwlambert commented 8 months ago

I do not have strong opinions on keeping with the current implementation or moving towards what you suggested. However, I see this issue an primarily an {epiparameter} issue. The infrastructure provided by {epiparameter} - the <epidist> constructor function (epidist()) - as well as the ability to specify an anonymous function to onset_to_hosp and onset_to_death in {simulist} mean that using sim_linelist() should not depend on the {epiparameter} database, as users can easily specify their own distribution (<epidist> or anonymous function).

I think this issue should also be made over in the {epiparameter} repository, and hopefully we as a team can contribute more entries in the database. This benefits both packages.

Better documentation on the {simulist} side may also be preferable to breaking changes of this sort and the added complexity of argument interactions.

Tagging @adamkucharski @sbfnk & @Bisaloo for opinions on both the code and whether epidemiologists regularly require line list data containing hospitalisation and death data.

adamkucharski commented 8 months ago

Expect a common use case for simulist will be testing methods (e.g. calculating individual level delays to outcome, or aggregating to use with {cfr}, as well as Rt estimation etc.), so I think we need these options to be easily accessible to users. But agree that others may want to do a quick simulation of transmission chains without needing to specify a large number of parameters. Rather than changing data structure, would it be easier to just set columns like date_admission=NA if user doesn't specify a delay distribution and/or risk?

joshwlambert commented 8 months ago

Rather than changing data structure, would it be easier to just set columns like date_admission=NA if user doesn't specify a delay distribution and/or risk?

I think this is a good compromise. It would make it easier for users that did not have certain onset to event parameters and wouldn't change the dimensionality of the simulation output depending on the function arguments. This second point is beneficial when used in pipelines and the downstream pipeline expects a line list with a given number of columns and headers.

Another option, either in combination with setting arguments to NA or alone, would be to allow uses to specify single numbers (i.e. scalars) for the arguments that currently require a distribution (either <epidist> or anonymous function). This would produce a constant/deterministic delay time between events, so would not be very useful for simulating realistic line lists, but would provide a bridge for users who want to quickly simulate data but do not need the detail of parameterised distributions for: contact distribution, infectious period, onset-to-hospitalisation or onset-to-death.

Thoughts on this proposal would be good before I work on the implementation.

Bisaloo commented 7 months ago

Being able to not specify the arguments listed here makes sense to me
I agree with the idea of keeping the output structure constant and set columns to NA where necessary
For scalars, let's wait on having a specific use case as to not overcomplicate this package for something that may end up not being used.

joshwlambert commented 7 months ago

Thanks all for input on this issue. I will move forward implementing the option to specify onset_to_hosp and onset_to_death arguments as NA which will set the resulting $date_admission and $date_death columns to NA.

joshwlambert commented 7 months ago

The last discussion point for this feature, when the user inputs an NA for onset_to_hosp or onset_to_death and leaves the input for hosp_risk, hosp_death_risk or non_hosp_death_risk as a number (I plan to leave the default values as they are now, 0.2, 0.5, 0.05, respectively), should the function warn or error?

My preference is for it to warn with a message that the risks that the relevant risks are being ignored and should be set to NA. This will add some noise to the output but it seems like a simple mistake for a user to make. The error will ensure that all the arguments are consistent/compatible by forcing the user to change the *_risk argument from their default values but might put users off if it feels too complex.

Bisaloo commented 7 months ago

A warning is slightly more common in this situation but happy to let you decide since both options are defensible.

epiverse-trace / simulist

Specifying columns when simulating outbreak data #94