Closed CarmenTamayo closed 7 months ago
Thanks for raising this @CarmenTamayo.
If I understand correctly, you're seeing whether not specifying the onset_to_hosp
argument could be an option, in which case the $date_admission
column would be removed (i.e. not created) from the line list returned by sim_linelist()
.
If so I think the same logic and implementation could be applied to the onset_to_death
argument and resulting $date_death
column.
This change is slightly complicated by the interaction of onset_to_hosp
& onset_to_death
with the hosp_risk
, hosp_death_risk
and non_hosp_death_risk
arguments which would need to be cross checked to make sure the user isn't mis-specifying, but I don't think that's a major blocker of this change.
I do not have strong opinions on keeping with the current implementation or moving towards what you suggested. However, I see this issue an primarily an {epiparameter} issue. The infrastructure provided by {epiparameter} - the <epidist>
constructor function (epidist()
) - as well as the ability to specify an anonymous function to onset_to_hosp
and onset_to_death
in {simulist} mean that using sim_linelist()
should not depend on the {epiparameter} database, as users can easily specify their own distribution (<epidist>
or anonymous function).
I think this issue should also be made over in the {epiparameter} repository, and hopefully we as a team can contribute more entries in the database. This benefits both packages.
Better documentation on the {simulist} side may also be preferable to breaking changes of this sort and the added complexity of argument interactions.
Tagging @adamkucharski @sbfnk & @Bisaloo for opinions on both the code and whether epidemiologists regularly require line list data containing hospitalisation and death data.
Expect a common use case for simulist will be testing methods (e.g. calculating individual level delays to outcome, or aggregating to use with {cfr}, as well as Rt estimation etc.), so I think we need these options to be easily accessible to users. But agree that others may want to do a quick simulation of transmission chains without needing to specify a large number of parameters. Rather than changing data structure, would it be easier to just set columns like date_admission=NA
if user doesn't specify a delay distribution and/or risk?
Rather than changing data structure, would it be easier to just set columns like
date_admission=NA
if user doesn't specify a delay distribution and/or risk?
I think this is a good compromise. It would make it easier for users that did not have certain onset to event parameters and wouldn't change the dimensionality of the simulation output depending on the function arguments. This second point is beneficial when used in pipelines and the downstream pipeline expects a line list with a given number of columns and headers.
Another option, either in combination with setting arguments to NA
or alone, would be to allow uses to specify single numbers (i.e. scalars) for the arguments that currently require a distribution (either <epidist>
or anonymous function). This would produce a constant/deterministic delay time between events, so would not be very useful for simulating realistic line lists, but would provide a bridge for users who want to quickly simulate data but do not need the detail of parameterised distributions for: contact distribution, infectious period, onset-to-hospitalisation or onset-to-death.
Thoughts on this proposal would be good before I work on the implementation.
NA
where necessaryThanks all for input on this issue. I will move forward implementing the option to specify onset_to_hosp
and onset_to_death
arguments as NA
which will set the resulting $date_admission
and $date_death
columns to NA
.
The last discussion point for this feature, when the user inputs an NA
for onset_to_hosp
or onset_to_death
and leaves the input for hosp_risk
, hosp_death_risk
or non_hosp_death_risk
as a number (I plan to leave the default values as they are now, 0.2, 0.5, 0.05, respectively), should the function warn or error?
My preference is for it to warn with a message that the risks that the relevant risks are being ignored and should be set to NA
. This will add some noise to the output but it seems like a simple mistake for a user to make. The error will ensure that all the arguments are consistent/compatible by forcing the user to change the *_risk
argument from their default values but might put users off if it feels too complex.
A warning is slightly more common in this situation but happy to let you decide since both options are defensible.
I'm using simulist to simulate linelist data to use with {cfr}; at the moment the package requires you to specify an onset to hospitalisation delay and a column with hospitalisations is always created. However for this use case I don't really need this column, which has made me wonder whether it should be included as a requirement. Further, at the moment the epiparameter database only has data for COVID onset-hospitalisation, so it's unlikely that many users will be able to use this column unless they're working specifically with COVID.
For these reasons I'm raising this issue to start a discussion on whether hospitalisations should be required, and if the benefit outweighs the nuisance of almost always having to manually delete the column afterwards unless you want to simulate COVID data.