A common need for users who work with linelist data will be to estimate delay distributions and then store them for future analysis. It could therefore be useful to have a vignette describing the key choices and caveats involved, as well as showcasing how to get the result into {epiparameter} format. The example code below shows estimation of onset-to-death from simulated right-truncated data during a growing epidemic, under two approximate (but fast and easy to implement) estimation methods:
- Naive (i.e. ignore truncation and just analyse all onset-to-death delays known as of the current date). This typically biases estimates downwards, especially in real time during a growing epidemic.
- Cohort approach (i.e. omit people with recent onset dates to construct a retrospective cohort). This reduces the bias but increases uncertainty, because more data points are omitted. Both effects are illustrated in the sketch just below this list.
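To make the direction of the bias concrete, here is a minimal self-contained sketch in base R (the lognormal parameters and the 20-day cut-off are hypothetical values chosen for illustration, not the COVID-19 estimates used in the main code further down):

```r
# Right truncation during exponential growth pulls the observed mean delay downwards
set.seed(1)
n <- 1e4
obs_day <- 50                        # "current" date of analysis
growth_rate <- 0.1                   # 10% daily growth
onset_day <- sample(1:obs_day, n, replace = TRUE,
                    prob = exp(growth_rate * (1:obs_day)))
true_delay <- rlnorm(n, meanlog = 2.5, sdlog = 0.5)  # mean delay ~14 days
death_day <- onset_day + true_delay

# Naive: only delays with a death observed by obs_day are available in real time
observed <- death_day <= obs_day
mean(true_delay)             # true mean delay
mean(true_delay[observed])   # naive real-time estimate: biased downwards

# Cohort: also drop onsets within 20 days of the current date
cohort <- onset_day <= obs_day - 20
mean(true_delay[observed & cohort])  # less biased, but uses far fewer points
sum(observed & cohort)               # number of data points left
```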
Simulating 500 onsets with an epidemic that grows at 10% per day (black points in the figure below show the true simulated distribution of onset-to-death), the naive estimate has 117 data points to fit to (red line) and the cohort estimate has 52 data points (blue line).
We could also show how the user could apply more advanced options, such as the {epidist} package, which would be particularly appropriate for small datasets where the cohort approach omits too much data and hence introduces too much uncertainty.
We could also incorporate an advanced section with specific case studies covering more complex (but common) data issues and the pragmatic assumptions they may require, like those described in this issue for {simulist} and some of the case studies @CarmenTamayo has been working on with global.health example data.
There may also be an eventual home for some of these ideas in {epipipe} or similar package, e.g. linking with {linelist} and {cleanepi}.
Code for simulations (this could probably be done much more efficiently using {simulist}, which would also highlight nice integration potential between these packages):
```r
library(epiparameter)
library(fitdistrplus)

# Define 50 day window for possible infections
max_day <- 50
time_x <- 1:max_day

# Assume epidemic growing at 10% per day
# Recent infections more likely in a growing epidemic
growing_epidemic <- exp(0.1 * time_x)
growing_epidemic_prob <- growing_epidemic / sum(growing_epidemic)

# Simulate onset times
sample_onsets <- 500
set.seed(10)
onset_times <- sample(time_x, size = sample_onsets, replace = TRUE,
                      prob = growing_epidemic_prob)
onset_times <- sort(onset_times)

# Sample onset-to-death times
onset_to_death_period_in <-
  epiparameter::epidist_db(disease = "covid", epi_dist = "onset to death",
                           single_epidist = TRUE) |>
  discretise()
death_times <- (generate(onset_to_death_period_in, sample_onsets) + onset_times) |>
  ceiling()

# Naive calculation for real-time onset to death on day 50
# I.e. higher bias, lower uncertainty method
death_times_rt <- death_times
death_times_rt[death_times_rt > max_day] <- NA # Remove truncated values not available in real time
onset_to_death_naive <- death_times_rt - onset_times
onset_to_death_naive <- onset_to_death_naive[!is.na(onset_to_death_naive)] # Remove NAs
# Note fit to continuous distribution for simplicity
param_naive <- fitdist(data = onset_to_death_naive, distr = "lnorm")

# Cohort calculation for real-time onset to death on day 50
# I.e. lower bias, higher uncertainty method
cut_off_range <- 20
death_times_rt <- death_times
death_times_rt[death_times_rt > max_day] <- NA # Remove truncated values not available in real time
death_times_rt[onset_times > (max_day - cut_off_range)] <- NA # Remove recent onsets (i.e. within 20 days of the current date)
onset_to_death_cohort <- death_times_rt - onset_times
onset_to_death_cohort <- onset_to_death_cohort[!is.na(onset_to_death_cohort)] # Remove NAs
param_cohort <- fitdist(data = onset_to_death_cohort, distr = "lnorm")

# Compare fits with the original distribution
x_val <- 0:40
y_val <- density(onset_to_death_period_in, x_val)
par(mar = c(4, 4, 1, 1), mgp = c(2, 0.7, 0), las = 0)
plot(x_val, y_val, xlab = "days", ylab = "probability", ylim = c(0, 0.1))
lines(x_val, dlnorm(x_val, meanlog = param_naive$estimate["meanlog"],
                    sdlog = param_naive$estimate["sdlog"]),
      col = "red", lwd = 2)
lines(x_val, dlnorm(x_val, meanlog = param_cohort$estimate["meanlog"],
                    sdlog = param_cohort$estimate["sdlog"]),
      col = "blue", lwd = 2)
```
There's some discussion of continuous vs discrete distributions here (for longer delays, it will be less of an issue to interchange them): https://github.com/epiverse-trace/cfr/issues/115
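As a quick visual check of that point, here is a sketch reusing the {epiparameter} helpers already used in the main example (assuming `density()` accepts the same arguments for the continuous object as for the discretised one):

```r
# Compare the continuous pdf with the discretised pmf for the same
# onset-to-death distribution: for a delay this long they are close
cont_dist <- epiparameter::epidist_db(
  disease = "covid", epi_dist = "onset to death", single_epidist = TRUE
)
x_val <- 0:40
plot(x_val, density(discretise(cont_dist), x_val),
     xlab = "days", ylab = "probability")
lines(x_val, density(cont_dist, x_val), col = "grey", lwd = 2)
```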