drop all counts with unknown outcome to estimate cfr from linelist data

avallecam commented 1 year ago

Following Ghani, 2005 and Lipsitch, 2015, when estimating CFR from linelist data it is suggested to include deaths and recoveries but exclude all cases with unknown outcomes.

This step still needs to exclude all unknown outcomes:

https://github.com/epiverse-trace/cfr/blob/66bfa793e6d2a6f51f8de5a1ad6056b74c54350a/vignettes/data_from_incidence2.Rmd#L70-L76

This can be solved by adding one more filter to this step:

ebola <- ebola %>%
  # drop [unknown outcomes]
  # to correct for right-censoring bias in denominator
  filter(!is.na(outcome)) %>%
  # keep [known outcomes]
  # - for "death" events, only the date of "outcome" [deaths]
  # - for all events, the date of "onset" [cases]
  filter(
    (outcome == "Death" & count_variable == "outcome") |
      (count_variable == "onset")
  )

I agree with making the drop and keep steps explicit for linelist data at this step.
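As a minimal sketch of what the extra filter changes, the base R snippet below applies both versions to a toy table. The table, its counts, and the object names are hypothetical; it assumes that cases with unknown outcome appear as rows with `outcome = NA` in the grouped incidence table, and it uses `which()` to mimic how `dplyr::filter()` drops rows where the condition evaluates to `NA`.

```r
# Toy stand-in for the grouped incidence table (hypothetical counts)
toy <- data.frame(
  count_variable = c("onset", "onset", "onset", "outcome", "outcome", "outcome"),
  outcome = c("Death", "Recover", NA, "Death", "Recover", NA),
  count = c(10, 8, 6, 9, 7, 2)
)

# One-step filter: onset rows with unknown outcome are kept,
# because (NA & FALSE) | TRUE evaluates to TRUE
one_step <- toy[which(
  (toy$outcome == "Death" & toy$count_variable == "outcome") |
    toy$count_variable == "onset"
), ]

# Two-step filter: drop unknown outcomes first, then apply the same condition
known <- toy[!is.na(toy$outcome), ]
two_step <- known[which(
  (known$outcome == "Death" & known$count_variable == "outcome") |
    known$count_variable == "onset"
), ]

# Case counts (the denominator) differ; death counts do not
cases_one_step <- sum(one_step$count[one_step$count_variable == "onset"])
cases_two_step <- sum(two_step$count[two_step$count_variable == "onset"])
deaths_one_step <- sum(one_step$count[one_step$count_variable == "outcome"])
deaths_two_step <- sum(two_step$count[two_step$count_variable == "outcome"])
```

In this toy table the one-step filter counts 24 cases and the two-step filter counts 18, with 9 deaths in both, so only the denominator changes.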

Here is a reprex to compare against the current vignette output.

# Load packages
library(tidyverse)
library(outbreaks)
library(incidence2)
library(epiparameter)
library(cfr)

# load ebola dataset from {outbreaks}
data("ebola_sim_clean")
ebola <- ebola_sim_clean$linelist

# view ebola
head(ebola)
#>   case_id generation date_of_infection date_of_onset date_of_hospitalisation
#> 1  d1fafd          0              <NA>    2014-04-07              2014-04-17
#> 2  53371b          1        2014-04-09    2014-04-15              2014-04-20
#> 3  f5c3d8          1        2014-04-18    2014-04-21              2014-04-25
#> 4  6c286a          2              <NA>    2014-04-27              2014-04-27
#> 5  0f58c4          2        2014-04-22    2014-04-26              2014-04-29
#> 6  49731d          0        2014-03-19    2014-04-25              2014-05-02
#>   date_of_outcome outcome gender           hospital       lon      lat
#> 1      2014-04-19    <NA>      f  Military Hospital -13.21799 8.473514
#> 2            <NA>    <NA>      m Connaught Hospital -13.21491 8.464927
#> 3      2014-04-30 Recover      f              other -13.22804 8.483356
#> 4      2014-05-07   Death      f               <NA> -13.23112 8.464776
#> 5      2014-05-17 Recover      f              other -13.21016 8.452143
#> 6      2014-05-07    <NA>      f               <NA> -13.23443 8.468572

# create incidence2 object of onset and outcome counts, grouped by outcome
ebola <- incidence(
  x = ebola,
  date_index = c(
    onset = "date_of_onset",
    outcome = "date_of_outcome"
  ),
  groups = "outcome"
)

# filter for outcomes that are deaths using dplyr::filter --- death counts
# also filter for all onsets --- these are the case counts

# [blocked] instead of this one-step filter:

# ebola <- filter(
#   ebola,
#   (outcome == "Death" & count_variable == "outcome") |
#     (count_variable == "onset")
# )

# [replaced by] this two-step filter:

ebola <- ebola %>%
  # drop [unknown outcomes]
  # to correct for right-censoring bias in denominator
  filter(!is.na(outcome)) %>%
  # keep [known outcomes]
  # - for "death" events, only the date of "outcome" [deaths]
  # - for all events, the date of "onset" [cases]
  filter(
    (outcome == "Death" & count_variable == "outcome") |
      (count_variable == "onset")
  )

# remove groups using incidence2::regroup()
ebola <- regroup(ebola)

# prepare data
ebola <- prepare_data(
  ebola,
  cases_variable = "onset",
  deaths_variable = "outcome",
  fill_NA = TRUE
)

# get the onset-to-death delay distribution from {epiparameter}
onset_to_death_ebola <- epidist_db(
  disease = "Ebola Virus Disease",
  epi_dist = "onset_to_death",
  author = "Barry_etal"
)
#> Using Barry et al. (2018) <10.1016/S0140-6736(18)31387-4> PMID: 30047375. 
#> To retrieve the short citation use the 'get_citation' function

# estimate static CFR as a sanity check
estimate_static(
  ebola,
  correct_for_delays = TRUE, 
  epidist = onset_to_death_ebola
)
#>   severity_me severity_lo severity_hi
#> 1       0.475       0.455       0.495

Created on 2023-08-22 with reprex v2.0.2

adamkucharski commented 1 year ago

I don't think removal of unknowns is appropriate if we're calculating CFRs based on incidence data (with the linelist data only processed as a step to generate this incidence). The Ghani et al. paper is based on individual outcomes via survival analysis approaches, so outcome type does matter (see also issue #7). The incidence-based analysis, by contrast, adjusts for as-yet-unknown outcomes in real time, so removing these unknowns would probably bias estimates (as it seems to in the estimate_static() calculation above).
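To illustrate the idea of adjusting for as-yet-unknown outcomes, here is a rough base R sketch. It is not the {cfr} implementation: the daily counts, the death total, and the gamma delay parameters are all hypothetical, and the adjustment is simplified to weighting each day's cases by the probability that a fatal outcome would already have been observed.

```r
# Hypothetical daily case counts by onset date (most recent day last)
cases <- c(2, 5, 9, 14, 20, 26, 30)

# Hypothetical cumulative deaths reported so far
deaths_total <- 6

# Days elapsed since onset for each day's cases, counting the onset day as 1
days_elapsed <- rev(seq_along(cases))

# Hypothetical onset-to-death delay: gamma with shape 2 and scale 4 days;
# pgamma() gives the probability that a fatal outcome is already known
p_outcome_known <- pgamma(days_elapsed, shape = 2, scale = 4)

# Expected number of cases whose outcome would be known by now
expected_known <- sum(cases * p_outcome_known)

# Naive estimate uses all cases; adjusted estimate uses expected known outcomes
naive_cfr <- deaths_total / sum(cases)
adjusted_cfr <- deaths_total / expected_known
```

Under these assumptions all cases stay in the data, and the delay distribution, rather than a filter on recorded outcomes, determines how much each case contributes to the denominator.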

Carmen and I put together some case studies that hopefully illustrate the linelist estimation vs incidence estimation issues more clearly: https://github.com/CarmenTamayo/Applications-Epiverse-pipelines/blob/ak-edits/Marburg_underreporting.Rmd

pratikunterwegs commented 1 year ago

Thanks @avallecam and @adamkucharski - just to clarify, is this a feature we need to add in some way? We do already allow replacing NA with zeros in prepare_data.incidence2(). My understanding is we're currently okay as things are?

pratikunterwegs commented 6 months ago

Closing this as {cfr} is not intended to work with linelist data. Users can/should convert their linelists to incidence data before using it with {cfr}.

drop all counts with unknown outcome to estimate cfr from linelist data #79