Remove normal approximation

adamkucharski commented 4 months ago

Please check if the PR fulfills these requirements
[X] I have read the CONTRIBUTING guidelines
[X] A new item has been added to NEWS.md
[x] Tests for the changes have been added (for bug fixes / features)
[X] Docs have been added / updated (for bug fixes / features)
[x] Checks have been run locally and pass
What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

This addresses issues #152 and #151

What is the current behavior? (You can also link to an open issue here)

Instability with normal approximation in Ebola example

What is the new behavior (if this is a feature change)?

Normal approximation removed. This PR also updates tests for consistency with the removed functionality.

There are two additional changes:

Example in the README now focuses on the early outbreak stage, where the CFR bias is greater and hence the importance of {cfr} functionality is more clearly illustrated.
When we have the condition expected outcomes to date < total deaths, we now return NA for all estimates, to make it clearer to the user that this would not be a statistically valid calculation.
Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

No.

adamkucharski commented 4 months ago

Note: need to review the test functions to ensure consistency with updated messages and outputs.

Bisaloo commented 4 months ago

Could you run styler::style_pkg() in the folder containing the package please? This will ensure consistency in the indentation and make it easier to read & review the code.

adamkucharski commented 3 months ago

I've run styler::style_pkg() (latest commit). Is this something we should also incorporate into this issue on PR guidelines? Would it also make easier for users to pass the linter checks?

avallecam commented 3 months ago

sorry for the delayed reply. Just created this reprex to reproduce the README figure.

This PR successfully solved the issue of the point estimate and confidence interval. I would only add that this reprex also diagnoses that there are sections of the time series that do not generate an output, generating no estimates for certain date ranges, which are visible in the plot.

# pak::pak("epiverse-trace/cfr@remove-normal")

# Load package
library(cfr)
library(ggplot2)

# Calculate the static CFR without correcting for delays
cfr_static(data = ebola1976)
#>   severity_estimate severity_low severity_high
#> 1          0.955102    0.9210866     0.9773771

# Calculate the CFR without correcting for delays on each day of the outbreak
rolling_cfr_naive <- cfr_rolling(
  data = ebola1976
)
#> `cfr_rolling()` is a convenience function to help understand how additional data influences the overall (static) severity. Use `cfr_time_varying()` instead to estimate severity changes over the course of the outbreak.

# Calculate the rolling daily CFR while correcting for delays
rolling_cfr_corrected <- cfr_rolling(
  data = ebola1976,
  delay_density = function(x) dgamma(x, shape = 2.40, scale = 3.33)
)
#> `cfr_rolling()` is a convenience function to help understand how additional data influences the overall (static) severity. Use `cfr_time_varying()` instead to estimate severity changes over the course of the outbreak.
#> Some daily ratios of total deaths to total cases with known outcome are below 0.01%: some CFR estimates may be unreliable.FALSE

# combine the data for plotting
rolling_cfr_naive$method <- "naive"
rolling_cfr_corrected$method <- "corrected"

data_cfr <- rbind(
  rolling_cfr_naive,
  rolling_cfr_corrected
)

# visualise both corrected and uncorrected rolling estimates
ggplot(data_cfr) +
  geom_ribbon(
    aes(
      date,
      ymin = severity_low, ymax = severity_high,
      fill = method
    ),
    alpha = 0.2, show.legend = FALSE
  ) +
  geom_line(
    aes(date, severity_estimate, colour = method)
  ) +
  scale_colour_brewer(
    palette = "Dark2",
    labels = c("Corrected CFR", "Naive CFR"),
    name = NULL
  ) +
  scale_fill_brewer(
    palette = "Dark2"
  )
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_line()`).

^{Created on 2024-07-11 with reprex v2.1.0}

adamkucharski commented 3 months ago

Thanks for looking at this. Currently we have some situations where E(known outcomes) < deaths, and hence the likelihood as currently implemented isn't totally valid. One option would be to set E(known outcomes) = deaths in this situation (which some earlier versions of the code did). But thinking more, seemed clearest to return NA in this situation. I've outlined a potential longer-term solution in issue #154 (but this would require some more thought and potentially quite a lot of refactoring).

avallecam commented 3 months ago

That is interesting; thank you for adding an explicit explanation. In the meantime, would it be valid to get the message as a warning and, instead of NA, provide the latest or most recent estimated output at a given date?

adamkucharski commented 3 months ago

It would require quite a lot of additional refactoring to output the last valid estimate, as the README example is a showcase of multiple estimates at each point in time in a visualisation (i.e. a loop over cfr_static() ), rather than the output user will commonly interact with (i.e. a numerical estimate).

The current message is displayed above (e.g. Total deaths = 140 and expected outcomes = 134 so setting expected outcomes = NA. If we were to assume total deaths = expected outcomes, it would produce an estimate of 1.) But could edit if there's a better option?

avallecam commented 3 months ago

The current message is displayed above (e.g. Total deaths = 140 and expected outcomes = 134 so setting expected outcomes = NA. If we were to assume total deaths = expected outcomes, it would produce an estimate of 1.) But could edit if there's a better option?

The current message displayed is appropriate and specific. This reflects that this is produced in an extreme scenario, as described in #154. Given that this PR already solved the key issue, I'll move on with the approval.

As a complementary comment, I suggest adding to the message an explicit next step for the user. If, in an ongoing outbreak, we create a reproducible sitrep and suddenly get an NA result, then we can have a companion plot from cfr_rolling() to have the full view.

We could add sth like:

Use `cfr_rolling()` to understand how additional data influences the overall (static) severity.

adamkucharski commented 3 months ago

Thanks, have merged and will create an issue with the above rolling suggestion.

epiverse-trace / cfr

Remove normal approximation #153