epiforecasts / EpiNow2

Estimate Realtime Case Counts and Time-varying Epidemiological Parameters
https://epiforecasts.io/EpiNow2/dev/

`pathfinder` fails for large case reports #728

Open jamesmbaazam opened 1 month ago

jamesmbaazam commented 1 month ago

In my bid to get pathfinder running for the vignette in #695, I realised that it was quite likely failing because of the extremely large case reports in the simulated data (approx. 40M at the peak), so I ran a few data scenarios to see how the method performs across different case magnitudes. My suspicion has so far been confirmed: the method performs well with case numbers between 100 and 10K, but starts to fail beyond that. Note also that the credible intervals get narrower as the data magnitude increases. I am wondering why this is the case from a theoretical perspective, and whether there is a way to improve the method to handle such large case numbers.

The log is quite rich, so I've attached it here for further discussion. pathfinder_scenarios_log.txt

library(EpiNow2)
#> 
#> Attaching package: 'EpiNow2'
#> The following object is masked from 'package:stats':
#> 
#>     Gamma
library(data.table)
library(purrr)
#> 
#> Attaching package: 'purrr'
#> The following object is masked from 'package:data.table':
#> 
#>     transpose
library(ggplot2)

# Function to simulate data scenarios: a Gaussian epidemic curve peaking at
# `peak_day`, with height `max_cases` and width `spread`
simulate_epidemic <- function(days, date_start, peak_day, max_cases, spread, seed) {
  set.seed(seed)
  # Calculate the expected cases for each day from the Gaussian curve
  days_vec <- seq_len(days)
  cases <- max_cases * exp(-((days_vec - peak_day)^2) / (2 * spread^2))

  # Create a data frame for the result
  epidemic_curve <- data.frame(
    date = as.Date(date_start) + days_vec,
    confirm = ceiling(cases)
  )
  return(epidemic_curve)
}

# Peak case counts for the epidemic scenarios (100 to 10M)
peaks <- c(1e2, 1e3, 1e4, 1e5, 1e6, 1e7)

# Create version of epinow that always succeeds
safe_epinow <- purrr::safely(epinow)
# Define epinow() inputs
## Generation time
generation_time <- Gamma(
  shape = Normal(1.3, 0.3),
  rate = Normal(0.37, 0.09),
  max = 14
)
# Run the simulations
pf_results <- lapply(peaks, function(peak) {
  test_data <- simulate_epidemic(
    seed = 123,
    days = 365,
    date_start = "2020-02-22",
    peak_day = 125,
    max_cases = peak,
    spread = 20
  )

  safe_epinow(
    data = as.data.table(test_data)[confirm > 1], # only the data is changing
    generation_time = generation_time_opts(generation_time),
    stan = stan_opts(method = "pathfinder", backend = "cmdstanr"),
    verbose = FALSE,
    logs = NULL
  )
})
# Annotate the results
names(pf_results) <- paste0("peak_", peaks)

# Scenarios that failed will return TRUE
sapply(pf_results, function(x) is.null(x$result))
#>   peak_100  peak_1000 peak_10000 peak_1e+05 peak_1e+06 peak_1e+07 
#>      FALSE      FALSE      FALSE       TRUE       TRUE       TRUE

# Plot those that succeeded
sapply(pf_results, function(x) {
  if (!is.null(x$result)) {
    x$result$plots$summary
  }
})
#> $peak_100
#> (summary plot)
#> 
#> $peak_1000
#> (summary plot)
#> 
#> $peak_10000
#> (summary plot)
#> 
#> $`peak_1e+05`
#> NULL
#> 
#> $`peak_1e+06`
#> NULL
#> 
#> $`peak_1e+07`
#> NULL

Created on 2024-07-24 with reprex v2.1.1

seabbs commented 3 weeks ago

Oh interesting. I assume this is because of numerical overflow for very large numbers (i.e. because the prior distribution is a multiplier on case values, an incidence of 40 million might allow some prior values to reach 10^3 * 40 million). I would guess we could improve this by improving the prior distribution
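
As a rough illustration of the scale involved (not EpiNow2's internals; the 10^3 multiplier is just the illustrative value above): count draws at that magnitude already break base R's Poisson RNG, because the mean exceeds what fits in an integer.

peak_cases <- 4e7        # approx. peak of the simulated data
prior_multiplier <- 1e3  # illustrative extreme prior draw
lambda <- prior_multiplier * peak_cases  # latent mean of 4e10

# Poisson draws with a mean this large exceed .Machine$integer.max,
# so they come back as NA
rpois(1, lambda)
#> Warning in rpois(1, lambda): NAs produced
#> [1] NA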

seabbs commented 3 weeks ago

or using a different default model that enforces some concept of population size.
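
A minimal sketch of that idea (not EpiNow2 code; `N`, `R0`, and the generation interval are all made-up illustrative values): scale transmission in a renewal equation by the remaining susceptible fraction, so incidence cannot grow past the population size.

# Toy renewal model with susceptible depletion: infections are scaled by the
# fraction of the population still susceptible, so cumulative incidence
# saturates near N instead of growing without bound
N <- 1e6                                 # assumed population size
R0 <- 2                                  # assumed basic reproduction number
gi <- dgamma(1:14, shape = 4, rate = 1)  # illustrative generation interval
gi <- gi / sum(gi)
cases <- c(10, numeric(364))
for (t in 2:365) {
  lags <- seq_len(min(t - 1, length(gi)))
  infectiousness <- sum(cases[t - lags] * gi[lags])
  s_t <- max(0, 1 - sum(cases[seq_len(t - 1)]) / N)  # susceptible fraction
  cases[t] <- R0 * s_t * infectiousness
}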

seabbs commented 3 weeks ago

Did you look at whether NUTS has the same scaling problems? I would guess it does, but maybe not? If not, that is interesting

jamesmbaazam commented 1 week ago

Sorry, I missed this message. I'll try with NUTS and see.
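
Something like this, I think (the same scenario loop as above, assuming those objects are still in scope; only the `stan_opts()` method changes, to the default "sampling", i.e. NUTS):

# Rerun the scenarios with NUTS instead of pathfinder
nuts_results <- lapply(peaks, function(peak) {
  test_data <- simulate_epidemic(
    seed = 123, days = 365, date_start = "2020-02-22",
    peak_day = 125, max_cases = peak, spread = 20
  )
  safe_epinow(
    data = as.data.table(test_data)[confirm > 1],
    generation_time = generation_time_opts(generation_time),
    stan = stan_opts(method = "sampling", backend = "cmdstanr"),
    verbose = FALSE,
    logs = NULL
  )
})
# Scenarios that failed will return TRUE
sapply(nuts_results, function(x) is.null(x$result))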