epiforecasts / EpiNow

Estimate Realtime Case Counts and Time-varying Epidemiological Parameters
https://epiforecasts.io/EpiNow

Strange error for some regions in regional rt pipeline #99

Closed BvdM0 closed 4 years ago

BvdM0 commented 4 years ago

Hi, I ran the regional Rt pipeline on UK case data and approximately a third of local authorities were missing from the results. I compared them with the successful districts, but the data seemed the same (no NAs, same classes). Trial and error on two failed districts showed that the problem was with the case count column ('confirm'), and specifically with one data point in each case (see the debugging steps below).

Any idea what could be behind this?

Thanks

NB. Full code is provided below, but running the pipeline on the full dataset takes >10 hours, so if you're trying to replicate my issue I'd suggest skipping everything from future::plan("multiprocess") to "Trying to find the issue".

casesUK <- read.csv(url("https://coronavirus.data.gov.uk/downloads/csv/coronavirus-cases_latest.csv"))
# Keep upper-tier local authorities only
casesUK <- subset(casesUK, casesUK$Area.type == "Upper tier local authority")
# Rename to the column names the pipeline expects
colnames(casesUK)[1] <- "region"
colnames(casesUK)[4] <- "date"
colnames(casesUK)[5] <- "confirm"
# Keep only the required columns, add an import status, and set column types
casesUK <- casesUK[, c("date", "confirm", "region")]
casesUK$import_status <- "local"
casesUK <- casesUK[, c("date", "confirm", "import_status", "region")]
casesUK$confirm <- as.integer(casesUK$confirm)
casesUK$date <- as.Date(casesUK$date)

library(EpiNow)
target_dir <- file.path(tempdir(), "test1")

delay_dist <- readRDS(url("https://github.com/epiforecasts/covid-global/blob/c373e86c9c196f1d244e9ab45c0cb4b3fb9b23f0/national/United%20Kingdom/latest/delays.rds?raw=true"))

future::plan("multiprocess")
regional_rt_pipeline(cases = casesUK,  # ran this at ~16.40 on 8th July and it was done by 06.50 the next morning
                     delay_defs = delay_dist,
                     target_folder = target_dir)

list.files(target_dir, recursive = TRUE)

regions <- data.frame(table(casesUK$region))
dates <- data.frame(table(casesUK$date))
most_recent <- as.Date(droplevels(dates[nrow(dates), "Var1"]))
r <- NULL
notFound <- NULL

# Collect the time-varying R estimates for every region, recording any
# region for which the pipeline produced no output
for (i in 1:nrow(regions)) {
  filename <- paste0(target_dir, "/", regions[i, "Var1"], "/", most_recent, "/", "time_varying_params.rds")
  if (file.exists(filename)) {
    time_varying_params <- readRDS(filename)
    localr <- time_varying_params$R0
    localr$region <- as.character(droplevels(regions[i, "Var1"]))
    r <- rbind(r, localr)
  } else {
    notFound <- rbind(notFound, as.character(droplevels(regions[i, "Var1"])))
  }
}

Save files

library(jsonlite)
saveRDS(r, "r.RDS")
write_json(r, "r.JSON")
write.csv(notFound, "notFound.csv")

I can see here that the script missed some local authorities (LAs), but all of them are picked up by notFound.

These can be run again

nrow(data.frame(table(r$region))) + nrow(notFound) - nrow(regions) # should be 0 if every region is accounted for

Run again, to pick up the missing files

dataNotFound <- casesUK[casesUK$region %in% notFound, ]
nrow(data.frame(table(dataNotFound$region))) - nrow(notFound) # should be 0 if every missing region has data

dNF_na <- dataNotFound[rowSums(is.na(dataNotFound)) > 0,]

table(dataNotFound$date)
table(r$date)
table(casesUK$date[casesUK$region == "Kingston Upon Hull, City of"])

Trying to find the issue

dataKent <- subset(casesUK, casesUK$region == "Kent") # Did work
dataBarnet <- subset(casesUK, casesUK$region == "Barnet") # Did work

dataSomerset <- subset(casesUK, casesUK$region == "Somerset") # Didn't work
dataCamden <- subset(casesUK, casesUK$region == "Camden") # Didn't work

I found that it's due to the confirm column:

dataKent <- dataKent[1:112, ] # Still working
dataKent$date <- dataSomerset$date # Still working
dataKent$confirm <- dataSomerset$confirm # STOPS working
dataSomerset$confirm <- dataKent$confirm # STARTS working
dataSomerset$confirm <- c(1:112) # STARTS working
dataSomerset$confirm <- c(0:61, 50:1) # STARTS working
dataSomerset <- subset(casesUK, casesUK$region == "Somerset")
cSom <- dataSomerset$confirm
dataSomerset$confirm <- cSom # Didn't work

Narrow the problem for Somerset

dataSomerset$confirm <- c(cSom[1:73], 1:39) # NOT working
dataSomerset$confirm <- c(cSom[1:72], 1:40) # Working
dataSomerset$confirm <- c(cSom[1:72], 14, 1:39) # NOT working
dataSomerset$confirm <- c(cSom[1:72], 14, cSom[74:112]) # NOT working
dataSomerset$confirm <- c(cSom[1:72], 73, cSom[74:112]) # Working
dataSomerset$confirm <- c(cSom[1:72], 72, cSom[74:112]) # Working
dataSomerset$confirm <- c(cSom[1:72], 40, cSom[74:112]) # Working
dataSomerset$confirm <- c(cSom[1:72], 39, cSom[74:112]) # NOT working
dataSomerset$confirm <- c(cSom[1:72], 1120, cSom[74:112]) # Working

The count for 18th April was 14, but it needs to be 40 or above for the run to work.

Repeat for Camden

cCam <- dataCamden$confirm
dataCamden$confirm <- cCam # Didn't work
dataCamden$confirm <- c(cCam[1:61], 1:40) # Working
dataCamden$confirm <- c(cCam[1:62], 1:39) # NOT working
dataCamden$confirm <- c(cCam[1:62], 13, cCam[64:101]) # NOT working
dataCamden$confirm <- c(cCam[1:62], 40, cCam[64:101]) # Working
dataCamden$confirm <- c(cCam[1:62], 39, cCam[64:101]) # NOT working

The count for 13th April was 13, but again it needs to be 40 or above.

kathsherratt commented 4 years ago

Hi there,

This sounds like interesting work. It looks like this has happened because we have an internal minimum case threshold to estimate Rt: the model won't run if there are <40 cases on a single day. The reason we have the threshold is that estimates from low case counts are extremely uncertain.
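As a rough sanity check (this is just base R, not part of EpiNow itself), you could list the regions in your casesUK data that never reach 40 confirmed cases on any single day; assuming the threshold behaves as described above, these should correspond to your notFound list:

max_daily <- aggregate(confirm ~ region, data = casesUK, FUN = max) # peak single-day count per region
as.character(max_daily$region[max_daily$confirm < 40]) # regions that never hit the threshold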

However you can change this by including the case_limit argument in the pipeline function:

regional_rt_pipeline(cases = casesUK,
                     delay_defs = delay_dist,
                     target_folder = target_dir,
                     case_limit = 0 # Set to 0 to get results for all regions regardless of case count
                     )

To test this with a quicker run time, I'd suggest limiting the data to just one or two regions, and/or shortening the time series so you don't use the entire dataset.
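For example, something like this (just a sketch: the two regions and the cut-off date are arbitrary, so adjust to taste):

casesTest <- subset(casesUK, region %in% c("Kent", "Somerset")) # just two regions
casesTest <- subset(casesTest, date >= as.Date("2020-05-01")) # and a shorter time window
regional_rt_pipeline(cases = casesTest,
                     delay_defs = delay_dist,
                     target_folder = target_dir,
                     case_limit = 0)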

Hope this helps: let me know if not. We'd be interested in hearing how this goes if you make any of your results public.

Best, Kath

BvdM0 commented 4 years ago

Thanks, that did solve it! Strangely, the model was running fine with most of the days being <40 new cases - it was just a particular day that seemed to matter.

On the case limit: I'm planning on using these estimates for journalistic purposes, so I don't want to mislead. Is the additional uncertainty for estimates from small case counts priced into the existing confidence intervals? In other words, is it fine to report them so long as the confidence intervals are also reported?

I will of course let you know and cite you if these results are made public!

Best, Ben

kathsherratt commented 4 years ago

Great, glad that worked Ben.

For reporting, the real problem is that the method breaks down with low case counts. One feature of this is the huge uncertainty range, which, as you suggest, is reported in the confidence intervals. Beyond that, though, these estimates are simply much more unstable, and it gets harder to assess whether the model is correctly calibrated. This means we have no way to tell whether even the estimated uncertainty (however wide) is a good representation of the true uncertainty.

Because of this, we don't report regions with case counts <40 on our website (e.g. many regions of Afghanistan). In general, I'd say that showing estimates from areas with low case counts on a par with estimates from high case counts wouldn't properly reflect that the latter are far more reliable than the former.
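If you do publish region-level estimates, one option (again just a sketch, reusing your casesUK data and the combined r data frame from your script, and the same 40-cases-on-a-day criterion) is to flag the estimates that come from sparse data rather than presenting everything side by side:

peak_daily <- aggregate(confirm ~ region, data = casesUK, FUN = max)
low_count_regions <- as.character(peak_daily$region[peak_daily$confirm < 40])
r$low_count <- r$region %in% low_count_regions # flag so these estimates can be caveated or shown separately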

Do just say if there is anything else that could be clearer - we are looking for ways to improve how we communicate our methods (and results)!

Kath

Copying in @seabbs and @sbfnk after discussing case limits together.

BvdM0 commented 4 years ago

Thank you, that's really helpful!