Closed BvdM0 closed 4 years ago
Hi there,
This sounds like interesting work. It looks like this has happened because we have an internal minimum case threshold to estimate Rt: the model won't run if there are <40 cases on a single day. The reason we have the threshold is that estimates from low case counts are extremely uncertain.
However you can change this by including the case_limit
argument in the pipeline function:
regional_rt_pipeline(cases = casesUK,
delay_defs = delay_dist,
target_folder = target_dir,
case_limit = 0 # Set to 0 to get results for all regions regardless of case count
)
To test this with a quicker run time, I'd suggest limiting to just one or two regions, and/or limiting the time-series so you don't use the entire data.
Hope this helps: let me know if not. We'd be interested in hearing how this goes if you make any of your results public.
Best, Kath
Thanks, that did solve it! Strangely, the model was running fine with most of the days being <40 new cases - it was just a particular day that seemed to matter.
On the case limit: I'm planning on using these estimates for journalistic purposes, so I don't want to mislead. Is the additional uncertainty for estimates from small case counts priced into the existing confidence intervals? In other words, is it fine to report them so long as the confidence intervals are also reported?
I will of course let you know and cite you if these results are made public!
Best, Ben
Great, glad that worked Ben.
For reporting, the real problem is that the method breaks down with low case counts. A feature of this is the huge uncertainty range, which as you suggest, is reported in the confidence intervals. More than that, though, these estimates are just much more unstable, while it gets harder to assess if the model is correctly calibrated. This means we have no way to tell if even the estimated uncertainty (however wide) is a good representation of the true uncertainty.
Because of this, we don't report regions with case counts <40 on our website (e.g. many regions of Afghanistan). In general, I'd say that showing estimates from areas with low case counts on a par with estimates from high case counts wouldn't properly reflect that the latter are far more reliable than the former.
Do just say if there is anything else that could be clearer - we are looking for ways to improve how we communicate our methods (and results)!
Kath Copying in @seabbs and @sbfnk after discussing case limits together.
Thank you, that's really helpful!
Hi, I ran the rt pipeline on UK case data and approximately 1/3rd of local authorities were missing from the results. I tried comparing them to the successful districts, but the data seemed the same (no NAs, same classes). Trial and error on two failed districts showed that it was a problem with the case count column ('confirm'), and specifically a problem with one datapoint in each case:
For Somerset, the 18th April recorded 14 cases. The Somerset data was only able to be passed through the pipeline successfully if this datapoint (18th April) was changed to be any number >=40 (i.e. 39 failed, 9999 succeeded)
For Camden, the 13th April recorded 13 cases. Again, changing this to be >=40 was the only way to pass it through the pipeline successfully.
Any idea what could be behind this?
Thanks
NB. Full code provided below, but running the pipeline on the full dataset takes >10 hours so if you're trying to replicate my issue I'd suggest skipping everything from "future::plan("multiprocess")" to "Trying to find the issue"
Save files
library(jsonlite) saveRDS(r,"r.RDS") write_json(r,"r.JSON") write.csv(notFound,"notFound.csv")
I can see here that the script missed some LAs, but all are picked up by notFound
These can be run again
nrow(data.frame(table(r$region))) + nrow(notFound) - nrow(regions)
Run again, to pick up the missing files
dataNotFound <- casesUK[casesUK$region %in% notFound,] nrow(data.frame(table(dataNotFound$region))) - nrow(notFound)
dNF_na <- dataNotFound[rowSums(is.na(dataNotFound)) > 0,]
table(dataNotFound$date) table(r$date) table(casesUK$date[casesUK$region == "Kingston Upon Hull, City of"])
Trying to find the issue
dataKent <- subset(casesUK,casesUK$region == "Kent") # Did work dataBarnet <- subset(casesUK,casesUK$region == "Barnet") # Did work
dataSomerset1 <- subset(casesUK,casesUK$region == "Somerset") # Didn't work dataCamden <- subset(casesUK,casesUK$region == "Camden") # Didn't work
I find it's due to confirm column:
dataKent <- dataKent[1:112,] # Still working dataKent$date <- dataSomerset$date # Still working dataKent$confirm <- dataSomerset$confirm # STOPS working dataSomerset$confirm <- dataKent$confirm # STARTS working dataSomerset$confirm <- c(1:112) # STARTS working dataSomerset$confirm <- c(0:61,50:1) # STARTS working dataSomerset <- subset(casesUK,casesUK$region == "Somerset") cSom <- dataSomerset$confirm dataSomerset$confirm <- cSom # Didn't work
Narrow the problem for Somerset
dataSomerset$confirm <- c(cSom[1:73],1:39) # NOT working dataSomerset$confirm <- c(cSom[1:72],1:40) # Working dataSomerset$confirm <- c(cSom[1:72],14,1:39) # Not working dataSomerset$confirm <- c(cSom[1:72],14,cSom[74:112]) # Not working dataSomerset$confirm <- c(cSom[1:72],73,cSom[74:112]) # Working dataSomerset$confirm <- c(cSom[1:72],72,cSom[74:112]) # Working dataSomerset$confirm <- c(cSom[1:72],40,cSom[74:112]) # Working dataSomerset$confirm <- c(cSom[1:72],39,cSom[74:112]) # NOT working dataSomerset$confirm <- c(cSom[1:72],1120,cSom[74:112]) # Working
18th April was 14 but needs to be 40 or above
Repeat for Camden
cCam <- dataCamden$confirm dataCamden$confirm <- cCam # Didn't work dataCamden$confirm <- c(cCam[1:61],1:40) # Working dataCamden$confirm <- c(cCam[1:62],1:39) # NOT working dataCamden$confirm <- c(cCam[1:62],13,cCam[64:101]) # NOT working dataCamden$confirm <- c(cCam[1:62],40,cCam[64:101]) # Working dataCamden$confirm <- c(cCam[1:62],39,cCam[64:101]) # Not working
13th April is 13 but needs to be above 40`