epiforecasts / inc2prev

Estimate epidemiological quantities from repeated cross-sectional prevalence measurements
http://epiforecasts.io/inc2prev/
BSD 3-Clause "New" or "Revised" License

Output results for full duration of the pandemic? #43

Closed · tomwenseleers closed this 1 year ago

tomwenseleers commented 1 year ago

Thanks for the nice package! Was just wondering if it would be a lot of hassle to output the results for the full duration of the pandemic in https://github.com/epiforecasts/inc2prev/tree/master/outputs? Just asking because I was looking for a good source of estimated new infections per day to map variant lineage frequencies on, and this would make a very good source...

sbfnk commented 1 year ago

The estimates in that directory actually start in April 2020 (unlike the plots in the report, which only show the last 12 months of estimates), which is when the community infection survey started - we don't have anything from before then, I'm afraid.

seabbs commented 1 year ago

Thanks!

Just a fly-by to mention that there are limitations to using these estimates over such a long time span; they are listed in full in the preprint.

tomwenseleers commented 1 year ago

Many thanks! Obviously I should have looked more carefully at https://raw.githubusercontent.com/epiforecasts/inc2prev/master/outputs/estimates_national.csv - I had assumed it was clipped to the same date range as the plots. So I'll close this again. :-)

Any thoughts on what might be the most reasonable/informed approach for estimating infections before April 2020? Is there anything in the public domain that you know of? I see the IHME infection estimates for England start on the 1st of March, so maybe I can use those for March 2020? Are they any good, do you think? Or anything else in the public domain?

```r
library(readr)
library(dplyr)

ihme <- bind_rows(
  read_csv("https://ihmecovid19storage.blob.core.windows.net/archive/2022-12-16/data_download_file_reference_2020.csv"),
  read_csv("https://ihmecovid19storage.blob.core.windows.net/archive/2022-12-16/data_download_file_reference_2021.csv"),
  read_csv("https://ihmecovid19storage.blob.core.windows.net/archive/2022-12-16/data_download_file_reference_2022.csv"),
  read_csv("https://ihmecovid19storage.blob.core.windows.net/archive/2022-12-16/data_download_file_reference_2023.csv")
)
```

Ha, and a somewhat unrelated question: I was looking for a good way to convert wastewater SARS-CoV-2 RNA concentrations through time to incidence for the US (using the Biobot data, https://github.com/biobotanalytics/covid19-wastewater-data). Am I correct that that just comes down to an appropriately scaled deconvolution of the wastewater RNA concentration with the shedding load distribution, as in https://ehp.niehs.nih.gov/doi/10.1289/EHP10050? (For deconvolution I was using a nonnegative weighted least squares regression, with time-shifted copies of the convolution kernel as the covariate matrix and a fused-ridge penalty on 1st and 2nd order finite differences, to get a nice smooth curve back.)
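For concreteness, a minimal sketch of the approach I mean (assuming a daily concentration series `conc` and a discretised shedding load distribution `kernel` that sums to 1; I've left out the observation weights for brevity):

```r
# Sketch: deconvolve a wastewater RNA concentration series into relative
# daily incidence by penalised nonnegative least squares.
library(nnls)

deconvolve <- function(conc, kernel, lambda1 = 10, lambda2 = 10) {
  n <- length(conc)
  k <- length(kernel)
  # Design matrix: column j holds the shedding kernel shifted to start on
  # day j, so conc ~ X %*% incidence is the forward convolution.
  X <- matrix(0, n, n)
  for (j in seq_len(n)) {
    idx <- j:min(n, j + k - 1)
    X[idx, j] <- kernel[seq_along(idx)]
  }
  # Fused-ridge smoothness penalties: 1st and 2nd order difference matrices.
  D1 <- diff(diag(n), differences = 1)
  D2 <- diff(diag(n), differences = 2)
  # Augmented least squares: minimise
  # ||conc - X b||^2 + lambda1 ||D1 b||^2 + lambda2 ||D2 b||^2 with b >= 0.
  A <- rbind(X, sqrt(lambda1) * D1, sqrt(lambda2) * D2)
  y <- c(conc, rep(0, nrow(D1) + nrow(D2)))
  nnls(A, y)$x  # estimated (relative) daily incidence
}
```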

seabbs commented 1 year ago

I'm not sure what to use for March 2020. I haven't kept up but would just assume most estimates are quite rough.

Am I correct that that just comes down to an appropriately scaled deconvolution of the wastewater RNA concentration with the shedding load distribution

I think "just" is doing a lot of work there (😆), but a simple model would indeed be just that (in that paper I believe they reformulate things to fit into their case $R_t$ pipeline, which seems a little hacky). Obviously, there is a lot more going on than that captures, and I assume things like rainfall lead to issues.

Method itself sounds quite solid to me.
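For what it's worth, the "simple model" version is just a scaled convolution of incidence with the shedding load distribution. A minimal sketch, with all inputs hypothetical (`inc` daily incidence, `kernel` the shedding load distribution, `scale` lumping together RNA shed per infection, dilution, etc.):

```r
# Sketch of the simple forward model: expected RNA concentration is a
# scaled convolution of past incidence with the shedding load distribution.
expected_conc <- function(inc, kernel, scale = 1) {
  n <- length(inc)
  sapply(seq_len(n), function(t) {
    tau <- seq_len(min(t, length(kernel)))       # lags since infection
    scale * sum(kernel[tau] * inc[t - tau + 1])  # convolution up to day t
  })
}
```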

sbfnk commented 1 year ago

Any thoughts on what might be the most reasonable/informed approach for estimating infections before April 2020?

If it's the cumulative total by April that you're interested in, you could use our own estimates, which are based on serological data from blood donors at the time, i.e. the first data point here and in similarly named files. You might even be able to use hospitalisation data to infer an infection trajectory up to then, but given the scarcity of high-quality data from the time, whichever model you use will have to do most of the work.
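As a back-of-the-envelope sketch (hypothetical numbers; this ignores test sensitivity/specificity, antibody waning and the gap between blood donors and the general population):

```r
# Back-of-the-envelope: scale a seroprevalence estimate to cumulative
# infections. The numbers below are hypothetical placeholders.
sero_prev <- 0.05   # hypothetical: proportion seropositive by April 2020
population <- 56e6  # approximate population of England
cum_infections <- sero_prev * population
```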