epiforecasts / evaluate-delta-for-forecasting

Evaluating the impact of modelling strain dynamics on short-term COVID-19 forecast performance

Aggregation of notification data #17

Open seabbs opened 2 years ago

seabbs commented 2 years ago

Case and sequence notification data is available at different levels of aggregation.

Case data is generally available daily, but when used as a target for the forecasting hubs it is aggregated to the weekly level for weeks ending on Saturday (i.e. backwards-looking aggregates).
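As a minimal sketch of that weekly aggregation (not the code used in this repository), the snippet below sums daily counts into Sunday-to-Saturday weeks labelled by their closing Saturday. The function names `week_ending_saturday` and `aggregate_weekly` and the input shape (a dict of date to count) are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_ending_saturday(day: date) -> date:
    """Return the Saturday that closes the Sunday-to-Saturday week containing `day`."""
    # date.weekday(): Monday = 0, ..., Saturday = 5, Sunday = 6
    days_ahead = (5 - day.weekday()) % 7
    return day + timedelta(days=days_ahead)


def aggregate_weekly(daily_cases: dict) -> dict:
    """Sum daily counts into weeks labelled by their closing Saturday."""
    weekly = defaultdict(int)
    for day, count in daily_cases.items():
        weekly[week_ending_saturday(day)] += count
    return dict(sorted(weekly.items()))
```

A Sunday therefore opens a new week: `week_ending_saturday(date(2021, 6, 6))` maps to Saturday 12 June 2021, not to the day before.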

Sequence data from the RKI and GISAID (via covariants.org) is available aggregated to weeks only with each week ending on Sunday.

Currently, I am assuming that maintaining comparability with the ECDC and German/Polish forecasting hubs is paramount and am therefore crudely adjusting the sequence data to be referenced to Saturday as the end of the week (i.e. by literally subtracting one day from the label and making no other changes). This obviously introduces a small amount of bias but, as above, maintains the link to real-time forecasts.
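To make the crude adjustment concrete, here is a small sketch of the relabelling step. The helper name `relabel_sunday_to_saturday` is hypothetical and this is not the repository's implementation; it only moves the week-ending label, so the counts still cover Monday to Sunday.

```python
from datetime import date, timedelta


def relabel_sunday_to_saturday(week_end: date) -> date:
    """Shift a Sunday week-ending label back one day to the preceding Saturday.

    Only the label changes: the underlying counts still cover Monday to
    Sunday, which is the source of the small bias.
    """
    # date.weekday(): Sunday = 6
    if week_end.weekday() != 6:
        raise ValueError(f"expected a Sunday week-ending date, got {week_end}")
    return week_end - timedelta(days=1)
```

Under this relabelling, six of the seven days in each sequence week (Monday to Saturday) fall in the matching Sunday-to-Saturday hub week, and the final Sunday belongs to the following hub week.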

An obvious alternative is to aggregate case notifications to the same week definition as used in the sequence data, but this will break the link with hub forecasts. Thoughts @jbracher @sbfnk? I am minded to stick with keeping the link with the hubs and accepting the small amount of bias as a cost of that (perhaps a useful point for discussion if we can't think of a better solution).

sbfnk commented 2 years ago

Do the dates even refer to the same thing? As far as I'm aware the JHU data used in the hubs usually refers to the day on which a case appears in the official statistics, whereas I imagine the GISAID data might be by date of specimen collection or date of sequencing?

seabbs commented 2 years ago

No, they definitely don't refer to the same thing, so yes, that is already an unmentioned limitation/approximation. Agree on the date definitions, and I am not clear where the information is for the GISAID data after it has been processed for covariants.org (though I thought it was the date of specimen collection, which would presumably be the closest to case report, but I can't find anything to support that).

Again I would be tempted to stick with the simple approximation and assume they do have the same reference date (within a week at least) but open to other suggestions and/or digging deeper. I'd probably stay away from introducing additional model complexity to account for this but I suppose that could also be an option.

seabbs commented 2 years ago

From covariants.org:

"What date is used on the graphs? The dates used are always the dates a sample was taken. Only samples with a sampling day, month, and year are included in CoVariants, to ensure accuracy."

-> so sequences are by date of sample collection not sequencing or report.

https://covariants.org/faq#where-can-i-get-the-data

seabbs commented 2 years ago

@jbracher ping for thoughts.

sbfnk commented 2 years ago

OK, so assuming it takes ~2 days from collection to result/report, a GISAID Mon-Sun week corresponds to cases Wed-Tue. Annoying that it falls right in the middle, but I agree that allocating it either to the previous or the next week makes most sense - or possibly linearly interpolating between the two.
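One way to read the interpolation option, sketched under loud assumptions: a fixed 2-day delay from collection to report, and sequences spread uniformly over the days of each Mon-Sun week. Neither assumption comes from the data, and the function names are made up for illustration. Each sequence week is split by the number of shifted days falling into each Saturday-ending hub week.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_ending_saturday(day: date) -> date:
    """Return the Saturday that closes the Sunday-to-Saturday week containing `day`."""
    # date.weekday(): Monday = 0, ..., Saturday = 5, Sunday = 6
    return day + timedelta(days=(5 - day.weekday()) % 7)


def reallocate_to_hub_weeks(seq_by_sunday: dict, delay_days: int = 2) -> dict:
    """Split weekly sequence counts (keyed by closing Sunday) across hub weeks.

    Assumes counts are uniform within the week and that every sample is
    reported `delay_days` after collection (both are illustrative assumptions).
    """
    hub = defaultdict(float)
    for week_end, count in seq_by_sunday.items():
        for offset in range(7):
            collected = week_end - timedelta(days=offset)
            reported = collected + timedelta(days=delay_days)
            hub[week_ending_saturday(reported)] += count / 7
    return dict(sorted(hub.items()))
```

With a 2-day delay, the shifted Wed-Tue window puts four days (Wed-Sat) in the hub week ending the day before the sequence week's Sunday and three days (Sun-Tue) in the next one, so each sequence week is weighted 4/7 and 3/7 across the two hub weeks.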