load_observations() returns too many deaths

nikosbosse commented 3 years ago

the function load_observations() which reads the case data from here obs <- fread(here("models", "rt", "data", "summary", target_date, "reported_cases.csv")) returns too many deaths for the US as a whole.

Numbers in the data look different from the ones on Ourworldindata and also differ from the ones returned by get_us_deaths() They do, however, look similar to the ones shown on Google.

seabbs commented 3 years ago

have you tried tracing the data back? The summary your drawing from is made using get_us_deaths so it should be very possible to track down an issue.

nikosbosse commented 3 years ago

working on it :)

seabbs commented 3 years ago

Linked to: #93

nikosbosse commented 3 years ago

it's taking me a while to track down - do you know from the top of your head where the epinow regional data comes from?

seabbs commented 3 years ago

https://github.com/epiforecasts/covid-us-forecasts/blob/ab05deead011d72e9f89a36a5f333200d5d26417/models/rt/update-rt.R#L19

nikosbosse commented 3 years ago

I'm very confused... three different versions to get data, three different results. Presumably I'm just tired and it is really obvious what's going on...

# option 1

source(here::here("utils", "get-us-data.R"))

weekly_deaths_state <- get_us_deaths(data = "daily") %>%
  filter(date >= (as.Date(forecast_date) - 8 * 4)) %>%
  group_by(state, epiweek) %>%
  summarise(deaths = sum(deaths), 
            target_end_date = max(date),
            .groups = "drop_last") %>%
  dplyr::select(-epiweek) %>%
  dplyr::ungroup()

weekly_deaths_national <- weekly_deaths_state %>%
  group_by(target_end_date) %>%
  summarise(deaths = sum(deaths), .groups = "drop_last") %>%
  mutate(state = "US")

# combine and only keep complete epiweeks, marked by the day 'Saturday'
obs2 <- dplyr::bind_rows(weekly_deaths_state, weekly_deaths_national) %>%
  dplyr::filter(weekdays(target_end_date) == "Saturday", 
                target_end_date <= as.Date(forecast_date))

# option 2

deaths <- get_us_deaths(data = "daily")
deaths <- as.data.table(deaths)
deaths <- deaths[, .(region = state, date = as.Date(date), 
                     confirm = deaths)]
us_deaths <- copy(deaths)[, .(confirm = sum(confirm, na.rm = TRUE)), by = "date"]
us_deaths <- us_deaths[, region := "US"]
deaths <- rbindlist(list(us_deaths, deaths), use.names = TRUE)
deaths <- deaths[date <= as.Date(target_date)]
deaths <- deaths[date >= (as.Date(target_date) - weeks(12))]
setorder(deaths, region, date)

obs <- deaths[, .(date, state = region, value = confirm)]

state_codes <- readRDS(here("data", "state_codes.rds"))
obs <- obs[state_codes, on = "state"]

source(here("utils", "dates-to-epiweek.R"))
obs <- dates_to_epiweek(obs)
obs <- obs[epiweek_full == TRUE, .(value = sum(value), date = max(date)), 
           by = .(location, state, epiweek)][, epiweek := NULL]

# option 3

obs3 <- load_observations("2021-02-15")

tail(obs2)
tail(obs)
tail(obs3)

seabbs commented 3 years ago

load_observations exists to load the truth data used for modelling which is stored by EpiNow2 (hence the file path). This means we can evaluate and ensemble against the correct data rather than using data that is updated retrospectively. Once anomaly correction is added this function needs to draw from another folder in which non-adjusted truth data is stored by date.

seabbs commented 3 years ago

The reason data input is different in the time series is that they were written by different people and standardisation was difficult.

I don't know why load-observations gives a different result but it needs more investigation. I'd suggest graphing it.

In general the use of data here had always been quite disjointed and a little messy. It would be good to rationalise aside from this potential bug.

nikosbosse commented 3 years ago

ok it seems like at least the first two do agree (only for some reason the green one has a week of data more when filtering for the same period).

Difference apparently mostly comes from Ohio. But for some reason the US curves don't agree even if almost all of the state curves do agree.

This is I assume because the data we download with get_us_deaths() gets corrected?

Questions are then:

which of these data streams should I use to generate the historic baseline forceasts? (the Rt data?)
which of these to use for future baseline forecasts (this shouldn't matter, right? except to avoid the dependency, so the non-Rt-data?)
do we want to externalise this to a separate package? I had started working on something here https://github.com/epiforecasts/forecasthubutils a while ago, but it never really got used

seabbs commented 3 years ago

This sounds like potentially it is due to the internal anomaly handling in EpiNow2 but I am not totally convinced as all that is is setting days with 0 cases to a local average.

I am not sure a split out package is required to handle data only processing? Though potentially for some of the processing tasks.

We need:

Raw data downloading and storage in a dated csv file.
A single processing of data from state level to US level that is then used everywhere.
Raw data processing with a basic anomaly correct (i.e something like set to moving average if outside 3 standard deviations in the data)
Flag on when anomaly correction has occurred and some kind of reporting process for this.
Use anomaly corrected dated data for all downstream modelling work.
Use raw dated data for plotting.

Anomaly correction in EpiNow2.

https://github.com/epiforecasts/EpiNow2/blob/d2b2aa6e76190000d5aad37e66f132f7c44d4644/R/create.R#L34

nikosbosse commented 3 years ago

Sounds very reasonable. Should we have a quick chat at some point to discuss how to move forward and divide up work?

as this is related to https://github.com/epiforecasts/covid-us-forecasts/pull/88 (that one isn't merged because the plotting hasn't happened yet): how should I proceed with the PR? Keep it open until we solved the data issues, then do all the past plots there and then merge?

seabbs commented 3 years ago

Sounds like a good idea.

Nite sure why this is blocking #88? I'd prefer to keep PRs modular if possible.

nikosbosse commented 3 years ago

What is missing from #88 is an update of past plots with all models. I'm however unsure what data to use for the plotting

Sam Abbott notifications@github.com schrieb am Mi., 24. Feb. 2021, 20:51:

Sounds like a good idea.

Nite sure why this is blocking #88 https://github.com/epiforecasts/covid-us-forecasts/pull/88? I'd prefer to keep PRs modular if possible.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/epiforecasts/covid-us-forecasts/issues/96#issuecomment-785330853, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJBYFLNNA2RXSGP5LSR6CF3TAVKETANCNFSM4X2R7B7Q .

seabbs commented 3 years ago

Can you use the structure as present and we can fix the underling data it draws from later?

nikosbosse commented 3 years ago

:+1: did that. PR can be merged now I think and then we can address this issue

epiforecasts / covid-us-forecasts

load_observations() returns too many deaths #96