ccodwg / CovidTimelineCanada

A definitive dataset for COVID-19 in Canada
https://opencovid.ca/
Other
27 stars 11 forks source link

Alternative PT-level time series #120

Closed jeanpaulrsoucy closed 11 months ago

jeanpaulrsoucy commented 11 months ago

For some provinces, there are alternate PT-level time series that offer superior accuracy and/or temporal resolution compared to the HR-level time series. This applies mainly to death data. In fact, many provinces that report cases/deaths via weekly reports may be better served by simply using the PHAC time series.

Only twenty time series are potentially eligible for this treatment (10 provinces * 2 metrics). The first step is determining which province-metric combinations process could apply to:

Cases

Deaths

jeanpaulrsoucy commented 11 months ago

This issue includes #81.

jeanpaulrsoucy commented 11 months ago

For SK deaths: on 2022-06-25, the HR-level deaths time series is 1426 and the PHAC cumulative total is 1427, which is extremely close agreement. However, by 2023-09-23 we have 1903 and 2007, respectively, which is a large deviation.

The PHAC time series for SK seems to be complete and receive updates in every week. For the sake of accuracy, we can replace the dataset after 2022-06-25 with the PHAC dataset, even if it makes it slightly less timely. (#118, #81)

jeanpaulrsoucy commented 11 months ago

The PHAC death dataset seems wrong for NS after the beginning of monthly epidemiology report:

Original dataset deaths on 2023-05-22: 866; PHAC dataset deaths on 2023-05-20/2023-05-27/2023-06-03: 866 (all these dates are listed as updated and have case updates).

However, by the end of August 2023, you have original dataset: 907; PHAC dataset: 871. They seem to be using numbers different from the "monthly" numbers, but much smaller than cumulative differences we calculate for the deaths (and different from the cumulative differences used for the cases, see post below).

The PHAC dataset should not be used for NS deaths post-monthly epidemiology report.

For the potential partial replacement from 2021-12-10 to 2022-06-08, we have:

Original dataset: 110, 421 PHAC dataset (daily): 110, 421

So there is no reason to replace it, either. Additionally, the PHAC dataset goes weekly beginning 2022-03-08, the same time the original dataset does.

jeanpaulrsoucy commented 11 months ago

Comparison with the PHAC cases dataset:

Original dataset cases on 2023-05-22: 145789; PHAC dataset cases on 2023-05-27: 143696.

By the end of August 2023, you have original dataset: 147179; PHAC dataset: 145019.

Because of the differences in the cumulative values, it would be difficult to use the NS dataset as an alternative.

The PHAC dataset does seem to use cumulative differences for cases, as we get the same case values for June to August 2023: 460, 281, 454.

For the potential partial replacement from 2021-12-10 to 2022-06-08, we have:

Original dataset: 8537, 101913 PHAC dataset (daily): 8655, 99514

Again, there is no real compelling reason to make the replacement, as cases start out ahead but then fall behind the original dataset. Additionally, the PHAC dataset goes weekly beginning 2022-03-08, whereas some daily data are still available in the original dataset after this.

jeanpaulrsoucy commented 11 months ago

Notes regarding replacement of AB, ON, SK death data:

jeanpaulrsoucy commented 11 months ago

Note that the NB PHAC data currently has another problematic jump in cases, this one not related to the death reconciliation and probably an error. Thus, it should not be used at this time.

2022-12-10 -> 2022-12-17: 628 -> 722. The week ending 2022-12-17 is the first week that the NB report (https://www2.gnb.ca/content/dam/gnb/Departments/eco-bce/Promo/covid-19/report-rapport/covid-report-2022-week-50.pdf) went from reporting cumulative deaths (all time) to reporting weekly deaths and cumulative deaths SINCE THE BEGINNING OF AUGUST 28, 2022. In this report, they give 7 weekly deaths and 95 cumulative deaths since August 28, 2022. Notice in the PHAC dataset we see an increase of 94 deaths (722 - 628) on 2022-12-17. This is almost exactly the cumulative deaths since August 28, 2022 (95). Yes, the count of 7 weekly deaths is almost certainly an undercount of the true addition to cumulative deaths since it does not account for historical revisions, the true number is also probably not 94!

jeanpaulrsoucy commented 11 months ago

NL death data can be considered for replacement with PHAC data (starting when updates went biweekly, later monthly), since the temporal resolution is better (weekly rather than biweekly/monthly). I will wait for the next PHAC data update to compare. If the PHAC deaths are significantly lower than the NL datasets calculated using the daily death values, then I will be considered about data quality issues and not replace it.

jeanpaulrsoucy commented 11 months ago

Regarding the idea of full replacement of SK data after 2022-06-25, the problem is the difference in cumulative values for SK between the PHAC dataset and current dataset on this date:

Current: 2022-06-25 / 138332 PHAC: 2022-06-25 / 139661

The final date of the original SK dataset (2022-02-06) doesn't line up with the PHAC update dates (Saturday), so we can compare the closest date (2022-02-05):

Current: 2022-02-05 / 122525 PHAC: 2022-02-05 / 122525

So we can start using the PHAC dataset for SK on 2022-02-12.

NB's cumulative cases line up at the common end of the datasets:

Current: 2023-10-07 / 92101 PHAC: 2023-10-07 / 92101

MB's cumulative cases are very close:

Current: 2023-10-14 / 157850 PHAC: 2023-10-14 / 157852

jeanpaulrsoucy commented 11 months ago

If we look at the overlap between the current datasets and the PHAC dataset for the periods mentioned above, we get the following:

Rplot

The current dataset is in black and the PHAC dataset is in blue.

For MB, the PHAC dataset has a gap that I managed to smooth out with the report data. The cumulative values are almost the same, as well, so there is no real benefit to using the PHAC dataset here. The PHAC dataset smooths out some gaps from the reports for NB and SK, so these should definitely be used as PT-level replacements. Should also investigate why the SK cumulative values depart even though the weekly values seem to line up for the period of the CRISP reports.

Code:

# load packages
library(dplyr)
library(readr)
library(lubridate)
library(ggplot2)

# load data
t1 <- read_csv("data/pt/cases_pt.csv") |> filter(region %in% c("MB", "NB", "SK"))
t2 <- read_csv("raw_data/active_ts/can/can_cases_pt_ts.csv") |> filter(region %in% c("MB", "NB", "SK")) |>
  select(-update)
t2 <- group_by(t2, region) |>
  mutate(value_daily = c(0, diff(value))) |>
  ungroup()

# filter to Saturdays for the current dataset
t1 <- t1 |> filter(weekdays(date) == "Saturday")
t2 <- t2 |> filter(weekdays(date) == "Saturday")

# filter to comparison dates
t1 <- t1 |> filter(!(region == "MB" & date < "2022-11-05"))
t2 <- t2 |> filter(!(region == "MB" & date < "2022-11-05"))
t1 <- t1 |> filter(!(region == "NB" & date < "2022-12-10"))
t2 <- t2 |> filter(!(region == "NB" & date < "2022-12-10"))
t1 <- t1 |> filter(!(region == "SK" & date < "2022-02-06"))
t2 <- t2 |> filter(!(region == "SK" & date < "2022-02-06"))

# plot comparisons of overlapping period
ggplot(data = NULL, aes(x = date, y = value_daily)) +
  geom_line(data = t1,
            colour = "black") +
  geom_line(data = t2,
            colour = "blue") +
  facet_wrap(~region, ncol = 4, scales = "free") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = NULL, y = "Cases (weekly)")
jeanpaulrsoucy commented 11 months ago

Full replacement of the HR-level time series with the PT-level PHAC data is not feasible using cumulative values, since cumulative value diverge immediately after the SK dashboard time series ends, although you could use the weekly values and simply append them.

HR-level data aggregated to the PT-level (black) versus PHAC PT-level data (blue):

Rplot

Code:

# load packages
library(dplyr)
library(readr)
library(lubridate)
library(ggplot2)

# load data
t1 <- read_csv("data/hr/cases_hr.csv") |>
  filter(region == "SK") |>
  group_by(name, region, date) |>
  summarize(value = sum(value))
t2 <- read_csv("raw_data/active_ts/can/can_cases_pt_ts.csv") |>
  filter(region == "SK") |>
  filter(update == 1)

# filter to Saturdays for the current dataset
t1 <- t1 |> filter(weekdays(date) == "Saturday")
t2 <- t2 |> filter(weekdays(date) == "Saturday")

# filter to comparison dates
t1 <- t1 |> filter(date >= "2022-02-05")
t2 <- t2 |> filter(date >= "2022-02-05")
t1 <- t1 |> filter(date <= max(t2$date))
t2 <- t2 |> filter(date <= max(t1$date))

# plot comparisons of overlapping period
ggplot(data = NULL, aes(x = date, y = value)) +
  geom_line(data = t1,
            colour = "black") +
  geom_line(data = t2,
            colour = "blue") +
  facet_wrap(~region, ncol = 4, scales = "free") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = NULL, y = "Cumulative cases")
jeanpaulrsoucy commented 11 months ago

Interestingly, the HR-level dataset has 138332 cases on 2022-06-25 and the PHAC dataset has 139661. However, counting only cumulative cases reported starting 2022-07-02 (the first report without HR-level data), the cumulative number of cases reported between the two datasets are almost exactly the same (17848 versus 17850 on 2023-10-07). This supports a full replacement using weekly values, since doing so smooths over the gaps left by the monthly dataset.

Rplot01

Code:

# load packages
library(dplyr)
library(readr)
library(lubridate)
library(ggplot2)

# load data
t1 <- read_csv("data/hr/cases_hr.csv") |>
  filter(region == "SK") |>
  group_by(name, region, date) |>
  summarize(value = sum(value), value_daily = sum(value_daily), .groups = "drop")
t2 <- read_csv("raw_data/active_ts/can/can_cases_pt_ts.csv") |>
  filter(region == "SK") |>
  filter(update == 1)

# filter to Saturdays for the current dataset
t1 <- t1 |> filter(weekdays(date) == "Saturday")
t2 <- t2 |> filter(weekdays(date) == "Saturday")

# cases starting 2022-07-02
t1$value <- t1$value - t1[t1$date == "2022-06-25", "value", drop = TRUE]
t2$value <- t2$value - t2[t2$date == "2022-06-25", "value", drop = TRUE]

# filter to comparison dates
t1 <- t1 |> filter(date >= "2022-07-02")
t2 <- t2 |> filter(date >= "2022-07-02")
t1 <- t1 |> filter(date <= max(t2$date))
t2 <- t2 |> filter(date <= max(t1$date))

# plot comparisons of overlapping period
ggplot(data = NULL, aes(x = date, y = value)) +
  geom_line(data = t1,
            colour = "black") +
  geom_line(data = t2,
            colour = "blue") +
  facet_wrap(~region, ncol = 4, scales = "free") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = NULL, y = "Cumulative cases starting 2022-07-02")
jeanpaulrsoucy commented 11 months ago

Once you visualize it like this, you realize that the way that the way the PHAC data smoothed over the missing weeks in the three monthly reports was simply equally dividing cases across each of the weeks. Could the 2-case difference result from rounding errors from these calculations? No big deal either way.

Screenshot 2023-10-29 at 14-06-46 Timeline of COVID-19 in Canada Dashboard

jeanpaulrsoucy commented 11 months ago

Will not be using be using PHAC time series for NL deaths because deaths diverge in 2023 (seem to be undercounted in the PHAC dataset), even though updates are smoother. This divergence seems concentrated in the period after the province reported cumulative HR-level deaths (ended 2023-06-21), but seems to have begun a few months earlier. It is possible that PHAC is not properly counting deaths from the previous period, or perhaps retroactive updates on behalf of NL actually mean our estimates our wrong.

Rplot

Rplot01

Either way, we will continue with the existing strategy for updating NL death data (#113).

Code:

# load packages
library(dplyr)
library(readr)
library(lubridate)
library(ggplot2)

# load data
t1 <- read_csv("data/pt/deaths_pt.csv") |>
  filter(region == "NL")
t2 <- read_csv("raw_data/active_ts/can/can_deaths_pt_ts.csv") |>
  filter(region == "NL") |>
  filter(update == 1)

# filter to Saturdays for the current dataset
t1 <- t1 |> filter(weekdays(date) == "Saturday")
t2 <- t2 |> filter(weekdays(date) == "Saturday")

# filter to comparison dates
t1 <- t1 |> filter(date <= max(t2$date))
t2 <- t2 |> filter(date <= max(t1$date))

# plot comparisons
ggplot(data = NULL, aes(x = date, y = value)) +
  geom_line(data = t1,
            colour = "black") +
  geom_line(data = t2,
            colour = "blue") +
  facet_wrap(~region, ncol = 4, scales = "free") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = NULL, y = "Cumulative deaths")

# plot compoarisons in 2023
ggplot(data = NULL, aes(x = date, y = value)) +
  geom_line(data = t1 |> filter(date >= "2023-01-01"),
            colour = "black") +
  geom_line(data = t2 |> filter(date >= "2023-01-01"),
            colour = "blue") +
  facet_wrap(~region, ncol = 4, scales = "free") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = NULL, y = "Cumulative deaths")