Closed jeanpaulrsoucy closed 11 months ago
This issue includes #81.
For SK deaths: on 2022-06-25, the HR-level deaths time series is 1426 and the PHAC cumulative total is 1427, which is extremely close agreement. However, by 2023-09-23 we have 1903 and 2007, respectively, which is a large deviation.
The PHAC time series for SK seems to be complete and receive updates in every week. For the sake of accuracy, we can replace the dataset after 2022-06-25 with the PHAC dataset, even if it makes it slightly less timely. (#118, #81)
The PHAC death dataset seems wrong for NS after the beginning of monthly epidemiology report:
Original dataset deaths on 2023-05-22: 866; PHAC dataset deaths on 2023-05-20/2023-05-27/2023-06-03: 866 (all these dates are listed as updated and have case updates).
However, by the end of August 2023, you have original dataset: 907; PHAC dataset: 871. They seem to be using numbers different from the "monthly" numbers, but much smaller than cumulative differences we calculate for the deaths (and different from the cumulative differences used for the cases, see post below).
The PHAC dataset should not be used for NS deaths post-monthly epidemiology report.
For the potential partial replacement from 2021-12-10 to 2022-06-08, we have:
Original dataset: 110, 421 PHAC dataset (daily): 110, 421
So there is no reason to replace it, either. Additionally, the PHAC dataset goes weekly beginning 2022-03-08, the same time the original dataset does.
Comparison with the PHAC cases dataset:
Original dataset cases on 2023-05-22: 145789; PHAC dataset cases on 2023-05-27: 143696.
By the end of August 2023, you have original dataset: 147179; PHAC dataset: 145019.
Because of the differences in the cumulative values, it would be difficult to use the NS dataset as an alternative.
The PHAC dataset does seem to use cumulative differences for cases, as we get the same case values for June to August 2023: 460, 281, 454.
For the potential partial replacement from 2021-12-10 to 2022-06-08, we have:
Original dataset: 8537, 101913 PHAC dataset (daily): 8655, 99514
Again, there is no real compelling reason to make the replacement, as cases start out ahead but then fall behind the original dataset. Additionally, the PHAC dataset goes weekly beginning 2022-03-08, whereas some daily data are still available in the original dataset after this.
Notes regarding replacement of AB, ON, SK death data:
active_ts/ab/ab_deaths_pt_ts.csv
), as it is not neededNote that the NB PHAC data currently has another problematic jump in cases, this one not related to the death reconciliation and probably an error. Thus, it should not be used at this time.
2022-12-10 -> 2022-12-17: 628 -> 722. The week ending 2022-12-17 is the first week that the NB report (https://www2.gnb.ca/content/dam/gnb/Departments/eco-bce/Promo/covid-19/report-rapport/covid-report-2022-week-50.pdf) went from reporting cumulative deaths (all time) to reporting weekly deaths and cumulative deaths SINCE THE BEGINNING OF AUGUST 28, 2022. In this report, they give 7 weekly deaths and 95 cumulative deaths since August 28, 2022. Notice in the PHAC dataset we see an increase of 94 deaths (722 - 628) on 2022-12-17. This is almost exactly the cumulative deaths since August 28, 2022 (95). Yes, the count of 7 weekly deaths is almost certainly an undercount of the true addition to cumulative deaths since it does not account for historical revisions, the true number is also probably not 94!
NL death data can be considered for replacement with PHAC data (starting when updates went biweekly, later monthly), since the temporal resolution is better (weekly rather than biweekly/monthly). I will wait for the next PHAC data update to compare. If the PHAC deaths are significantly lower than the NL datasets calculated using the daily death values, then I will be considered about data quality issues and not replace it.
Regarding the idea of full replacement of SK data after 2022-06-25, the problem is the difference in cumulative values for SK between the PHAC dataset and current dataset on this date:
Current: 2022-06-25 / 138332 PHAC: 2022-06-25 / 139661
The final date of the original SK dataset (2022-02-06) doesn't line up with the PHAC update dates (Saturday), so we can compare the closest date (2022-02-05):
Current: 2022-02-05 / 122525 PHAC: 2022-02-05 / 122525
So we can start using the PHAC dataset for SK on 2022-02-12.
NB's cumulative cases line up at the common end of the datasets:
Current: 2023-10-07 / 92101 PHAC: 2023-10-07 / 92101
MB's cumulative cases are very close:
Current: 2023-10-14 / 157850 PHAC: 2023-10-14 / 157852
If we look at the overlap between the current datasets and the PHAC dataset for the periods mentioned above, we get the following:
The current dataset is in black and the PHAC dataset is in blue.
For MB, the PHAC dataset has a gap that I managed to smooth out with the report data. The cumulative values are almost the same, as well, so there is no real benefit to using the PHAC dataset here. The PHAC dataset smooths out some gaps from the reports for NB and SK, so these should definitely be used as PT-level replacements. Should also investigate why the SK cumulative values depart even though the weekly values seem to line up for the period of the CRISP reports.
Code:
# load packages
library(dplyr)
library(readr)
library(lubridate)
library(ggplot2)
# load data
t1 <- read_csv("data/pt/cases_pt.csv") |> filter(region %in% c("MB", "NB", "SK"))
t2 <- read_csv("raw_data/active_ts/can/can_cases_pt_ts.csv") |> filter(region %in% c("MB", "NB", "SK")) |>
select(-update)
t2 <- group_by(t2, region) |>
mutate(value_daily = c(0, diff(value))) |>
ungroup()
# filter to Saturdays for the current dataset
t1 <- t1 |> filter(weekdays(date) == "Saturday")
t2 <- t2 |> filter(weekdays(date) == "Saturday")
# filter to comparison dates
t1 <- t1 |> filter(!(region == "MB" & date < "2022-11-05"))
t2 <- t2 |> filter(!(region == "MB" & date < "2022-11-05"))
t1 <- t1 |> filter(!(region == "NB" & date < "2022-12-10"))
t2 <- t2 |> filter(!(region == "NB" & date < "2022-12-10"))
t1 <- t1 |> filter(!(region == "SK" & date < "2022-02-06"))
t2 <- t2 |> filter(!(region == "SK" & date < "2022-02-06"))
# plot comparisons of overlapping period
ggplot(data = NULL, aes(x = date, y = value_daily)) +
geom_line(data = t1,
colour = "black") +
geom_line(data = t2,
colour = "blue") +
facet_wrap(~region, ncol = 4, scales = "free") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = NULL, y = "Cases (weekly)")
Full replacement of the HR-level time series with the PT-level PHAC data is not feasible using cumulative values, since cumulative value diverge immediately after the SK dashboard time series ends, although you could use the weekly values and simply append them.
HR-level data aggregated to the PT-level (black) versus PHAC PT-level data (blue):
Code:
# load packages
library(dplyr)
library(readr)
library(lubridate)
library(ggplot2)
# load data
t1 <- read_csv("data/hr/cases_hr.csv") |>
filter(region == "SK") |>
group_by(name, region, date) |>
summarize(value = sum(value))
t2 <- read_csv("raw_data/active_ts/can/can_cases_pt_ts.csv") |>
filter(region == "SK") |>
filter(update == 1)
# filter to Saturdays for the current dataset
t1 <- t1 |> filter(weekdays(date) == "Saturday")
t2 <- t2 |> filter(weekdays(date) == "Saturday")
# filter to comparison dates
t1 <- t1 |> filter(date >= "2022-02-05")
t2 <- t2 |> filter(date >= "2022-02-05")
t1 <- t1 |> filter(date <= max(t2$date))
t2 <- t2 |> filter(date <= max(t1$date))
# plot comparisons of overlapping period
ggplot(data = NULL, aes(x = date, y = value)) +
geom_line(data = t1,
colour = "black") +
geom_line(data = t2,
colour = "blue") +
facet_wrap(~region, ncol = 4, scales = "free") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = NULL, y = "Cumulative cases")
Interestingly, the HR-level dataset has 138332 cases on 2022-06-25 and the PHAC dataset has 139661. However, counting only cumulative cases reported starting 2022-07-02 (the first report without HR-level data), the cumulative number of cases reported between the two datasets are almost exactly the same (17848 versus 17850 on 2023-10-07). This supports a full replacement using weekly values, since doing so smooths over the gaps left by the monthly dataset.
Code:
# load packages
library(dplyr)
library(readr)
library(lubridate)
library(ggplot2)
# load data
t1 <- read_csv("data/hr/cases_hr.csv") |>
filter(region == "SK") |>
group_by(name, region, date) |>
summarize(value = sum(value), value_daily = sum(value_daily), .groups = "drop")
t2 <- read_csv("raw_data/active_ts/can/can_cases_pt_ts.csv") |>
filter(region == "SK") |>
filter(update == 1)
# filter to Saturdays for the current dataset
t1 <- t1 |> filter(weekdays(date) == "Saturday")
t2 <- t2 |> filter(weekdays(date) == "Saturday")
# cases starting 2022-07-02
t1$value <- t1$value - t1[t1$date == "2022-06-25", "value", drop = TRUE]
t2$value <- t2$value - t2[t2$date == "2022-06-25", "value", drop = TRUE]
# filter to comparison dates
t1 <- t1 |> filter(date >= "2022-07-02")
t2 <- t2 |> filter(date >= "2022-07-02")
t1 <- t1 |> filter(date <= max(t2$date))
t2 <- t2 |> filter(date <= max(t1$date))
# plot comparisons of overlapping period
ggplot(data = NULL, aes(x = date, y = value)) +
geom_line(data = t1,
colour = "black") +
geom_line(data = t2,
colour = "blue") +
facet_wrap(~region, ncol = 4, scales = "free") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = NULL, y = "Cumulative cases starting 2022-07-02")
Once you visualize it like this, you realize that the way that the way the PHAC data smoothed over the missing weeks in the three monthly reports was simply equally dividing cases across each of the weeks. Could the 2-case difference result from rounding errors from these calculations? No big deal either way.
Will not be using be using PHAC time series for NL deaths because deaths diverge in 2023 (seem to be undercounted in the PHAC dataset), even though updates are smoother. This divergence seems concentrated in the period after the province reported cumulative HR-level deaths (ended 2023-06-21), but seems to have begun a few months earlier. It is possible that PHAC is not properly counting deaths from the previous period, or perhaps retroactive updates on behalf of NL actually mean our estimates our wrong.
Either way, we will continue with the existing strategy for updating NL death data (#113).
Code:
# load packages
library(dplyr)
library(readr)
library(lubridate)
library(ggplot2)
# load data
t1 <- read_csv("data/pt/deaths_pt.csv") |>
filter(region == "NL")
t2 <- read_csv("raw_data/active_ts/can/can_deaths_pt_ts.csv") |>
filter(region == "NL") |>
filter(update == 1)
# filter to Saturdays for the current dataset
t1 <- t1 |> filter(weekdays(date) == "Saturday")
t2 <- t2 |> filter(weekdays(date) == "Saturday")
# filter to comparison dates
t1 <- t1 |> filter(date <= max(t2$date))
t2 <- t2 |> filter(date <= max(t1$date))
# plot comparisons
ggplot(data = NULL, aes(x = date, y = value)) +
geom_line(data = t1,
colour = "black") +
geom_line(data = t2,
colour = "blue") +
facet_wrap(~region, ncol = 4, scales = "free") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = NULL, y = "Cumulative deaths")
# plot compoarisons in 2023
ggplot(data = NULL, aes(x = date, y = value)) +
geom_line(data = t1 |> filter(date >= "2023-01-01"),
colour = "black") +
geom_line(data = t2 |> filter(date >= "2023-01-01"),
colour = "blue") +
facet_wrap(~region, ncol = 4, scales = "free") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = NULL, y = "Cumulative deaths")
For some provinces, there are alternate PT-level time series that offer superior accuracy and/or temporal resolution compared to the HR-level time series. This applies mainly to death data. In fact, many provinces that report cases/deaths via weekly reports may be better served by simply using the PHAC time series.
Only twenty time series are potentially eligible for this treatment (10 provinces * 2 metrics). The first step is determining which province-metric combinations process could apply to:
Cases
AB(no better alternative)BC(no better alternative)MB (PHAC after 2022-11-05)NL(no better alternative - data are not cumulative but are daily rather than weekly)NS (PHAC after 2023-05-22 - potentially replace current HR data; could technically do a partial replacement of 2021-12-10 to 2022-06-08 with daily PHAC data, explore further)(no to replacement and alternate - see discussion below)ON(full HR TS available)PE(currently uses PHAC - may replace)QC(full HR TS available)Deaths
(daily + weekly PT-level TS available)(daily + PHAC)BC(no better alternative)MB(uses PHAC for current updates)NB (PHAC after 2022-12-10; large bump from 2022-10-18 also exists in PHAC data, so no point replacing earlier )NB dataset both contains the large bump that this would be intended to smooth over plus an additional data error that makes later substitution pointless (see #80)NL(NL-based and PHAC-based estimates diverge in mid-2023)NS (PHAC after 2023-05-22 - potentially replace current HR data)(no to replacement and alternate - see discussion below)PE(currently uses PHAC - may replace)QC(full HR TS available)