ccodwg / CovidTimelineCanada

A definitive dataset for COVID-19 in Canada
https://opencovid.ca/

Add vaccine administration data #47

Closed jeanpaulrsoucy closed 8 months ago

jeanpaulrsoucy commented 2 years ago

Vaccine administration data up to 5th doses (for some provinces) are available (PHAC data).

jeanpaulrsoucy commented 1 year ago

The Canada-level dataset now excludes Quebec, which causes large drops in its dose numbers. Will have to figure out how to deal with this.

jeanpaulrsoucy commented 8 months ago

Note that the PHAC vaccine administration dataset was originally daily but has since been converted to weekly data (similar to the case/death data). However, the vaccine coverage dataset (which also includes doses) was always weekly.
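A minimal sketch of what a daily-to-weekly conversion can look like for a cumulative series: keep the last observed value in each week. The data are made up, the column names (`date`, `region`, `value`) mirror the ones used elsewhere in this thread, and the week-ending-on-Sunday convention is an assumption rather than something confirmed for the PHAC dataset. It uses base R only so it runs without extra packages.

```r
# Toy daily cumulative series for one region (values are made up)
daily <- data.frame(
  date = seq(as.Date("2022-05-02"), as.Date("2022-05-15"), by = "day"),
  region = "ON",
  value = cumsum(rep(10, 14))
)

# Assumption: weeks end on Sunday. Map each date to the Sunday closing its week.
# as.POSIXlt()$wday is 0 for Sunday, 1 for Monday, ..., 6 for Saturday.
wday <- as.POSIXlt(daily$date)$wday
daily$week_end <- daily$date + (7 - wday) %% 7

# For a cumulative series, the weekly value is the last daily value in the week
daily <- daily[order(daily$region, daily$date), ]
last_in_week <- !duplicated(daily[, c("region", "week_end")], fromLast = TRUE)
weekly <- daily[last_in_week, c("week_end", "region", "value")]
names(weekly)[1] <- "date"
rownames(weekly) <- NULL
weekly
```

Taking the last value per week only makes sense for cumulative counts; for daily increments the weekly value would be a sum instead.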

jeanpaulrsoucy commented 8 months ago

Confirmed that the PHAC vaccine dataset (194a0002-5ad1-4016-8788-e7a216216a92) and the QC vaccine dataset (4e04442d-f372-4357-ba15-3b64f4e03fbe) both have total doses columns that sum exactly to all the other columns (including an unknown dose column, in the case of PHAC, which is used mainly for QC).
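For reference, a check of this kind could be sketched as follows. The table and its column names (`dose_1`, `dose_2`, `dose_unknown`, `total_doses`) are placeholders for illustration, not the real column names of either dataset.

```r
# Toy table; column names are placeholders, not the real PHAC/QC column names
doses <- data.frame(
  region = c("AB", "QC"),
  dose_1 = c(100, 0),
  dose_2 = c(80, 0),
  dose_unknown = c(0, 250),
  total_doses = c(180, 250)
)

# Sum every dose column except the reported total, row by row
dose_cols <- setdiff(names(doses), c("region", "total_doses"))
doses$recomputed_total <- rowSums(doses[, dose_cols], na.rm = TRUE)

# Rows where the reported total disagrees with the recomputed one (none here)
mismatches <- doses[doses$total_doses != doses$recomputed_total, ]
nrow(mismatches)
```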

jeanpaulrsoucy commented 8 months ago

Code to plot all dose time series:

  vaccine_administration_total_doses_pt <- dplyr::left_join(
    vaccine_administration_dose_1_pt |>
      dplyr::transmute(.data$date, .data$region, dose_1 = .data$value),
    vaccine_administration_dose_2_pt |>
      dplyr::transmute(.data$date, .data$region, dose_2 = .data$value),
    by = c("date", "region")) |>
    dplyr::left_join(
      vaccine_administration_dose_3_pt |>
        dplyr::transmute(.data$date, .data$region, dose_3 = .data$value),
      by = c("date", "region")) |>
    dplyr::left_join(
      vaccine_administration_dose_4_pt |>
        dplyr::transmute(.data$date, .data$region, dose_4 = .data$value),
      by = c("date", "region")) |>
    dplyr::left_join(
      vaccine_administration_dose_5plus_pt |>
        dplyr::transmute(.data$date, .data$region, dose_5plus = .data$value),
      by = c("date", "region")) |>
    dplyr::rowwise() |>
    dplyr::mutate(total_doses = sum(dose_1, dose_2, dose_3, dose_4, dose_5plus, na.rm = TRUE)) |>
    dplyr::ungroup()
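  library(ggplot2)
  # for each province/territory, plot every dose series plus the total
  for (pt in unique(vaccine_administration_total_doses_pt$region)) {
    d <- vaccine_administration_total_doses_pt |>
      dplyr::filter(.data$region == pt)
    # map colour to a label so the six lines can be told apart in the legend
    p <- ggplot(d, aes(x = date)) +
      geom_line(aes(y = dose_1, colour = "dose_1")) +
      geom_line(aes(y = dose_2, colour = "dose_2")) +
      geom_line(aes(y = dose_3, colour = "dose_3")) +
      geom_line(aes(y = dose_4, colour = "dose_4")) +
      geom_line(aes(y = dose_5plus, colour = "dose_5plus")) +
      geom_line(aes(y = total_doses, colour = "total_doses")) +
      # vertical line marks the start of the PHAC data source
      geom_vline(xintercept = as.Date("2022-05-08")) +
      labs(title = pt, y = "doses", colour = "series")
    print(p)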
  }

jeanpaulrsoucy commented 8 months ago

Summary of update to this dataset:

jeanpaulrsoucy commented 8 months ago

Final data note regarding first doses and total doses:

Dose 1: With the exception of QC, there may be anomalies in the time series on 2022-05-08 due to a transition from the original CCODWG dataset (ending 2022-05-03) to the PHAC data source (beginning 2022-05-08). Also with the exception of QC, first doses may be slightly overestimated prior to 2022-05-08, because first doses were calculated differently in the original CCODWG dataset than in the PHAC dataset.

Total doses: With the exception of QC, there may be anomalies in the time series on 2022-05-08 due to a transition from the original CCODWG dataset (ending 2022-05-03) to the PHAC data source (beginning 2022-05-08). For the same reason, with the exception of QC, total doses may be slightly overestimated between 2021-07-03 (first date that fourth doses are reported) and 2022-05-08. This is because some dose 4/dose 5+ values may have been double-counted as first doses.

The cumulative values for total doses and dose 1 do indeed decline in most cases on 2022-05-08 (after the transition from the original CCODWG dataset to the PHAC dataset), although it's not clear how much of this is due specifically to the problem theorized above and how much to general revisions in the numbers that occurred later. There was limited opportunity to avoid this in the CCODWG dataset, as many places did not explicitly report doses above 3 until much later, and dose 1 was not explicitly recorded, only total doses, dose 2, and dose 3.
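A quick way to locate declines like this in a cumulative series is to look for negative differences between consecutive observations. The sketch below uses made-up values and hypothetical column names (`date`, `value`); it only flags where a decline occurs and says nothing about its cause.

```r
# Toy cumulative series (made-up values) with a drop at the source transition
cumulative <- data.frame(
  date = as.Date(c("2022-04-19", "2022-04-26", "2022-05-03",
                   "2022-05-08", "2022-05-15")),
  value = c(1000, 1050, 1100, 1080, 1120)
)

# Change from the previous observation; NA for the first row
cumulative$change <- c(NA, diff(cumulative$value))

# A cumulative count should never decrease, so negative changes flag anomalies
declines <- cumulative[!is.na(cumulative$change) & cumulative$change < 0, ]
declines
```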

This could be revisited in the future by, for example, using the CCODWG dataset only up to the date when the dose 4 dataset begins (2021-07-03). However, this introduces other problems like loss of temporal granularity (daily versus weekly data) and potentially later start dates for the dose 3 time series. On the latter point, however, a cursory glance shows that the dose 3 time series may actually be better/earlier with the PHAC dataset compared to the original CCODWG dataset (unlike the dreadful dose 3 coverage values that motivated canada-covid-vaccine-coverage). An exception would have to be made for MB, however, given the problems with that dataset before 2022-05-08.
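A rough sketch of that kind of splice, under stated assumptions: `ccodwg` and `phac` are toy stand-ins with made-up values, and the choice to keep the original source strictly before the cutoff (rather than through it) is arbitrary. It is not the approach actually used in the repository.

```r
# Toy stand-ins for the two sources (made-up values)
ccodwg <- data.frame(
  date = seq(as.Date("2021-06-30"), as.Date("2021-07-06"), by = "day"),
  value = seq(100, 160, by = 10)
)
phac <- data.frame(
  date = as.Date(c("2021-07-03", "2021-07-10", "2021-07-17")),
  value = c(128, 195, 260)
)

# Hypothetical cutoff: the first date of the dose 4 dataset mentioned above
cutoff <- as.Date("2021-07-03")

# Keep the original source before the cutoff and the new source from it onward
spliced <- rbind(
  cbind(ccodwg[ccodwg$date < cutoff, ], source = "ccodwg"),
  cbind(phac[phac$date >= cutoff, ], source = "phac")
)
spliced
```

Keeping a `source` column makes it easy to see where the daily series ends and the weekly one begins.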

jeanpaulrsoucy commented 8 months ago

MB had a gap in reporting from 2022-03-30 (according to a5801472-42ae-409e-aedd-9bf92831434a) or 2022-03-31 (according to the original CCODWG dataset) until the end of the original CCODWG dataset (2022-05-03). While they do have numbers in the PHAC dataset for this period, the time series is bad and not corrected until 2022-05-22. Therefore there will be a gap in reporting (and an adjustment in numbers) no matter what I do.
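Gaps like this can be found by measuring the spacing between consecutive report dates. The sketch below uses made-up dates, and the 7-day threshold (`max_gap`) is an assumption chosen so that ordinary weekly reporting is not flagged.

```r
# Toy report dates (made up) with a hole between 2022-03-30 and 2022-05-08
report_dates <- as.Date(c("2022-03-28", "2022-03-29", "2022-03-30",
                          "2022-05-08", "2022-05-15"))

# Days between consecutive reports
gap_days <- as.numeric(diff(report_dates))

# Assumption: anything longer than a week counts as a reporting gap
max_gap <- 7
is_gap <- gap_days > max_gap
gaps <- data.frame(
  last_report = report_dates[-length(report_dates)][is_gap],
  next_report = report_dates[-1][is_gap],
  days = gap_days[is_gap]
)
gaps
```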

jeanpaulrsoucy commented 8 months ago

Based on the thoughts above and further experimentation, the final update is summarized below:

jeanpaulrsoucy commented 8 months ago

Updated early ON dataset using official time series so that we don't have early missing dose 2 information for the province. Also made sure to actually write the dose 5+ datasets.