Data standard: General points and list of standards for specific data types

Discussions of the data standard for specific types of data are linked below:

Other aspects of the data standard are linked below:

A list of some sources of inspiration for the data standard:

Some considerations for the data standard:

Date variables that mean different things
- Cases: Date the case was reported publicly (most provinces), date the case was reported internally (e.g., BC), “episode date” (i.e., a proxy for the date of symptom onset; e.g., ON)
- Testing: Date sample was taken, date result was reported (QC reports a variety of dates)
Values that mean different things
- Cases: Confirmed cases, confirmed & probable cases, epi-linked cases (e.g., BC, QC)
- Testing: Tests performed, people tested
- Hosp/ICU: Occupancy versus daily admissions (most provinces report only the former)
Values that apply to different geographies
- Province/territory
- Health region (sub_region_1)
- Sub-health region (e.g., individual city, sub_region_2)
Values that apply to different populations
- Demographics (e.g., age and sex)
- Residents, non-residents, residents & non-residents
- Vaccination status
- Locally versus travel-acquired cases (especially important for Atlantic Canada)
- Testing: provincial versus private testing (e.g., BC)
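
One way to keep these distinctions from getting lost is to attach them to every value as explicit metadata. The following is only a minimal sketch of that idea, not a proposed schema: the class name `Observation`, the helper `comparable`, and all field names and category labels other than `sub_region_1` and `sub_region_2` are hypothetical and chosen for illustration.

```python
import datetime
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """A single reported value plus the metadata needed to interpret it."""
    name: str                    # hypothetical: "cases", "testing", "hosp"
    region: str                  # province/territory
    obs_date: datetime.date
    value: int
    date_type: str               # hypothetical: "report_public", "report_internal", "episode"
    value_type: str              # hypothetical: "confirmed", "confirmed_probable"
    sub_region_1: Optional[str] = None   # health region
    sub_region_2: Optional[str] = None   # sub-health region
    population: str = "all"      # hypothetical: "residents", "non_residents", ...


def comparable(a: Observation, b: Observation) -> bool:
    """True only if two observations share the same definitions."""
    return (a.name, a.date_type, a.value_type, a.population) == (
        b.name, b.date_type, b.value_type, b.population
    )


# Illustrative values only, not real data.
on = Observation("cases", "ON", datetime.date(2021, 1, 1), 100, "episode", "confirmed")
bc = Observation("cases", "BC", datetime.date(2021, 1, 1), 50, "report_internal", "confirmed_probable")
```

With the definitions recorded next to each value, a check such as `comparable(on, bc)` can refuse to combine series that use different date or case definitions instead of mixing them silently.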
Other considerations
- How to combine all these data, especially if different sources have different sub-group information? (Some sources provide demographic data, some provide vaccination status, and some provide multiple categories of information but use incompatible groupings, such as different age bands.)
- How to deal with cases missing health region information ("not reported") and resident out-of-province cases (risk of double counting)?
- How to handle definitions that change mid-time series when no retroactive correction is available? (The Ontario testing definition may be an example of this.)
- How to handle data about repatriated travellers? (We have case and testing data; recovered and mortality can be inferred.)
- Should "impossible data" (e.g., negative change in cases) be automatically corrected in the final product?
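
Whatever the answer to the last question, "impossible data" first has to be detected. As a sketch of one option (flagging rather than correcting), the hypothetical helper `flag_negative_changes` below reports every point where a cumulative series decreases and leaves the data untouched. The numbers are made up for illustration.

```python
def flag_negative_changes(cumulative):
    """Return (index, change) pairs where a cumulative series decreases."""
    flagged = []
    for i in range(1, len(cumulative)):
        change = cumulative[i] - cumulative[i - 1]
        if change < 0:
            flagged.append((i, change))
    return flagged


# Illustrative cumulative case counts, not real data.
series = [10, 15, 14, 20]
flags = flag_negative_changes(series)
print(flags)  # [(2, -1)]
```

Keeping detection separate from correction would let the final product publish the raw values alongside the flags, leaving the decision about correction to the user.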

And of course the most important question of all: wide or long format (or multiple separate datasets instead of long format)?
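
To make the trade-off concrete, here is a small sketch of the same made-up data in both shapes, using only the standard library. The column names (`region`, `date`, `name`, `value`) and the helper `wide_to_long` are assumptions for illustration, not part of any agreed standard.

```python
# Wide format: one row per region and date, one column per value type.
# Illustrative values only, not real data.
wide_rows = [
    {"region": "ON", "date": "2021-01-01", "cases": 100, "tests": 2000},
    {"region": "BC", "date": "2021-01-01", "cases": 50, "tests": None},
]


def wide_to_long(rows, id_cols):
    """Reshape wide rows into long rows, one row per non-missing value."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key in id_cols or value is None:
                continue  # skip identifier columns and missing values
            long_row = {col: row[col] for col in id_cols}
            long_row["name"] = key
            long_row["value"] = value
            long_rows.append(long_row)
    return long_rows


long_rows = wide_to_long(wide_rows, id_cols=("region", "date"))
for r in long_rows:
    print(r)
```

In this sketch the long format needs no placeholder for the value BC does not report, and extra descriptors (date type, population, and so on) could be added as columns without widening every row. The wide format is easier to read at a glance but needs a new column for each additional value type.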

