ccodwg / CovidDataStandard

A data and metadata standard for COVID-19 data in Canada.

DISCUSSION: What do you want from a COVID data standard? #12

Open jeanpaulrsoucy opened 2 years ago

jeanpaulrsoucy commented 2 years ago

General discussion on what you want from a COVID-19 data standard (esp. as applied to Canadian data).

It could be useful to design a survey for potential data users, asking questions like:

nick-gibb commented 2 years ago

I think FAIR compliance is a requirement (https://www.go-fair.org/fair-principles/). I'm wondering if there are already efforts here that we can piggyback on or adopt for our own needs.

jeanpaulrsoucy commented 2 years ago

I think FAIR compliance is a requirement (https://www.go-fair.org/fair-principles/). I'm wondering if there are already efforts here that we can piggyback on or adopt for our own needs.

Definitely. Are you able to pull the parts that are most immediately relevant to our endeavour and post them here?

ericeasthope commented 2 years ago

General discussion on what you want from a COVID-19 data standard (esp. as applied to Canadian data).

Speaking from a visualization design perspective, especially for web applications, I must emphasize the importance of coalescing data into easily manipulated formats like JSON, CSV/TSV, or similar. Whether to use JSON or CSV/TSV largely depends on the dimensionality of the data. For example, matrices/tensors don't have meaningful keys (or labels, if you will), so the key-value structure of JSON is poorly suited to them and CSV/TSV is the better fit. However, if we're counting, say, cases in regional health districts, JSON can encode richer label-like structures.
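To make the JSON versus CSV/TSV distinction concrete, here is a minimal sketch using only the Python standard library. The matrix values, district names, and field names (`province`, `districts`, `name`, `cases`) are made up for illustration and are not taken from any real dataset or from the standard itself.

```python
import csv
import io
import json

# Hypothetical 2x3 matrix: meaning comes from position, so there are no natural keys.
matrix = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]

# Write the matrix as CSV into an in-memory buffer (one row per line).
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(matrix)
matrix_csv = buffer.getvalue()

# Hypothetical case counts per health district: nested, labelled records fit JSON well.
cases = {
    "province": "Example Province",
    "districts": [
        {"name": "District A", "cases": 12},
        {"name": "District B", "cases": 7},
    ],
}
cases_json = json.dumps(cases, indent=2)

print(matrix_csv)
print(cases_json)
```

The CSV output is just rows of numbers, which is all a matrix needs, while the JSON output keeps every count attached to the name of the district it belongs to.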

I mention this primarily in the context of data transferability. There should be no ambiguity once I've renamed a few data files, split them into folders, moved the folders around, etc., as to which data belongs where and how I should interpret values. Correspondingly, rows/columns (depending on your data) must be named: none of this "refer to the index in our README for corresponding labels" malarkey. In Python, pandas DataFrames make it relatively easy to attach metadata like headers, and they export readily to JSON and CSV/TSV.
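As a rough sketch of what I mean by named columns travelling with the data, assuming pandas is installed: the column names (`health_region`, `date`, `cases`) and the values below are hypothetical, chosen only for illustration.

```python
import io
import json

import pandas as pd

# Hypothetical records; the column names are illustrative, not part of any agreed standard.
df = pd.DataFrame(
    {
        "health_region": ["Region A", "Region B"],
        "date": ["2020-04-01", "2020-04-01"],
        "cases": [12, 7],
    }
)

# The header row is written into the CSV itself, so labels survive renaming or moving the file.
csv_text = df.to_csv(index=False)

# orient="records" repeats the column names as keys in every JSON object.
json_text = df.to_json(orient="records")

# Reading the CSV back recovers the same column names with no external lookup.
round_trip = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(json_text)
print(list(round_trip.columns))
```

Either export can be interpreted on its own, without consulting a README for what each column means.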

This is a sharp take, but I've found that if row/column names aren't encoded directly in the data, either as key-value pairs or as headers, those rows/columns are useless. It can get worse: data gets lost along the way to a nameless entropy. Effectively, unnamed data is noise, and we want to minimize noise to ensure the data narratives we tell are salient ones. Hope this helps!