STATE OF THE PROJECT

jeanpaulrsoucy commented 2 years ago

This issue should be used for tracking and discussing the overall state of the project, FAIR COVID-19 Data for 🇨🇦.

Project goal

The goal of this project is to FAIRify the data stored in the Canadian COVID-19 Data Archive.

Slide deck

Need to create an evolving "State of the Project" slide deck (e.g., using HackMD).

Progress to date

The Canadian COVID-19 Data Archive is a publicly accessible repository storing daily snapshots of hundreds of datasets, web pages and reports related to the COVID-19 pandemic in Canada, operating since August 2020. Without the Archive, many of these datasets would already be lost. As of February 2022, the Archive contains over 160 GB of data spanning about 600 datasets. Data collection is automated via a custom Python package (archivist) and a set of GitHub Actions (Covid19CanadaBot). All data are available via a publicly exposed S3 bucket, documented in a data catalogue on GitHub and searchable using a basic data explorer.

The Canadian COVID-19 Data Archive is also being used to support the ongoing maintenance of the COVID-19 Canada Open Data Working Group's two datasets: Covid19Canada ("Epidemiological Data from the COVID-19 Outbreak in Canada") and CovidTimelineCanada ("Timeline of COVID-19 in Canada").

Tasks in progress

Data tool and API

The main output of this project will be a publicly accessible data tool allowing users to find, manipulate and export relevant Canadian COVID-19 data in a common, FAIR format. The simple front-end tool (#8) will be supported by a robust, well-documented API (#9) for more advanced users.

Rich metadata

The datasets available in the Archive must be supported by rich metadata allowing users to find, understand and apply each of the derived datasets (#7, #13, #4), including understanding data licenses (#5, #6).

Workflows for creating FAIR data

We will need to write custom workflows to process each dataset (#15) into a common, FAIR data format (#10).
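
As a starting point for discussion, a per-dataset workflow might look roughly like the Python sketch below, which reshapes one wide CSV snapshot into long-format rows. Everything in it is an assumption made for illustration: the five-column schema, the `to_common_format` function, its parameters and the example data are hypothetical and do not reflect decisions made in #10 or #15.

```python
import csv
import io
from datetime import date

# Hypothetical common schema: one row per (region, date, metric) observation.
COMMON_FIELDS = ["dataset_id", "region", "date", "metric", "value"]


def to_common_format(raw_csv, dataset_id, region, date_column, metric_columns):
    """Reshape one wide CSV snapshot into long-format rows.

    metric_columns maps source column names to common metric names.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(raw_csv)):
        # Normalize the date to ISO 8601; fromisoformat raises on bad input.
        iso_date = date.fromisoformat(record[date_column].strip()).isoformat()
        for source_name, metric in metric_columns.items():
            raw_value = record[source_name].strip()
            if raw_value == "":
                continue  # skip missing observations rather than guessing
            values = [dataset_id, region, iso_date, metric, int(raw_value)]
            rows.append(dict(zip(COMMON_FIELDS, values)))
    return rows


# Made-up example snapshot, not real Archive data.
example_csv = "Date,Total Cases,Total Deaths\n2022-02-01,100,2\n2022-02-02,110,\n"
example_rows = to_common_format(
    example_csv,
    dataset_id="example-dataset",
    region="ON",
    date_column="Date",
    metric_columns={"Total Cases": "cases", "Total Deaths": "deaths"},
)
```

Keeping the column mapping as a parameter means each dataset would only need its own small configuration rather than its own parsing code, though real sources will likely need dataset-specific cleaning before this step.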

A more sustainable process

Another goal of the project is to be sustainable, which begins with improving the sustainability of the Canadian COVID-19 Data Archive in three areas: maintaining the automation (#2), the list of datasets (#4) and the underlying infrastructure (#3).

colliand commented 2 years ago

While reviewing the issues, I am reminded that it is wise to put socks on before putting on shoes. I suggest that the anchor comment above be enriched with a phasing of the main tasks. A Gantt chart (built with the Mermaid.js support newly offered by GitHub) may be helpful. To the extent possible, I suggest we strive to address the transitions between phases so that some work on later stages of the project can start while work is underway on earlier stages. Making this work will likely require some waterfall-style specifications to define the glue that links the components together. You can knit socks in the evening while working on shoes in the shop during the day...