ccodwg / FAIRCovid19DataProject

A repository to organize the FAIR COVID-19 Data for 🇨🇦 project. Led by the COVID-19 Canada Open Data Working Group and supported by CANMOD.
https://whathappened.coronavirus.icu/

FAIR DATA: Pipeline for processing archived datasets into a common format #15

Open jeanpaulrsoucy opened 2 years ago

jeanpaulrsoucy commented 2 years ago

Once a common output data format is established (#10), a large number of workflows, one per dataset, will need to be developed to transform raw, archived data into FAIR data.

The exact nature of these data workflows has not yet been decided, but will likely include one or more of: SQL, Python, R and related tools.
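As a rough illustration of the "one workflow per dataset" idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the column list in `COMMON_COLUMNS`, the dataset identifier `example_on_cases`, and the raw column names `Date` and `Total cases` are all hypothetical, and the real output format is still to be settled in #10. The same pattern could be written in R or SQL.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List

# Hypothetical common output columns; the real format is discussed in #10.
COMMON_COLUMNS = ["name", "region", "date", "value"]

Row = Dict[str, str]
Transform = Callable[[str], Iterator[Row]]

# Registry mapping a dataset identifier to its transform function.
WORKFLOWS: Dict[str, Transform] = {}


def workflow(dataset_id: str) -> Callable[[Transform], Transform]:
    """Register a transform function for one archived dataset."""
    def register(func: Transform) -> Transform:
        WORKFLOWS[dataset_id] = func
        return func
    return register


@workflow("example_on_cases")  # hypothetical dataset identifier
def example_on_cases(raw_text: str) -> Iterator[Row]:
    """Transform a hypothetical raw CSV with columns 'Date' and 'Total cases'."""
    for rec in csv.DictReader(io.StringIO(raw_text)):
        yield {
            "name": "cases",
            "region": "ON",
            "date": rec["Date"],
            "value": rec["Total cases"],
        }


def run_workflow(dataset_id: str, raw_text: str) -> List[Row]:
    """Run one workflow and check that every row matches the common columns."""
    rows = list(WORKFLOWS[dataset_id](raw_text))
    for row in rows:
        if list(row) != COMMON_COLUMNS:
            raise ValueError(f"{dataset_id}: unexpected columns {list(row)}")
    return rows
```

Keeping each dataset's transform as a small registered function means a new archived dataset only needs one new function, while the column check in `run_workflow` stays shared across all of them.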

These workflows require a few different features:

Since May 2021, automation has been used to maintain the Covid19Canada and CovidTimelineCanada datasets. This has involved writing and maintaining a significant amount of R code to process dozens of existing datasets. See the R packages Covid19CanadaETL, Covid19CanadaData and Covid19CanadaDataProcess for this existing and ongoing effort.

colliand commented 2 years ago

Establishing the "common output data format" might involve ongoing discussions with a variety of stakeholders. An open discussion on data formats with all these stakeholders will likely be difficult to convene and it will be challenging to arrive at technical standards.

FAIR provides some scaffolding for how data should be structured to accelerate collaboration and discovery. Our project, ideally in collaboration with other CANMOD personnel, should strive to finesse these obstacles to standardization. CCODWG data streams are being used already by policymakers. I suggest that our project should make opinionated choices on the best paths forward toward FAIRification while trying to avoid coding ourselves into a corner. Technical debt in a project that delivers value is better than spinning our wheels in governance and standardization meetings...

jeanpaulrsoucy commented 2 years ago

Agreed @colliand. I think we can use a simple provisional standard that looks something like the below in the meantime. There is always time to revise in response to feedback down the road...

This is adapted from the existing format for CovidTimelineCanada, which itself is inspired by various sources including the Google community mobility reports. It's also similar to the existing format for Covid19Canada, which as you mentioned is already in wide use.