ccodwg / FAIRCovid19DataProject

A repository to organize the FAIR COVID-19 Data for 🇨🇦 project. Led by the COVID-19 Canada Open Data Working Group and supported by CANMOD.
https://whathappened.coronavirus.icu/

SUSTAINABILITY: Maintaining the list of datasets for the Canadian COVID-19 Data Archive #4

Open jeanpaulrsoucy opened 2 years ago

jeanpaulrsoucy commented 2 years ago

The list of active and inactive datasets in the Canadian COVID-19 Data Archive, along with all associated metadata, is given in datasets.json. It has hundreds of entries.

This is also the data format used by archivist and Covid19CanadaArchive to produce the nightly automated data updates (#2). It is also used to keep the COVID-19 Canada Open Data Working Group datasets updated (see Covid19CanadaETL, Covid19Canada and CovidTimelineCanada). Every dataset is identified by a unique version 4 UUID.
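To make the UUID convention concrete, here is a minimal Python sketch of how new entries could be given a version 4 UUID and how a list of entries could be checked for invalid or duplicate identifiers. The entry layout (a flat list of objects with `uuid`, `name`, `url` and `active` keys) and the helper names are assumptions for illustration only; the real datasets.json schema may differ.

```python
import uuid
from collections import Counter


def is_uuid4(value):
    """Return True if value is a string that parses as a version 4 UUID."""
    if not isinstance(value, str):
        return False
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return False
    # .version is None unless the UUID uses the RFC 4122 variant
    return parsed.version == 4


def new_entry(name, url, active=True):
    """Build a hypothetical dataset entry with a freshly generated UUID."""
    return {"uuid": str(uuid.uuid4()), "name": name, "url": url, "active": active}


def check_entries(entries):
    """Return (invalid_ids, duplicate_ids) for a list of entry dicts."""
    ids = [entry.get("uuid") for entry in entries]
    invalid = [i for i in ids if not is_uuid4(i)]
    counts = Counter(i for i in ids if isinstance(i, str))
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return invalid, duplicates
```

A check like this could run in CI on every change to the list, so that a copy-pasted entry with a reused UUID is caught before the nightly update picks it up.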

I maintain this list manually, based on personal knowledge of Canadian COVID-19 datasets as well as tips from data users in the form of personal communications or GitHub issues. Naturally, this is work-intensive, and it is not always obvious when a new dataset is available or an old dataset has been retired, leading to (potential) loss of the historical record.

Main areas of improvement

These are the main areas where I see room to make maintenance of the dataset list more sustainable:

It would be helpful to find precedents for a community-maintained dataset archive/scraping list.

jeanpaulrsoucy commented 2 years ago

A precedent for community-maintained data scraping/scrapers: the Police Data Accessibility Project. They even have some kind of Python GUI for helping users write scrapers (note: I haven't checked this out yet).