Open jeanpaulrsoucy opened 2 years ago
A precedent for community-maintained data scraping/scrapers: the Police Data Accessibility Project. They even have some kind of Python GUI for helping users write scrapers (note: I haven't checked this out yet).
The list of active and inactive datasets in the Canadian COVID-19 Data Archive, along with all associated metadata, is given in `datasets.json`. It has hundreds of entries. This is also the data format used by `archivist` and `Covid19CanadaArchive` to produce the nightly automated data updates (#2). It is also used to keep the COVID-19 Canada Open Data Working Group datasets updated (see `Covid19CanadaETL`, `Covid19Canada` and `CovidTimelineCanada`). Every dataset is identified by a unique version 4 UUID.
This list is maintained manually by the maintainer (me) based on personal knowledge of Canadian COVID-19 datasets as well as tips from data users in the form of personal communications or GitHub issues. Naturally, this is work-intensive, and it is not always obvious when a new dataset becomes available or an old dataset has been retired, which can lead to loss of the historical record.
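
As a rough illustration of the UUID convention mentioned above, here is a minimal Python sketch that creates an identifier for a new entry and checks that an existing identifier really is a version 4 UUID. The helper names (`new_dataset_entry`, `is_uuid4`) and the two-key entry layout are assumptions made for this example; they are not taken from `datasets.json` or `archivist`.

```python
import uuid


def is_uuid4(value) -> bool:
    """Return True if `value` is a canonical, lowercase-insensitive version 4 UUID string."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        # Not a string, or not parseable as a UUID at all.
        return False
    # Require version 4 and the usual hyphenated form.
    return parsed.version == 4 and str(parsed) == value.lower()


def new_dataset_entry(url: str) -> dict:
    """Build a hypothetical minimal entry with a freshly generated version 4 UUID."""
    return {"uuid": str(uuid.uuid4()), "url": url}


if __name__ == "__main__":
    entry = new_dataset_entry("https://example.org/data.csv")
    print(entry["uuid"], is_uuid4(entry["uuid"]))
```

Because version 4 UUIDs are random, contributors could generate identifiers locally without coordinating with the maintainer, and a check like `is_uuid4` could catch hand-typed or copy-pasted identifiers.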
Main areas of improvement
These are the main areas I see for improving the sustainability of dataset list maintenance:
- `utils.py` currently contains two commonly used functions: `retire_dataset`, which moves a dataset from "active" to "inactive" in the list of datasets (`datasets.json`), and `list_inactive_datasets`, which lists datasets that have produced identical files for a certain number of days, suggesting the dataset may no longer be updated and can be safely moved to "inactive" status.
- Changes to the dataset list (`datasets.json`) must be validated before they are accepted so that they do not disrupt tools that rely on the list of datasets.
- Any new way of maintaining the dataset list (`datasets.json`) would have to be made compatible with the existing tools that use this file, such as the nightly archive update process and `Covid19CanadaETL`.
- Perhaps `datasets.json` could be converted to some existing format/standard for this sort of data?

It would be helpful to find precedents for a community-maintained dataset archive/scraping list.
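
To make the validation and inactivity ideas above more concrete, here is a minimal Python sketch. It assumes a simplified layout in which the file has top-level `active` and `inactive` lists and each entry has at least `uuid` and `url` keys. That layout, the `REQUIRED_KEYS` set, and the function names `validate_dataset_list` and `looks_inactive` are all assumptions for illustration; the real `datasets.json` and `utils.py` may be organized differently.

```python
import uuid

# Hypothetical minimal schema; the real file likely carries more metadata.
REQUIRED_KEYS = {"uuid", "url"}


def validate_dataset_list(data: dict) -> list:
    """Return a list of human-readable problems; an empty list means no problems found."""
    errors = []
    seen = set()
    for status in ("active", "inactive"):
        entries = data.get(status)
        if not isinstance(entries, list):
            errors.append(f"'{status}' must be a list")
            continue
        for i, entry in enumerate(entries):
            where = f"{status}[{i}]"
            if not isinstance(entry, dict):
                errors.append(f"{where}: entry must be an object")
                continue
            missing = REQUIRED_KEYS - entry.keys()
            if missing:
                errors.append(f"{where}: missing keys {sorted(missing)}")
                continue
            try:
                version = uuid.UUID(entry["uuid"]).version
            except (ValueError, AttributeError, TypeError):
                version = None
            if version != 4:
                errors.append(f"{where}: 'uuid' is not a valid version 4 UUID")
            key = str(entry["uuid"])
            if key in seen:
                errors.append(f"{where}: duplicate uuid {key}")
            seen.add(key)
    return errors


def looks_inactive(daily_hashes: list, days: int = 7) -> bool:
    """True if the last `days` archived files (oldest first) all share one hash.

    `days` is expected to be a positive integer.
    """
    recent = daily_hashes[-days:]
    return len(recent) == days and len(set(recent)) == 1
```

A check like `validate_dataset_list` could run in continuous integration on every pull request that touches the list, so that a malformed contribution is rejected before it reaches the nightly update. `looks_inactive` only flags candidates; a person would still need to confirm that the dataset has actually been retired before moving it.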