Presently, data collection for the Canadian COVID-19 Data Archive (Covid19CanadaArchive) is managed through a combination of Python scripts, the self-developed Python package archivist, and a series of GitHub Actions workflows run by Covid19CanadaBot. A basic flowchart of the current process (taken from the aforementioned repository) may be seen below.
At present, manual intervention is occasionally required to ensure data preservation, such as when a dataset and/or website fails to load correctly (this is particularly common for websites that rely heavily on JavaScript).
Main areas of improvement
These are the main areas where I see opportunities to improve the sustainability of the Archive's data collection process:
Continued development of the archivist package
This package was spun off from the original Covid19CanadaArchive repository because I believe it may have a useful life as a scraping framework beyond this specific project.
Adding additional functionality to reduce manual input as much as possible.
Improve the portability of the data collection process in case the computing environment changes (e.g., if we move away from using GitHub Actions).
Improved index
The current index is built nightly using indexer.py and served using a (currently undocumented) API via Covid19CanadaAPI.
Computing environment for nightly automated data collection
As mentioned above, the data collection currently happens automatically through a series of GitHub Actions workflows in the Covid19CanadaBot repository, which has been fairly reliable.
However, there are some caveats:
"Cron" scheduling of GitHub actions does not work well, so GitHub actions are "manually run" through calls to the GitHub API via a cron job on an external server. This works very well.
GitHub Actions is IP-blocked by one website (Montreal public health), so scraping of a handful of files is run separately on the aforementioned external server via a cron job.
An external environment may offer more control than GitHub Actions does.
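For reference, triggering a workflow this way only takes a single authenticated call to the GitHub REST API's workflow_dispatch endpoint. The sketch below shows the general shape of such a call in Python; the repository owner, workflow file name, and branch are placeholders rather than the values actually used by Covid19CanadaBot.

```python
# Illustrative only: trigger a workflow_dispatch event via the GitHub REST API.
# The owner, workflow file name, and branch are placeholders, not the actual
# values used by Covid19CanadaBot.
import os
import requests

OWNER = "example-org"          # placeholder organization/user
REPO = "Covid19CanadaBot"      # repository named above
WORKFLOW_FILE = "collect.yml"  # placeholder workflow file name
BRANCH = "main"                # placeholder branch

resp = requests.post(
    f"https://api.github.com/repos/{OWNER}/{REPO}/actions/workflows/{WORKFLOW_FILE}/dispatches",
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    },
    json={"ref": BRANCH},
)
resp.raise_for_status()  # a 204 response means the workflow run was queued
```

Running a call like this from a cron job on the external server is what "manually run" refers to above.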
A web-based management interface
A web-based management interface for manual interventions/logs/etc. might make it easier for a team to manage, rather than the current maintainer (me) fiddling with a local CLI when things need to be fixed.
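To make the idea concrete, the sketch below shows one possible shape for such an interface: a minimal Flask app exposing recent run results so failures can be reviewed in a browser. Every route and helper name here is hypothetical; nothing like this exists in the current codebase.

```python
# Purely hypothetical sketch of a minimal web-based management interface.
# All routes and helpers are invented for illustration.
from flask import Flask, jsonify

app = Flask(__name__)

def load_recent_log_entries():
    # Placeholder: a real implementation would read the log output produced
    # by the nightly archivist run (e.g., from a file or database).
    return [
        {"dataset": "example-dataset", "status": "failed", "reason": "timeout"},
    ]

@app.route("/logs")
def logs():
    """Return recent run results so failures can be reviewed in a browser."""
    return jsonify(load_recent_log_entries())

if __name__ == "__main__":
    app.run(port=8080)
```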
Development of the archivist package
A few specific ideas for the development of the archivist package:
A better way of handling dynamic URLs (e.g., datasets for which the URL changes each day and thus a program must be executed first to retrieve the current URL)
Currently, the list of datasets stores a set of Python (url_fun_python) and R (url_fun_r) commands as a JSON string that may be executed in order to retrieve the current URL (see the first sketch after this list)
More robust re-run capability to capture datasets that fail/are temporarily not available at standard run time
More robust handling of sites requiring JavaScript/Selenium (e.g., a better way of handling custom code for datasets that require user interaction such as clicking a different tab; something like playwright codegen might be useful here, and the second sketch after this list illustrates the idea)
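To illustrate the dynamic-URL idea, the first sketch below shows a simplified dataset entry whose url_fun_python field is executed to produce the current download URL. The entry, the field contents, and the exec-based evaluation are all simplified placeholders; the actual handling in archivist may differ.

```python
# Illustrative sketch: resolve a dynamic URL by executing the stored
# url_fun_python command string. The dataset entry below is a placeholder.
import json

dataset = json.loads("""
{
  "id": "example-dataset",
  "url_fun_python": "import datetime; url = 'https://example.com/data_' + datetime.date.today().isoformat() + '.csv'"
}
""")

def resolve_url(entry):
    """Execute the stored snippet and return the URL it defines."""
    scope = {}
    exec(entry["url_fun_python"], scope)  # the snippet assigns a variable `url`
    return scope["url"]

print(resolve_url(dataset))  # e.g., https://example.com/data_2024-01-01.csv
```

The second sketch shows how Playwright's synchronous API could handle a page that needs user interaction (clicking a tab) before the rendered content is captured. The URL and selector are placeholders; playwright codegen can record this kind of interaction automatically.

```python
# Illustrative sketch: use Playwright to click a tab before capturing a
# JavaScript-heavy page. The URL and selector are placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/dashboard")  # placeholder URL
    page.click("text=Cases by region")          # placeholder tab label
    page.wait_for_load_state("networkidle")
    html = page.content()                       # rendered HTML to archive
    browser.close()
```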