Implement data quality monitoring

jzohrab commented 4 years ago

Description

Currently, we don't know whether sources are actually working. Source URLs can disappear or change, and we have no way to tell which sources are working and which are not.

This can be further broken down into two types of monitoring/reporting:

DONE: Data source check: Each data source (URL) may be missing or present, and its data may or may not have been scraped successfully.
Data quality check: A given data source may exist and be scraped successfully, but the results may still be nonsense. We can likely catch some of these issues.

Possible solutions

DONE: Data source checks

Done (initial implementation) at https://api.staging.covidatlas.com/status?format=html.

Data quality checks

Some examples of data quality issues that we can likely monitor and catch early:

https://docs.google.com/document/d/1vwW6XiCGpQPbhMxMhLISwAPbFemr4UIBV8qIqVxdJBM/edit#

These can be broken down into separate issues, but two that would be useful to implement soon are "Missing Data" and "Zero Data". Some of these checks would require timeseries generation for sources.
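As a rough illustration of what the "Missing Data" and "Zero Data" checks might look like once a timeseries exists, here is a minimal Python sketch. Everything in it is an assumption for discussion: the timeseries shape (a dict mapping dates to count dicts), the `cases` field name, the function names, and the working definitions themselves ("missing" meaning a gap between the first and last reported date, "zero" meaning a 0 reported after an earlier positive count). The linked document is the authority on what each check should actually cover.

```python
from datetime import date, timedelta


def find_missing_dates(timeseries):
    """Return dates between the first and last entry that have no record.

    `timeseries` is a hypothetical dict of date -> dict of counts.
    """
    if not timeseries:
        return []
    days = sorted(timeseries)
    missing = []
    current = days[0]
    while current <= days[-1]:
        if current not in timeseries:
            missing.append(current)
        current += timedelta(days=1)
    return missing


def find_zero_after_nonzero(timeseries, field="cases"):
    """Return dates where `field` is 0 although an earlier date was positive."""
    flagged = []
    seen_positive = False
    for day in sorted(timeseries):
        value = timeseries[day].get(field)
        if value is None:
            continue  # an absent field is a different problem than a zero
        if value > 0:
            seen_positive = True
        elif value == 0 and seen_positive:
            flagged.append(day)
    return flagged


# Example: April 3 is absent, and April 4 drops to zero after positive counts.
sample = {
    date(2020, 4, 1): {"cases": 5},
    date(2020, 4, 2): {"cases": 7},
    date(2020, 4, 4): {"cases": 0},
}
missing_days = find_missing_dates(sample)
zero_days = find_zero_after_nonzero(sample)
```

Both functions only flag candidates; whether a flagged date is a real problem (for example, a source that legitimately skips weekends) would still need a human or a per-source rule.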

Slack reporting

I am not a huge fan of reporting failures into Slack, as such channels can rapidly become noise. However, a daily summary of the current status could be useful, with a link to an actual status reporting page. For example:

6 sources could not be crawled.
10 sources failed during scrape.
8 data quality issues.
See dashboard (link)
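A summary like the one above could be assembled from three lists of source keys. This is only a sketch: the function name and its parameters are hypothetical, and actually posting the text to Slack is left out on purpose.

```python
def build_daily_summary(crawl_failed, scrape_failed, quality_issues, dashboard_url):
    """Build the daily summary text from lists of failing source keys.

    All parameter names are illustrative; each list holds one entry per problem.
    """
    lines = [
        f"{len(crawl_failed)} sources could not be crawled.",
        f"{len(scrape_failed)} sources failed during scrape.",
        f"{len(quality_issues)} data quality issues.",
        f"See dashboard ({dashboard_url})",
    ]
    return "\n".join(lines)


# Example call with placeholder source keys.
summary = build_daily_summary(
    crawl_failed=["us-or", "us-ca-blah"],
    scrape_failed=["blah"],
    quality_issues=[],
    dashboard_url="https://api.staging.covidatlas.com/status?format=html",
)
```

Keeping the message to counts plus one link is what keeps the channel quiet; the per-source detail stays on the status page.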

We could report the full status on the page if we wished:

crawl failed:
- us-or
- us-ca-blah

scrape failed:
- blah
- aoeu
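The grouped listing above could be produced from a flat list of per-source states. The sketch below assumes a hypothetical input of `(source_key, state)` pairs and the state labels `"ok"`, `"crawl failed"`, and `"scrape failed"`; none of these names come from the existing status code.

```python
def render_full_status(statuses):
    """Group failing sources by state and render them as plain-text sections.

    `statuses` is an iterable of (source_key, state) pairs; sources whose
    state is "ok" are omitted from the report.
    """
    groups = {}
    for key, state in statuses:
        if state == "ok":
            continue
        groups.setdefault(state, []).append(key)

    sections = []
    for state, keys in groups.items():
        section_lines = [f"{state}:"] + [f"- {key}" for key in keys]
        sections.append("\n".join(section_lines))
    return "\n\n".join(sections)


# Example using the placeholder keys from the listing above.
report = render_full_status([
    ("us-or", "crawl failed"),
    ("us-ca-blah", "crawl failed"),
    ("us-wa", "ok"),
    ("blah", "scrape failed"),
    ("aoeu", "scrape failed"),
])
```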

ryanblock commented 4 years ago

This looks solid and I think accurately describes where the status system is going!

jzohrab commented 4 years ago

Opened https://github.com/covidatlas/li/pull/238 to address the "data source checks". "Data quality checks" belongs in its own issue; I will move it there once #238 is merged.

jzohrab commented 4 years ago

Source checks implemented, https://api.staging.covidatlas.com/status?format=html is up. :tada:
