covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

Implement data quality monitoring #234

Open jzohrab opened 4 years ago

jzohrab commented 4 years ago

Description.

Currently, we can't tell whether sources are actually working. Source URLs can disappear or change, and we have no visibility into which sources are working and which are not.

This can be broken down into two types of monitoring/reporting: data source checks and data quality checks.

Possible solutions

DONE: Data source checks

Done (initial implementation) at https://api.staging.covidatlas.com/status?format=html.

Data quality checks

Some examples of data quality issues that we can likely monitor and catch early:

https://docs.google.com/document/d/1vwW6XiCGpQPbhMxMhLISwAPbFemr4UIBV8qIqVxdJBM/edit#

These can be broken down into separate issues, but two that would be useful to implement soon are "Missing Data" and "Zero Data". Some of these checks would require timeseries generation for sources.
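
As a rough illustration of what "Missing Data" and "Zero Data" checks might look like, here is a minimal sketch. It assumes a per-source timeseries shaped as a mapping from ISO date to a dict of metrics, and it assumes the checked field is a cumulative count. The data shape, the `cases` field name, and both function names are hypothetical, not Li's actual data model.

```python
from datetime import date, timedelta


def find_missing_dates(timeseries, start, end):
    """Return ISO dates in [start, end] that have no entry in the timeseries."""
    missing = []
    current = start
    while current <= end:
        key = current.isoformat()
        if key not in timeseries:
            missing.append(key)
        current += timedelta(days=1)
    return missing


def find_zero_dates(timeseries, field="cases"):
    """Return dates where `field` is zero after an earlier non-zero value.

    Assumes `field` is cumulative, so a zero following a positive value
    is suspicious. Leading zeros (before any data arrives) are ignored.
    """
    zero_dates = []
    seen_nonzero = False
    # ISO date strings sort chronologically.
    for key in sorted(timeseries):
        value = timeseries[key].get(field)
        if value is None:
            continue
        if value == 0 and seen_nonzero:
            zero_dates.append(key)
        elif value > 0:
            seen_nonzero = True
    return zero_dates


# Hypothetical sample data for a single source.
sample = {
    "2020-06-01": {"cases": 10},
    "2020-06-02": {"cases": 12},
    "2020-06-04": {"cases": 0},
}

print(find_missing_dates(sample, date(2020, 6, 1), date(2020, 6, 4)))
print(find_zero_dates(sample))
```

With the sample above, the first check flags the gap on 2020-06-03 and the second flags the zero on 2020-06-04.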

Slack reporting

I am not a huge fan of reporting failures into Slack, as such channels can rapidly become noise. However, a daily summary of the current status could be useful, with a link to an actual status reporting page, e.g.:

6 sources could not be crawled.
10 sources failed during scrape.
8 data quality issues.
See dashboard (link)
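
A summary like the one above could be assembled from the three failure lists. The sketch below only builds the message text; posting it to Slack is left out. The function names, the argument shapes, and the `example.com` dashboard URL are all assumptions for illustration.

```python
def pluralize(count, singular, plural):
    """Format a count with the matching singular or plural noun."""
    return f"{count} {singular if count == 1 else plural}"


def build_daily_summary(crawl_failed, scrape_failed, quality_issues, dashboard_url):
    """Build the daily summary text from lists of failing source keys/issues."""
    lines = [
        f"{pluralize(len(crawl_failed), 'source', 'sources')} could not be crawled.",
        f"{pluralize(len(scrape_failed), 'source', 'sources')} failed during scrape.",
        f"{pluralize(len(quality_issues), 'data quality issue', 'data quality issues')}.",
        f"See dashboard ({dashboard_url})",
    ]
    return "\n".join(lines)


# Hypothetical inputs; the URL is a placeholder, not the real dashboard.
print(build_daily_summary(
    crawl_failed=["us-or", "us-ca-blah"],
    scrape_failed=["blah"],
    quality_issues=[],
    dashboard_url="https://example.com/status",
))
```

Keeping the message to counts plus a link is what keeps the channel quiet; the details live on the status page.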

We could report the full status on the page if we wished:

crawl failed:
- us-or
- us-ca-blah

scrape failed:
- blah
- aoeu
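
The per-category listing above could be rendered from a mapping of category label to failing source keys. This is a minimal sketch; the function name and input shape are hypothetical.

```python
def render_status_report(failures):
    """Render failing sources grouped by category.

    `failures` maps a category label (e.g. "crawl failed") to a list of
    source keys. Empty categories are skipped; keys are sorted for
    stable output.
    """
    sections = []
    for label, sources in failures.items():
        if not sources:
            continue
        lines = [f"{label}:"] + [f"- {key}" for key in sorted(sources)]
        sections.append("\n".join(lines))
    return "\n\n".join(sections)


# Hypothetical input using the placeholder source keys from the example.
print(render_status_report({
    "crawl failed": ["us-or", "us-ca-blah"],
    "scrape failed": ["blah", "aoeu"],
}))
```
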

ryanblock commented 4 years ago

This looks solid and I think accurately describes where the status system is going!

jzohrab commented 4 years ago

Opened https://github.com/covidatlas/li/pull/238 to address the "data source checks". "Data quality checks" belongs in its own issue; I will move it there once #238 is merged.

jzohrab commented 4 years ago

Source checks implemented, https://api.staging.covidatlas.com/status?format=html is up. :tada: