GoogleCloudPlatform / covid-19-open-data

Datasets of daily time-series data related to COVID-19 for over 20,000 distinct locations around the world.
Apache License 2.0

Automatic sanity check of data, flagging out-of-range or suspiciously large changes #452

Open geening opened 3 years ago

geening commented 3 years ago

I propose a system that, on each run of the pipeline, sanity-checks data for values that fall outside a reasonable range or that change by a suspiciously large amount from one run to the next. Ideally the checks would apply to data at every stage of the pipeline: input sources, intermediate data, and generated data (output cells indexed by table/variable/key/date). We could start wherever this is easiest to implement. The results would be reported in a pipeline status report and/or stored (appended to a log, or kept in a more structured format or database) for future reference. For instance, when suspicious data is discovered manually, one could look up when it was introduced.
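To make the run-to-run check concrete, here is a minimal sketch of how it might look. It assumes each output cell can be represented as a `(table, variable, key, date)` tuple mapped to a numeric value; the function name `flag_large_changes` and both thresholds are hypothetical and would need tuning per variable. It is not based on the existing pipeline code.

```python
from typing import Dict, List, Tuple

# Hypothetical cell index: (table, variable, key, date).
Cell = Tuple[str, str, str, str]


def flag_large_changes(
    previous: Dict[Cell, float],
    current: Dict[Cell, float],
    max_rel_change: float = 0.5,   # assumed threshold: flag changes above 50%
    min_abs_change: float = 100.0,  # assumed threshold: ignore small absolute moves
) -> List[dict]:
    """Compare two pipeline runs and return cells whose value changed suspiciously."""
    flags = []
    for cell, new_value in current.items():
        old_value = previous.get(cell)
        if old_value is None:
            # Cell did not exist in the previous run, so there is nothing to compare.
            continue
        abs_change = abs(new_value - old_value)
        if abs_change < min_abs_change:
            continue
        # Guard against division by zero for cells that were previously 0.
        rel_change = abs_change / max(abs(old_value), 1.0)
        if rel_change > max_rel_change:
            flags.append(
                {
                    "cell": cell,
                    "previous": old_value,
                    "current": new_value,
                    "rel_change": rel_change,
                }
            )
    return flags
```

The returned list of dictionaries could be appended to a log or written to a structured store on each run, which would make it possible to look up when a suspicious value first appeared.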

Some errors that have come up that would likely be caught by such a system:

- Regions with confirmed cases > population
- Regions with area > area of earth
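A rough sketch of range checks for these two errors, assuming each region is available as a dictionary of merged fields. The field names `total_confirmed`, `population`, and `area_sq_km`, as well as the function `check_record`, are assumptions for illustration and may not match the actual table schemas. The Earth surface area constant is approximate.

```python
from typing import Dict, List, Optional

# Approximate total surface area of Earth (land and water), in square kilometers.
EARTH_SURFACE_AREA_SQ_KM = 510_100_000


def check_record(record: Dict[str, Optional[float]]) -> List[str]:
    """Return a list of human-readable problems found in one region's record."""
    problems = []

    # Field names below are hypothetical; adjust to the real schema.
    confirmed = record.get("total_confirmed")
    population = record.get("population")
    area = record.get("area_sq_km")

    # Cumulative confirmed cases should not exceed the region's population.
    if confirmed is not None and population is not None and confirmed > population:
        problems.append("confirmed cases exceed population")

    # No region can be larger than the whole planet.
    if area is not None and area > EARTH_SURFACE_AREA_SQ_KM:
        problems.append("area exceeds area of Earth")

    return problems
```

Missing values are skipped rather than flagged, so that a check on one field does not fail merely because another table has no data for that region.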

winwiz1 commented 3 years ago

Related issue for epidemiology data: #186