GoogleCloudPlatform / covid-19-open-data

Datasets of daily time-series data related to COVID-19 for over 20,000 distinct locations around the world.

Data validation #186

Open winwiz1 opened 3 years ago

winwiz1 commented 3 years ago

It would be good to have a staging bucket and copy files from there to the current production bucket after the data has been validated by a data validation utility, for example this one or something similar.

The set of validation checks would address the existing data-irregularity issues and cover the space where future issues could develop.

owahltinez commented 3 years ago

Thanks for the suggestion! This is something we are actively looking into. A couple of notes:

> It would be good to have a staging bucket and copy files from there to the current production bucket after the data has been validated by a data validation utility

We already have a staging bucket and copy the files over to the production bucket. The issue with doing validation at this level is that it's all-or-nothing, and we would not want to block updating the entire output because a handful of inputs look off. We also wouldn't want to "fix" data at this stage since it may mask issues with underlying data sources.

> for example this one or something similar

That looks very interesting! The claims around performance seem great too. We would be open to implementing something similar, but I would suggest doing it at the individual data source level rather than the final output. If this is something you are interested in, we would definitely welcome your contributions. My only request is having a high-level discussion about design first, before submitting a big PR :-)

> The set of validation checks would address the existing data-irregularity issues and cover the space where future issues could develop.

We are aware of some of the data irregularities, but please don't feel discouraged from opening an issue anytime you see something that looks off!

winwiz1 commented 3 years ago

Thanks for the feedback!

> The issue with doing validation at this level is that it's all-or-nothing, and we would not want to block updating the entire output because a handful of inputs look off.

I’d assume it doesn’t have to be all-or-nothing, which is why the data validation utility I referred to has configurable thresholds. The idea is to stop issues like #59, where the final output was truncated from data for ~200 countries down to a single country’s data. Unless something major like that happens, the utility will happily return a zero exit code indicating success and, as a side benefit, produce error files with rejected epidemiology and index rows for review. The rejections can be treated as warnings.
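To illustrate the idea, here is a minimal sketch of threshold-based gating (the file names, column names, and the 5% ceiling are illustrative assumptions, not the utility’s actual interface): rejected rows are written out for review, and the process only exits non-zero when the rejection rate crosses the configured ceiling.

```python
import sys
import pandas as pd

# Hypothetical threshold: abort promotion only if more than 5% of rows
# fail validation; smaller rejection counts are treated as warnings.
MAX_REJECTED_FRACTION = 0.05

def validate_epidemiology(df: pd.DataFrame) -> pd.Series:
    """Return a boolean mask marking rows that pass basic sanity checks."""
    ok = df["key"].notna()
    ok &= df["new_confirmed"].fillna(0) >= 0
    return ok

def main() -> int:
    df = pd.read_csv("epidemiology.csv")
    ok = validate_epidemiology(df)
    rejected = df[~ok]
    # Side benefit: rejected rows are kept in an error file for review
    # rather than silently dropped.
    rejected.to_csv("epidemiology.rejected.csv", index=False)
    if len(rejected) / max(len(df), 1) > MAX_REJECTED_FRACTION:
        print(f"validation failed: {len(rejected)} rows rejected", file=sys.stderr)
        return 1  # non-zero exit blocks promotion to production
    return 0  # rejections below the threshold are warnings; promotion proceeds

if __name__ == "__main__":
    sys.exit(main())
```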

For instance, right now the side benefit includes flagging the index key LY_SR as invalid because its country_name field is set to SR. The set of checks mentioned in the README is meant to detect regressions of existing (and fixed) data-irregularity issues, while also attempting to proactively cover future problems.
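A check along these lines would catch LY_SR (the rule below is a hypothetical heuristic for illustration, assuming the index file’s key and country_name columns):

```python
import pandas as pd

index = pd.read_csv("index.csv")
# Flag rows whose country_name looks like a bare two-letter code
# (e.g. "SR") instead of a human-readable name; LY_SR trips this rule.
suspicious = index[index["country_name"].str.fullmatch(r"[A-Z]{2}", na=False)]
print(suspicious[["key", "country_name"]])
```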

> We would be open to implementing something similar, but I would suggest doing it at the individual data source level rather than the final output.

I’d recommend considering both types of checks as complementary to each other.

When it comes to epidemiology data, most use cases will need both the epidemiology and index files, because case counts from the former are less meaningful without the ability to attribute them to a particular region described in the latter. It would therefore make sense to treat both files as a single unit of testing/deployment. Checks on one file (including checks performed at the individual data source level) can be invalidated, as far as the end user is concerned, by last-minute changes affecting the other file. Running checks on both files immediately prior to deployment into production would give everyone more confidence that this won’t happen.
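A pre-deployment version of that joint check could be as simple as verifying referential integrity between the two files (file names assumed here) just before promotion:

```python
import pandas as pd

epi = pd.read_csv("epidemiology.csv", usecols=["key"])
idx = pd.read_csv("index.csv", usecols=["key"])

# Every epidemiology row must be attributable to a region in the index;
# a last-minute change to either file would break this invariant.
orphans = set(epi["key"]) - set(idx["key"])
if orphans:
    raise SystemExit(f"{len(orphans)} epidemiology keys missing from index")
```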

owahltinez commented 3 years ago

Good point, let me clarify. When I say "all or nothing" I mean that we would not want to selectively filter outputs at the final stage: either we copy them or we don't. That said, it's perfectly valid to run some validation on those outputs so we can spot issues and fix them at the data source; the LY_SR key is a perfect example of that.

So the validation can be run with different thresholds: some findings can be considered errors, which would prevent copying altogether (like #59), while others can be considered warnings, which would result in an issue being opened for us to investigate.
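As a rough sketch of that two-tier policy (the bucket names and the issue-filing step are placeholders, not our actual pipeline), validation findings would gate the staging-to-production copy like so:

```python
from google.cloud import storage

def promote(findings: list[dict]) -> None:
    """Copy staging outputs to production unless any finding is an error."""
    errors = [f for f in findings if f["severity"] == "error"]
    warnings = [f for f in findings if f["severity"] == "warning"]

    if errors:
        # e.g. #59: output truncated to a single country -> block the copy.
        raise SystemExit(f"{len(errors)} blocking errors; production not updated")

    for w in warnings:
        # Placeholder: in practice this would open a GitHub issue to investigate.
        print(f"warning: {w['message']}")

    client = storage.Client()
    src = client.bucket("covid19-staging")     # placeholder bucket names
    dst = client.bucket("covid19-production")
    for blob in src.list_blobs():
        src.copy_blob(blob, dst, blob.name)
```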

winwiz1 commented 3 years ago

If you'd like to go ahead with integrating the utility, its repository can be cloned/copied and the existing functionality used as is; I assume there is no need for a PR. Please let me know if this assumption is incorrect, or if you’d like to change the utility to make it more suitable for testing the unit comprising the index and epidemiology files.