datacommonsorg / import

Tools and pipelines for importing data into the Data Commons Knowledge Graph.
Apache License 2.0

Have a stats-checker option to validate all the places #68

Closed pradh closed 3 years ago

pradh commented 3 years ago

We ran into the "inconsistent stats" bug with EPA import which only happens for a small fraction of places - https://dashboards.corp.google.com/_51d0e7ea_3666_4fe8_82a6_acbf24d2585f?p=CNSPATH:%2Fcns%2Fjv-d%2Fhome%2Fdatcom%2Fimport_server%2Finternal%2Fvalidate_EPA_GHGRP_Validation%2FValidationProto@100.

It is not a very large import, so for imports like this it might be nice to just run through all the places in the import... or some significant fraction of them.

pradh commented 3 years ago

India census imports are 30GB (even before Census Tracts/BlockGroups), so doing the current logic for everything won't scale.

And really, the only check we care about for detecting bugs is the "inconsistent stats" bug. So I'm wondering about keeping just a single set of 64-bit hashes computed from {place, stat-var, unit, ..., date}, without the numeric value. As we process SVObs, we do this check in a streaming way; if there is a hit, we know the offending Place ID, SV ID, and CSV row, which we report as a normal ERROR log (and we could even aggressively fail on it).

At 1B input nodes (India team is doing ~200M in one dataset now), it is still 8GB, which seems okay.

Just reporting the Place ID and SV ID might help users debug, but they can also re-run the stats-checker with that Place ID to get more details.
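To illustrate the idea, here is a minimal sketch of the streaming hash check described above. The function and field names (`check_inconsistent_stats`, `place`, `stat_var`, `unit`, `date`) are hypothetical, not the actual stats-checker API; the key fields are hashed to a 64-bit digest (8 bytes per observation key, which is where the ~8GB-at-1B-nodes estimate comes from), the value is deliberately excluded, and any repeated key is flagged with its Place ID, SV ID, and row number:

```python
import hashlib

def svobs_key_hash(place: str, stat_var: str, date: str, unit: str = "") -> int:
    """64-bit hash over the SVObs key fields, excluding the numeric value."""
    key = "|".join([place, stat_var, date, unit])
    return int.from_bytes(
        hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest(), "big"
    )

def check_inconsistent_stats(rows):
    """Stream over SVObs rows; report every row whose {place, stat-var, unit,
    date} key was already seen, i.e. a candidate "inconsistent stats" hit.

    Returns a list of (place_id, stat_var_id, row_number) tuples.
    """
    seen = set()
    errors = []
    for row_num, row in enumerate(rows, start=1):
        h = svobs_key_hash(
            row["place"], row["stat_var"], row["date"], row.get("unit", "")
        )
        if h in seen:
            # Duplicate key: same place/SV/unit/date observed twice, so two
            # (possibly different) values exist for the same observation.
            errors.append((row["place"], row["stat_var"], row_num))
        else:
            seen.add(h)
    return errors
```

Note this trades exactness for memory: a 64-bit hash can collide, so a reported hit could in principle be a false positive, but at ~1B keys the collision probability stays small, and re-running the checker restricted to the reported Place ID confirms the hit.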

WDYT?

chejennifer commented 3 years ago

This makes sense to me, I can work on this next!

beets commented 3 years ago

+1, that sounds good to me. Thanks!