India census imports are 30GB (even before Census Tracts/BlockGroups), so doing the current logic for everything won't scale.
And really, the only check we care about for detecting bugs is the "inconsistent stats" one. So I'm wondering about keeping just a single set of 64-bit hashes computed from {place, stat-var, unit, ..., date} without the numeric value. As we process SVObs, we do this check in a streaming way; if there is a hit, we know the offending Place ID, SV ID and CSV row, which we report as a normal ERROR log (and we could even aggressively fail on it).
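A minimal sketch of what that could look like (assumptions: Java, a hand-rolled 64-bit FNV-1a hash, and illustrative class/method names - this is not the actual import-tool code):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: stream over SVObs, hash the key fields {place, stat-var, unit, ..., date}
// WITHOUT the numeric value, and flag any repeated key as a potential
// "inconsistent stats" error, reporting the offending Place ID, SV ID and CSV row.
public class StreamingSvObsChecker {
  // 64-bit FNV-1a hash over the concatenated key fields (value is intentionally excluded).
  private static long keyHash(String place, String statVar, String unit, String date) {
    String key = place + "|" + statVar + "|" + unit + "|" + date;
    long h = 0xcbf29ce484222325L;
    for (int i = 0; i < key.length(); i++) {
      h ^= key.charAt(i);
      h *= 0x100000001b3L;
    }
    return h;
  }

  private final Set<Long> seen = new HashSet<>();

  // Returns false (and logs an ERROR) if this key was already seen, i.e. two SVObs
  // share the same {place, stat-var, unit, date} and may carry conflicting values.
  public boolean check(String place, String statVar, String unit, String date, long csvRow) {
    if (!seen.add(keyHash(place, statVar, unit, date))) {
      System.err.printf(
          "ERROR: repeated SVObs key (possible inconsistent stats): place=%s statVar=%s (csv row %d)%n",
          place, statVar, csvRow);
      return false;
    }
    return true;
  }
}
```

(A `HashSet<Long>` boxes every hash, so a real implementation would probably use a primitive long-set to stay near 8 bytes per entry.)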
At 1B input nodes (the India team is doing ~200M in one dataset now), that is still only ~8GB of 64-bit hashes, which seems okay.
Just reporting the Place ID and SV ID might help users debug, but they can also re-run the stats-checker with that Place ID to get more details.
WDYT?
This makes sense to me, I can work on this next!
+1, that sounds good to me. Thanks!
We ran into the "inconsistent stats" bug with the EPA import, where it only happens for a small fraction of places - https://dashboards.corp.google.com/_51d0e7ea_3666_4fe8_82a6_acbf24d2585f?p=CNSPATH:%2Fcns%2Fjv-d%2Fhome%2Fdatcom%2Fimport_server%2Finternal%2Fvalidate_EPA_GHGRP_Validation%2FValidationProto@100.
It is not a very large import, so for cases like that it might be nice to just run the check through all places in the import... or at least some significant fraction of them.