catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

create a protocol for dealing with inconsistent normalized EIA values #446

Open cmgosnell opened 5 years ago

cmgosnell commented 5 years ago

We have had a fair amount of inquiries about null values in EIA tables. Because many of the fields in EIA 860 and 923 are duplicative (reported in multiple tables and across years), we decided to do some systematic normalization. We chose to pull the most consistently reported record across all the EIA tables and years, but we also required a "strictness" level of 70%: at least 70% of the reported records must agree for us to use that value. So if an entity's values haven't been reported with 70% consistency, the field will show up as null. I built in the ability to add special cases for columns where we want to apply a different method, but the only ones we added were latitude and longitude, because they are by far the dirtiest.
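For illustration, here is a minimal sketch of that kind of consistency rule, assuming a pandas Series of values reported for one entity across tables and years. The function name and threshold constant are illustrative only, not PUDL's actual code:

```python
import pandas as pd

# At least 70% of reported records must agree (the "strictness" level).
STRICTNESS = 0.70

def harvest_most_consistent(records: pd.Series, strictness: float = STRICTNESS):
    """Return the most frequently reported value for one entity/column,
    or None if no value reaches the required level of agreement."""
    reported = records.dropna()
    if reported.empty:
        return None
    counts = reported.value_counts()
    top_value, top_count = counts.index[0], counts.iloc[0]
    # Only accept the modal value if it covers >= `strictness` of reports.
    if top_count / len(reported) >= strictness:
        return top_value
    return None  # inconsistent reporting -> harvested value is null

# Example: one attribute reported for a plant across several tables/years.
reports = pd.Series(["CT", "CT", "CT", "CA", None])
print(harvest_most_consistent(reports))  # "CT" (3/4 = 75% agreement)
```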

Overall this successfully harvests values for upwards of 98% of records when all years are ingested, with higher rates for single years of data. Every column other than the lat/long columns must yield at least 95% acceptable values, otherwise initialization fails due to assertions in the code. Because lat/long are much messier, we set this constraint to 92% for those columns.
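An assertion-based check along those lines might look like the following, assuming a DataFrame of harvested entity attributes; the thresholds come from the description above, and the function and column names are assumptions:

```python
import pandas as pd

DEFAULT_MIN_RATE = 0.95
# Looser requirement for the messy coordinate columns.
SPECIAL_MIN_RATES = {"latitude": 0.92, "longitude": 0.92}

def check_harvest_rates(harvested: pd.DataFrame) -> None:
    """Fail the ETL if any column's share of non-null harvested values
    falls below its required threshold."""
    for col in harvested.columns:
        required = SPECIAL_MIN_RATES.get(col, DEFAULT_MIN_RATE)
        rate = harvested[col].notna().mean()
        assert rate >= required, (
            f"{col}: only {rate:.1%} of values harvested "
            f"(required {required:.0%})"
        )
```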

Because we've had so many inquiries about these null or incorrect values (#325, #339, #309) and everyone seems to want these values filled in, I'd propose to include either:

I'm also very open to other ideas. None of these would be perfect, so I'd propose including a column in each of the harvested tables that lists which columns contain potentially suspicious records.
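As one hypothetical reading of that proposal, a per-row `suspicious_columns` field could name every checked column whose harvested value came back null (i.e. failed the consistency test); all names here are illustrative:

```python
import pandas as pd

def add_suspicious_columns(df: pd.DataFrame, checked_cols: list[str]) -> pd.DataFrame:
    """Add a `suspicious_columns` field naming, for each row, the checked
    columns whose harvested value is null."""
    out = df.copy()
    out["suspicious_columns"] = out[checked_cols].isna().apply(
        lambda row: ",".join(row.index[row]), axis=1
    )
    return out
```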

zaneselvans commented 2 years ago

After the harvesting has been refactored, we need to revisit this and probably turn it into an epic, with a bunch of per-column issues -- since we'll need to design aggregation / consolidation functions on a per-column basis.
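One way to structure that per-column design would be a registry mapping column names to consolidation functions, so each column can get its own aggregation rule. This is a sketch under assumed names, not PUDL's actual interfaces:

```python
import pandas as pd

def modal_value(s: pd.Series):
    """Default consolidation: most commonly reported value."""
    s = s.dropna()
    return s.mode().iloc[0] if not s.empty else None

def median_coordinate(s: pd.Series):
    """Looser rule for noisy numeric fields like latitude/longitude."""
    s = s.dropna()
    return s.median() if not s.empty else None

# Per-column overrides; anything not listed falls back to modal_value.
AGG_FUNCS = {
    "latitude": median_coordinate,
    "longitude": median_coordinate,
}

def consolidate(df: pd.DataFrame, entity_key: str) -> pd.DataFrame:
    """Collapse multiple reports per entity using per-column rules."""
    funcs = {c: AGG_FUNCS.get(c, modal_value)
             for c in df.columns if c != entity_key}
    return df.groupby(entity_key).agg(funcs)
```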