catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

create a protocol for dealing with inconsistent normalized EIA values #446

Open cmgosnell opened 5 years ago

cmgosnell commented 5 years ago

We have had a fair amount of inquiries about null values in EIA tables. Because many of the fields in EIA 860 and 923 are duplicative (reported in multiple tables and across years), we decided to do some systematic normalization. We chose to pull the most consistently reported record across all the EIA tables and years, but we also required a "strictness" level of 70%: at least 70% of the reported records must agree for us to use that value. So if an entity's values haven't been reported with 70% consistency, the field will show up as null. I built in the ability to add special cases for columns where we want to apply a different method, but the only ones we added were latitude and longitude, because they are by far the dirtiest.
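For illustration, here is a minimal sketch of that kind of consistency rule, assuming a pandas Series of values reported for one entity across tables and years. The function name and threshold constant are illustrative only, not PUDL's actual code:

```python
import pandas as pd

# At least 70% of reported records must agree (the "strictness" level).
STRICTNESS = 0.70

def harvest_most_consistent(records: pd.Series, strictness: float = STRICTNESS):
    """Return the most frequently reported value for one entity/column,
    or None if no value reaches the required level of agreement."""
    reported = records.dropna()
    if reported.empty:
        return None
    counts = reported.value_counts()
    top_value, top_count = counts.index[0], counts.iloc[0]
    # Only accept the modal value if it covers >= `strictness` of reports.
    if top_count / len(reported) >= strictness:
        return top_value
    return None  # inconsistent reporting -> harvested value is null

# Example: one attribute reported for a plant across several tables/years.
reports = pd.Series(["CT", "CT", "CT", "CA", None])
print(harvest_most_consistent(reports))  # "CT" (3/4 = 75% agreement)
```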

Overall this successfully harvests values for upwards of 98% of records when all years are ingested, with higher rates for single years of data. Every column other than the lat/long columns must yield at least 95% acceptable values, otherwise initialization fails due to assertions in the code. Because lat/long are much messier, we set this constraint to 92% for those columns.
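An assertion-based check along those lines might look like the following, assuming a DataFrame of harvested entity attributes; the thresholds come from the description above, and the function and column names are assumptions:

```python
import pandas as pd

DEFAULT_MIN_RATE = 0.95
# Looser requirement for the messy coordinate columns.
SPECIAL_MIN_RATES = {"latitude": 0.92, "longitude": 0.92}

def check_harvest_rates(harvested: pd.DataFrame) -> None:
    """Fail the ETL if any column's share of non-null harvested values
    falls below its required threshold."""
    for col in harvested.columns:
        required = SPECIAL_MIN_RATES.get(col, DEFAULT_MIN_RATE)
        rate = harvested[col].notna().mean()
        assert rate >= required, (
            f"{col}: only {rate:.1%} of values harvested "
            f"(required {required:.0%})"
        )
```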

Because we've had so many inquiries about these null or incorrect values (#325, #339, #309) and everyone seems to want these values filled in, I'd propose to include either:

I'm also very open to other ideas. None of these would be perfect, so I'd propose including a column in each of the harvested tables that lists which columns contain potentially suspicious records.
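As one hypothetical reading of that proposal, a per-row `suspicious_columns` field could name every checked column whose harvested value came back null (i.e. failed the consistency test); all names here are illustrative:

```python
import pandas as pd

def add_suspicious_columns(df: pd.DataFrame, checked_cols: list[str]) -> pd.DataFrame:
    """Add a `suspicious_columns` field naming, for each row, the checked
    columns whose harvested value is null."""
    out = df.copy()
    out["suspicious_columns"] = out[checked_cols].isna().apply(
        lambda row: ",".join(row.index[row]), axis=1
    )
    return out
```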

zaneselvans commented 2 years ago

After the harvesting has been refactored, we need to revisit this and probably turn it into an epic, with a bunch of per-column issues -- since we'll need to design aggregation / consolidation functions on a per-column basis.
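One way to structure that per-column design would be a registry mapping column names to consolidation functions, so each column can get its own aggregation rule. This is a sketch under assumed names, not PUDL's actual interfaces:

```python
import pandas as pd

def modal_value(s: pd.Series):
    """Default consolidation: most commonly reported value."""
    s = s.dropna()
    return s.mode().iloc[0] if not s.empty else None

def median_coordinate(s: pd.Series):
    """Looser rule for noisy numeric fields like latitude/longitude."""
    s = s.dropna()
    return s.median() if not s.empty else None

# Per-column overrides; anything not listed falls back to modal_value.
AGG_FUNCS = {
    "latitude": median_coordinate,
    "longitude": median_coordinate,
}

def consolidate(df: pd.DataFrame, entity_key: str) -> pd.DataFrame:
    """Collapse multiple reports per entity using per-column rules."""
    funcs = {c: AGG_FUNCS.get(c, modal_value)
             for c in df.columns if c != entity_key}
    return df.groupby(entity_key).agg(funcs)
```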