catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Data validation errors after integrating eia860m 2020-11 #943

zaneselvans closed this issue 2 years ago

zaneselvans commented 3 years ago

After simplifying our test suite setup (issue #942), I ran the data validation tests to make sure they still worked with the new setup. A few tables had more rows than expected because (I think) of the integration of the eia860m data through November 2020. These included plants_eia860, utilities_eia860, pu_eia860, and generators_eia860, all of which would be expected to change with the addition of new generators.

However, there were some other data validation failures that don't really make sense and should be tracked down: a null distributed_generation column in the MCOE output, and too many records in the generation_fuel_eia923 table:

FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_eia923-1551264-1250340-104195] - ValueError: Too many records (128817>109404.75) in dataframe gf_eia923
FAILED test/validate/mcoe_test.py::test_no_null_cols_mcoe[eia_annual-mcoe-all] - ValueError: Null column: distributed_generation found in dataframe mcoe
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_eia923-1551264-1250340-104195] - ValueError: Too many records (1545804>1312857.0) in dataframe gf_eia923
FAILED test/validate/mcoe_test.py::test_no_null_cols_mcoe[eia_monthly-mcoe-all] - ValueError: Null column: distributed_generation found in dataframe mcoe
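
For reference, the thresholds in the gf_eia923 errors are consistent with a 5% margin on top of the expected row counts: 104195 * 1.05 = 109404.75 and 1250340 * 1.05 = 1312857.0. The observed counts are roughly 24% above the expected ones in both the annual and monthly cases, and the monthly figures are exactly 12 times the annual ones. Below is a minimal sketch of that kind of check. It assumes the last two numbers in the test ID are the expected monthly and annual row counts, and the function names and signatures are hypothetical, not necessarily what the validation module actually uses:

```python
def max_allowed_rows(expected_rows: int, margin: float = 0.05) -> float:
    """Upper bound on row count: expected rows plus a fractional margin."""
    return expected_rows * (1 + margin)


def check_max_rows(n_rows: int, expected_rows: int, margin: float = 0.05,
                   df_name: str = "") -> None:
    """Raise ValueError if n_rows exceeds the allowed maximum (hypothetical helper)."""
    max_rows = max_allowed_rows(expected_rows, margin)
    if n_rows > max_rows:
        raise ValueError(
            f"Too many records ({n_rows}>{max_rows}) in dataframe {df_name}"
        )


# (observed rows, expected rows) taken from the failures above:
# annual first, then monthly.
cases = [(128817, 104195), (1545804, 1250340)]
for observed, expected in cases:
    try:
        check_max_rows(observed, expected, df_name="gf_eia923")
    except ValueError as err:
        print(err)
```

Since the annual and monthly counts are off by the same ratio, the extra records look like they come from one source rather than from two separate problems.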

zaneselvans commented 2 years ago

I added a list of deprecated columns to the mcoe null columns check, since there are some generators_eia860 columns which only have data prior to 2008, the earliest year for which we can calculate the MCOE / fuel costs based on our current methods. This fixes the distributed_generation error.
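
As a rough illustration of the idea (not the actual code in the PR), here is a dependency-free sketch of a null-column check that skips a list of deprecated columns. The function name, the `DEPRECATED_COLS` constant, and the toy table are all hypothetical; the table is a plain dict of lists standing in for the mcoe dataframe. With pandas, the per-column test would be something like `df[col].isna().all()`.

```python
from typing import Iterable, List, Mapping, Sequence


def find_null_cols(table: Mapping[str, Sequence[object]],
                   ignore: Iterable[str] = ()) -> List[str]:
    """Return names of columns whose values are all None, skipping ignored ones."""
    skip = set(ignore)
    return [
        col for col, values in table.items()
        if col not in skip and all(v is None for v in values)
    ]


# Hypothetical stand-in for the MCOE output: one column has no data at all
# because it was only reported in years before MCOE can be calculated.
mcoe = {
    "plant_id_eia": [1, 2],
    "distributed_generation": [None, None],
    "heat_rate_mmbtu_mwh": [10.2, 9.8],
}

# Hypothetical list of deprecated columns that are expected to be null.
DEPRECATED_COLS = ("distributed_generation",)

print(find_null_cols(mcoe))                          # flags the null column
print(find_null_cols(mcoe, ignore=DEPRECATED_COLS))  # nothing left to flag
```

Keeping the exclusions in an explicit list means any newly null column that is not known to be deprecated still fails the check.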

The generation_fuel row counts have also been investigated and updated.

Both of these changes are part of PR #1103.