catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Data sanity checks for EIA 923 #275

Closed. zaneselvans closed this issue 4 years ago

zaneselvans commented 5 years ago

For each of the EIA 923 data tables, we need to create a suite of data validity tests -- at least high-level sanity checks -- that can be run to ensure nothing weird has happened to the content of the dataset. These can include checking for excessive outlier values, ensuring that median values fall within an expected range, etc. See the test_fbp_ferc1() function for some examples.
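As a rough illustration of the two kinds of checks mentioned above (median within an expected range, and not too many outliers), here is a minimal pandas sketch. It is not the code in test_fbp_ferc1(); the function names, the column name heat_content_mmbtu_per_unit, the sample values, and the bounds are all hypothetical.

```python
import pandas as pd


def median_in_range(df, column, low, high):
    """Return True if the column's median lies within [low, high]."""
    return bool(low <= df[column].median() <= high)


def outlier_fraction(df, column, whisker=1.5):
    """Fraction of non-null values outside the Tukey (IQR-based) fences."""
    values = df[column].dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - whisker * iqr) | (values > q3 + whisker * iqr)
    return float(outside.mean())


# Hypothetical sample data: one value is off by roughly a factor of 100.
fuel = pd.DataFrame(
    {"heat_content_mmbtu_per_unit": [19.8, 20.1, 20.5, 21.0, 20.3, 19.9, 20.7, 2000.0]}
)

# Assumed bounds, chosen only for this example.
median_ok = median_in_range(fuel, "heat_content_mmbtu_per_unit", low=15.0, high=30.0)
frac_outliers = outlier_fraction(fuel, "heat_content_mmbtu_per_unit")
```

In a test suite, each result would typically be wrapped in an assertion, for example requiring that frac_outliers stay below some agreed tolerance. Note that the median check passes on this sample even though one value is clearly wrong, which is why the two checks complement each other.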

Tables that need this kind of sanity check include at least:

Raw (not aggregated by month/year):

Aggregated (by month / year):

zaneselvans commented 4 years ago

Okay, within these EIA 923 tables, it looks like what we mostly have are columns with constrained values, which we should be checking. Are there structural or other kinds of checks that should be run on this data? Things that wouldn't already be covered by the database structure and the lists of allowable values?
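A constrained-value check of the kind described here could look roughly like the sketch below. The helper name, the fuel_group column, and the set of allowed codes are hypothetical stand-ins, not the actual PUDL schema or its lists of allowable values.

```python
import pandas as pd


def find_disallowed_values(df, column, allowed):
    """Return the sorted distinct non-null values in `column` not in `allowed`."""
    values = df[column].dropna()
    return sorted(set(values[~values.isin(allowed)]))


# Hypothetical list of allowable codes for a categorical column.
ALLOWED_FUEL_GROUPS = {"coal", "gas", "oil"}

# Hypothetical sample rows, including a capitalization error and an unknown code.
deliveries = pd.DataFrame(
    {"fuel_group": ["coal", "gas", "Gas", None, "petroleum"]}
)

bad_values = find_disallowed_values(deliveries, "fuel_group", ALLOWED_FUEL_GROUPS)
```

Here null values are ignored on purpose; whether nulls should count as violations is a separate decision, and one the database structure may already enforce.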

zaneselvans commented 4 years ago

Also... can these tests be run on the aggregated versions of the dataframes (monthly / annual)? Are these values still included and expected to be valid?

zaneselvans commented 4 years ago

For the types of data validation tests we've got written up, there doesn't really seem to be anything to test in the gen_eia923 table. I could imagine taking some plant or generator information from EIA860, and verifying whether the amount of net generation is plausible given the plant/generator that it's associated with, but that also seems like something that might be better to check in the MCOE / capacity factor validation.
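For reference, the plausibility check imagined above amounts to an implied capacity factor test. The sketch below assumes annual net generation in MWh and nameplate capacity in MW; the function name, column names, and sample numbers are hypothetical, and as noted this may belong in the MCOE / capacity factor validation instead.

```python
import pandas as pd

HOURS_PER_YEAR = 8760


def implausible_generation(gen, capacity, max_capacity_factor=1.0):
    """Return rows whose implied annual capacity factor exceeds the limit."""
    merged = gen.merge(capacity, on=["plant_id", "generator_id"], how="left")
    merged["capacity_factor"] = merged["net_generation_mwh"] / (
        merged["capacity_mw"] * HOURS_PER_YEAR
    )
    return merged[merged["capacity_factor"] > max_capacity_factor]


# Hypothetical annual net generation, as it might come from a gen_eia923-like table.
gen = pd.DataFrame(
    {
        "plant_id": [1, 1],
        "generator_id": ["A", "B"],
        "net_generation_mwh": [500_000.0, 600_000.0],
    }
)

# Hypothetical generator capacities, as they might come from EIA 860.
capacity = pd.DataFrame(
    {
        "plant_id": [1, 1],
        "generator_id": ["A", "B"],
        "capacity_mw": [100.0, 50.0],
    }
)

flagged = implausible_generation(gen, capacity)
```

Generator B is flagged because 600,000 MWh exceeds the 438,000 MWh that a 50 MW unit could produce running at full output all year. Generators with no matching capacity record get a null capacity factor and are not flagged by this sketch.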

zaneselvans commented 4 years ago

Okay, calling this closed for now. Underlying data issues that were revealed (but not yet addressed) include: