catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks source link

Update data validation to accommodate 2009-2010 EIA 860 data #530

Closed zaneselvans closed 4 years ago

zaneselvans commented 4 years ago

Known data validation failures to address:

zaneselvans commented 4 years ago

The net_metering column is read in as a mix of Y and N values, and a bunch of nulls. So it's really a boolean field with NA values (which has only become supported as of Pandas 1.0 (see #536, and #252). It was coded as boolean in the column data type dictionaries, but it wasn't getting cleaned appropriately in the pudl.transform.eia860.plants() function, so none of the values looked like booleans. My guess is that's why it was ending up entirely Null. I've added it to the list of boolean columns that needs to be cleaned, so after another ETL it'll hopefully come out correct. But really it needs to be fixed in #252.

zaneselvans commented 4 years ago

Looking at the number of "raw" (directly from the annual table), "entity" (from the static entity tables for utilities, plants and generators), and the output tables, there are some discrepancies in the number of records that need to be explained -- and which seem to be responsible in some way for the weird numbers of records that are failing in the validation.

utils_raw: 44801
utils_eia860: 44801
utils_entity: 12017

plants_raw: 84828
plants_eia860: 84906
plants_entity: 12375

gens_raw: 242630
gens_eia860: 189297
gens_entity: 31585

own_raw: 42564
own_eia860: 33994

Weirdly there are 78 more plant records in the EIA 860 output table than in the raw EIA 860 plants table -- the extra plants that doesn't show up in the plants table itself appear to be the ones that are harvested form the EIA 923, but don't show up in 860 anywhere. They have NaT values for their report_date column.

Somewhat less weirdly, but still not making sense, there are many more generator records in the raw table (like 53,000 more) than in the output table -- need to see where they're getting dropped / lost, and a difference of ~8,500 between the number of raw and output ownership records. Should step through the output routines in the debugger to see where the records are getting culled and why. It's not immediately obvious from highlighting those records which don't show up in both. the raw and output tables.

zaneselvans commented 4 years ago

Hmm. It looks like at least in the generators_eia860 function, there's information from the plant entity tables being merged in, before the plants_utils_eia information is merged in, which seems weird -- shouldn't all the entity fields be part of the pu_eia function instead? Also, there's a merge that includes both plant_id_eia and plant_name_eia as keys, which seems wrong, as the name is only meant for human readability. It's not a key field.