Closed zaneselvans closed 4 years ago
The net_metering
column is read in as a mix of Y
and N
values, and a bunch of nulls. So it's really a boolean field with NA values (which has only become supported as of Pandas 1.0 (see #536, and #252). It was coded as boolean in the column data type dictionaries, but it wasn't getting cleaned appropriately in the pudl.transform.eia860.plants()
function, so none of the values looked like booleans. My guess is that's why it was ending up entirely Null. I've added it to the list of boolean columns that needs to be cleaned, so after another ETL it'll hopefully come out correct. But really it needs to be fixed in #252.
Looking at the number of "raw" (directly from the annual table), "entity" (from the static entity tables for utilities, plants and generators), and the output tables, there are some discrepancies in the number of records that need to be explained -- and which seem to be responsible in some way for the weird numbers of records that are failing in the validation.
utils_raw: 44801
utils_eia860: 44801
utils_entity: 12017
plants_raw: 84828
plants_eia860: 84906
plants_entity: 12375
gens_raw: 242630
gens_eia860: 189297
gens_entity: 31585
own_raw: 42564
own_eia860: 33994
Weirdly there are 78 more plant records in the EIA 860 output table than in the raw EIA 860 plants table -- the extra plants that doesn't show up in the plants table itself appear to be the ones that are harvested form the EIA 923, but don't show up in 860 anywhere. They have NaT
values for their report_date
column.
Somewhat less weirdly, but still not making sense, there are many more generator records in the raw table (like 53,000 more) than in the output table -- need to see where they're getting dropped / lost, and a difference of ~8,500 between the number of raw and output ownership records. Should step through the output routines in the debugger to see where the records are getting culled and why. It's not immediately obvious from highlighting those records which don't show up in both. the raw and output tables.
Hmm. It looks like at least in the generators_eia860
function, there's information from the plant entity tables being merged in, before the plants_utils_eia information is merged in, which seems weird -- shouldn't all the entity fields be part of the pu_eia function instead? Also, there's a merge that includes both plant_id_eia and plant_name_eia as keys, which seems wrong, as the name is only meant for human readability. It's not a key field.
Known data validation failures to address:
eia860_test
21 generator-years have incomplete ownership data. This is just a hair over the 0.1% maximum incomplete threshold we set, so I bumped the threshold to 0.2%.eia_test.minmax_rows::utils_eia860
(modest increase in the total number of utilities, not too surprising, adjusted expected number of rows up from 35,000 to 40,000)frc_eia923.test_vs_bounds
kerosene heat content (make min 5.4 not 5.6)mcoe_test.py::test_fuel_cost_per_mmbtu
coal fuel cost per mmbtu min dropped to 1.2 from 1.25eia_test.py::test_no_null_cols_eia
failing onplants_eia860
Maybe there's a bad merge or some column naming inconsistency coming up w/ the older spreadsheets? Turns out the null column isnet_metering
and while there aren't a whole lot of non-null entries in the original spreadsheets for this field, there's definitely more than zero of them.eia_test.minmax_rows::own_eia860
Too few rows?! Only 34,000 found when ~35,000 expected. How can adding 2 additional years of data decrease the number of ownership records?eia_test.minmax_rows::gens_eia860
Too few rows?! Only 189,000 when 200,000+ expected. How can adding 2 more years result in fewer generators?