catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Data sanity checks for EIA 860 #276

Closed zaneselvans closed 4 years ago

zaneselvans commented 5 years ago

For each of the EIA 860 data tables, we need to create a suite of data validity tests -- at least high level sanity checks -- that can be run to ensure nothing weird has happened that's affected the content of the dataset. These can include checking for excessive outlier values, ensuring that median values are within an expected range, etc. See the test_fbp_ferc1() function for some examples.

Output tables that should be sanity checked include:

gschivley commented 4 years ago

I found lots of duplicate plant/generators in boiler_generator_assn_eia860, especially among combined cycle plants. For example, the results from

SELECT * FROM boiler_generator_assn_eia860
WHERE plant_id_eia = 117
AND DATE_PART('year', report_date)=2017

include

image

In this case many generators are listed with 2 different boilers. Not sure if this is the expected behavior but I didn't anticipate it. I merged this in with the 860 generator data so that I could get heat rates by unit and capacity by unit (where possible). As a result the merged dataframes included many duplicate generator lines and my total capacity for each unit was significantly inflated.

zaneselvans commented 4 years ago

@cmgosnell can probably speak to this more authoritatively, but I believe this is the expected behavior. The real inseparable generation units, in which the fuel input and electricity output are intermingled and inseparable (insofar as the data is available from EIA), are the unit_id_pudl values. So this pant has 5 total units, 2 of which are combined cycle. I think the large number of different combinations is probably a result of the combined cycle cases where the "boiler" and "generator" are really the same thing, and the generator_id values are also getting added into the list of boiler_id values.

This particular table isn't meant so much for direct use, as it is meant to provide information about the association / separability of the generation units for the purposes of calculating heat rates, attirbuting fuel costs, etc. You probably only ever want to merge this table with another one using plant_id_eia and one of either generator_id or boiler_id with the intention of pulling in the unit_id_pudl that is associated with that boiler, or that generator.

cmgosnell commented 4 years ago

Okay, I'm going to close this issue because we finished a preliminary round of sanity checks for 860. I'm making a lot of comments below to respond to @zaneselvans 's questions.

  • [ ] plants_eia860
    • zip_code being treated as integers, when it needs to be a string to preserve (meaningful) leading zeroes.

Should be straightforward to fix. Made Issue #404.

  • iso_rto_name seems... wrong. It's Y/N/None, not a name. If these values are the right ones then it should be a boolean, not a string (True/False/NA).

We never cleaned up this column/transitioned it to a true boolean.

  • grid_voltage_kv is sometimes NaN when grid_voltage_2_kv` has a value. Seems like an error -- maybe harvesting process overwrote the first one b/c it wasn't consistent enough?

It is almost certainly the harvesting of grid_voltage_kv (which is much much more reported than the grid_voltage_n_kv records),

  • primary_purpose_naics_id and sector_id are really integers, but are getting turned into floats because of some NaN values. Should be made into ExtensionArray types at some point (nullable integers)

I made a new issue to deal with this (Issue #403)

  • Many columns using None as the NA value. Is that correct? Lots of them are boolean or string type columns.

^ samezies

  • transmission_distribution_owner_id is an integer, turning into float b/c of missing values.

^ samezies

  • utility_id_eia missing from many entries. Seems wrong.

That does seem odd.

^ samezies

  • [ ] utils_eia860
    • Some utilities have NA for util_id_pudl which is turning the column into floats, when it should be integers -- also, should we not be assigning a PUDL utility ID to every utility in the id_mapping process?

^ samezies on null values, and we do not require all EIA utilities to have a util_id_pudl. FERC utilities need id's because they don't really have anything better, but we don't require EIA id's. Also, we made id's for 923 utilities, not 860 (which is the much more extensive collection, but includes many many more tiny utilities).

  • Some utility_name fields are None ??

This is a harvested field, so when it is inconsistent, this happens.

  • Zip codes are being interpreted as integers, which means leading zeroes are lost, making them invalid. I think zip codes need to be treated as strings.

I fixed this!

  • entity_type is an inscrutable code.

Yes, yes it is. We didn't really clean this column.

  • [ ] gens_eia860
    • summer_estimated_capability_mw uses None as NA value, instead of NaN (but it's a numeric field).
    • Compare summer_capacity_mw, winter_capacity_mw and capacity_mw to see if they're the same / similar.
    • ownership_code is mysterious. Can we make them readable?

Mkay, samzies again on the null values. And the ownership_code is another field that we haven't yet touched or cleaned.

  • [ ] bga_eia860
    • Is there some kind of testing that we should be doing here to ensure that there are no incompatibilities in the grouping of PUDL unit IDs and EIA unit IDs? There's a warning that comes up during the ETL process that gets at this issue. Seems like it should go here too (or maybe, instead).

This would be a little tricky because we are not storing/keeping the consistency columns to do this test after the fact. I'm sure it is possible but we would have to get a little creative to interpret which entities have consistent records.

  • [ ] own_eia860
    • zip_code field is integer, losing leading zeroes.

addressed above ( Issue #404.)

  • Is there a foreign key relationship between owner and utility tables? They use the same IDs I think.

I've never investigated this. I would assume this is true but I actually don't know.

  • Already checking that the ownership proportions add up to ~100%.

whoop