Closed zaneselvans closed 4 years ago
I found lots of duplicate plant/generators in boiler_generator_assn_eia860
, especially among combined cycle plants. For example, the results from
SELECT * FROM boiler_generator_assn_eia860
WHERE plant_id_eia = 117
AND DATE_PART('year', report_date)=2017
include
In this case many generators are listed with 2 different boilers. Not sure if this is the expected behavior but I didn't anticipate it. I merged this in with the 860 generator data so that I could get heat rates by unit and capacity by unit (where possible). As a result the merged dataframes included many duplicate generator lines and my total capacity for each unit was significantly inflated.
@cmgosnell can probably speak to this more authoritatively, but I believe this is the expected behavior. The real inseparable generation units, in which the fuel input and electricity output are intermingled and inseparable (insofar as the data is available from EIA), are the unit_id_pudl
values. So this pant has 5 total units, 2 of which are combined cycle. I think the large number of different combinations is probably a result of the combined cycle cases where the "boiler" and "generator" are really the same thing, and the generator_id
values are also getting added into the list of boiler_id
values.
This particular table isn't meant so much for direct use, as it is meant to provide information about the association / separability of the generation units for the purposes of calculating heat rates, attirbuting fuel costs, etc. You probably only ever want to merge this table with another one using plant_id_eia
and one of either generator_id
or boiler_id
with the intention of pulling in the unit_id_pudl
that is associated with that boiler, or that generator.
Okay, I'm going to close this issue because we finished a preliminary round of sanity checks for 860. I'm making a lot of comments below to respond to @zaneselvans 's questions.
- [ ]
plants_eia860
zip_code
being treated as integers, when it needs to be a string to preserve (meaningful) leading zeroes.
Should be straightforward to fix. Made Issue #404.
iso_rto_name
seems... wrong. It's Y/N/None, not a name. If these values are the right ones then it should be a boolean, not a string (True/False/NA).
We never cleaned up this column/transitioned it to a true boolean.
grid_voltage_kv
is sometimesNaN
when grid_voltage_2_kv` has a value. Seems like an error -- maybe harvesting process overwrote the first one b/c it wasn't consistent enough?
It is almost certainly the harvesting of grid_voltage_kv
(which is much much more reported than the grid_voltage_n_kv
records),
primary_purpose_naics_id
andsector_id
are really integers, but are getting turned into floats because of someNaN
values. Should be made into ExtensionArray types at some point (nullable integers)
I made a new issue to deal with this (Issue #403)
- Many columns using
None
as the NA value. Is that correct? Lots of them are boolean or string type columns.
^ samezies
transmission_distribution_owner_id
is an integer, turning into float b/c of missing values.
^ samezies
utility_id_eia
missing from many entries. Seems wrong.
That does seem odd.
plant_id_pudl
and utility_id_pudl
turning into floats because of missing values.^ samezies
- [ ]
utils_eia860
- Some utilities have
NA
forutil_id_pudl
which is turning the column into floats, when it should be integers -- also, should we not be assigning a PUDL utility ID to every utility in the id_mapping process?
^ samezies on null values, and we do not require all EIA utilities to have a util_id_pudl
. FERC utilities need id's because they don't really have anything better, but we don't require EIA id's. Also, we made id's for 923 utilities, not 860 (which is the much more extensive collection, but includes many many more tiny utilities).
- Some
utility_name
fields areNone
??
This is a harvested field, so when it is inconsistent, this happens.
- Zip codes are being interpreted as integers, which means leading zeroes are lost, making them invalid. I think zip codes need to be treated as strings.
I fixed this!
entity_type
is an inscrutable code.
Yes, yes it is. We didn't really clean this column.
- [ ]
gens_eia860
summer_estimated_capability_mw
usesNone
as NA value, instead ofNaN
(but it's a numeric field).- Compare
summer_capacity_mw
,winter_capacity_mw
andcapacity_mw
to see if they're the same / similar.ownership_code
is mysterious. Can we make them readable?
Mkay, samzies again on the null values. And the ownership_code
is another field that we haven't yet touched or cleaned.
- [ ]
bga_eia860
- Is there some kind of testing that we should be doing here to ensure that there are no incompatibilities in the grouping of PUDL unit IDs and EIA unit IDs? There's a warning that comes up during the ETL process that gets at this issue. Seems like it should go here too (or maybe, instead).
This would be a little tricky because we are not storing/keeping the consistency columns to do this test after the fact. I'm sure it is possible but we would have to get a little creative to interpret which entities have consistent records.
- [ ]
own_eia860
zip_code
field is integer, losing leading zeroes.
addressed above ( Issue #404.)
- Is there a foreign key relationship between
owner
andutility
tables? They use the same IDs I think.
I've never investigated this. I would assume this is true but I actually don't know.
- Already checking that the ownership proportions add up to ~100%.
whoop
For each of the EIA 860 data tables, we need to create a suite of data validity tests -- at least high level sanity checks -- that can be run to ensure nothing weird has happened that's affected the content of the dataset. These can include checking for excessive outlier values, ensuring that median values are within an expected range, etc. See the
test_fbp_ferc1()
function for some examples.Output tables that should be sanity checked include:
plants_eia860
zip_code
being treated as integers, when it needs to be a string to preserve (meaningful) leading zeroes.iso_rto_name
seems... wrong. It's Y/N/None, not a name. If these values are the right ones then it should be a boolean, not a string (True/False/NA).grid_voltage_kv
is sometimesNaN
when grid_voltage_2_kv` has a value. Seems like an error -- maybe harvesting process overwrote the first one b/c it wasn't consistent enough?primary_purpose_naics_id
andsector_id
are really integers, but are getting turned into floats because of someNaN
values. Should be made into ExtensionArray types at some point (nullable integers)None
as the NA value. Is that correct? Lots of them are boolean or string type columns.transmission_distribution_owner_id
is an integer, turning into float b/c of missing values.utility_id_eia
missing from many entries. Seems wrong.plant_id_pudl
andutility_id_pudl
turning into floats because of missing values.utils_eia860
NA
forutil_id_pudl
which is turning the column into floats, when it should be integers -- also, should we not be assigning a PUDL utility ID to every utility in the id_mapping process?utility_name
fields areNone
??entity_type
is an inscrutable code.gens_eia860
summer_estimated_capability_mw
usesNone
as NA value, instead ofNaN
(but it's a numeric field).summer_capacity_mw
,winter_capacity_mw
andcapacity_mw
to see if they're the same / similar.ownership_code
is mysterious. Can we make them readable?bga_eia860
own_eia860
zip_code
field is integer, losing leading zeroes.owner
andutility
tables? They use the same IDs I think.