Closed zaneselvans closed 2 years ago
Looking at the raw data shows that these 'duplicate' boiler rows refer to different prime movers (CA and CT) connected to the same boiler. This only applies to combined cycle plants, which is why they are all gas. The prime mover column is absent from the pudl.output.eia923.boiler_fuel_eia923
table, which is why it was hard to discern.
I interpret the boiler_fuel table as a denormalized list of graph edges connecting three things: fuels, prime movers, and boilers. But because most prime_mover to boiler relationships happen to be 1:1, and the EIA title is "boiler_fuels", I think we mistakenly dropped the prime mover column and have been presenting it as a table only about boiler to fuel relationships. I think prime mover must be part of the primary key, and probably this table should be split into two separate edge lists: fuels to boilers, and boilers to prime_movers (I'd have to think more about that).
We can certainly aggregate to boilers, but it changes the meaning of the table. We would lose the ability to distinguish CT energy inputs from (I believe) CA supplementary boiler energy inputs.
Oh wow. Whooops. I think that would make it entirely analogous to the generation_fuel_eia923
table @cmgosnell, which would mean that we can probably use exactly the same methodologies for allocating fuel consumption to boilers that we've applied to allocating net generation to generators. Does that sound right?
Is the prime_mover_code
column present in the database itself? Did we just mess this up in the output layer?
It seems pretty weird that there's only 4000 records out of 1 million where this behavior is actually observed though.
Err.. false alarm!
After looking closer, I'm leaning towards duplicates being a reporting artifact in the boiler_id
s, NOT that prime_mover_code
should be in the key. This is for a couple reasons:
Maybe there is a physical reason that a CA and CT could report separate fuel usage yet share a boiler/combustor, but I struggle to think of one. Somehow the CT would have to provide auxiliary heat that bypasses its own turbine. That's physically plausible but the timing seems unlikely (sudden onset across 40 pre-existing plants in 2014).
A note on scale: I found 22k duplicate keys (2%), but I think it is more useful to count plants or plant-years than plant-boiler-fuel-years. From 2014-2019, around 40 plant_id
s per year (3% of plant_id
s, 4.5% of combined cycle plant_ids
) contain at least one duplicate. In earlier years it's <1% of all plants.
What To Do: Uniqueness, consistency, convenience - pick two. Some options are:
prime_mover_code
to the key. That seems like a misuse of that column and could confuse people.pudl_boiler_id
based on the composite (boiler_id, prime_mover_code)
. I think the duplicate CA/CT pairs are really supposed to have different boiler IDs; the surrogate ID could proceed from that assumption. The only other tables we have with boiler_id
are boiler_generator_assn_eia860
and boilers_entity_eia
, both of which are derived. So we could adjust them too. This seems like the most complete but most expensive solution.I think aggregating it is the best option. There is real data in the duplicate keys, but only a few percent of the total. Half the CA fuel values are 0s anyway, so there is even less impact than the plant_id fraction suggests.
They definitely had a big shift in how combined cycle "boilers" are treated in the reporting in 2014-2015. IIRC previously there just wasn't any data for the most part because a CT isn't a "boiler" even though that's where most of the fuel gets burned (which results in ridiculous heat rates for the generators since only supplemental post-CT fuel gets reported as a fuel input).
In typical operations I think it's extremely difficult to disaggregate fuel consumption from the combustion and steam turbine portions of the CCGT. They run just the gas turbine sometimes as a peaking unit I think, but aggregating across the two prime movers wouldn't harm anything there I don't think would it? You'd just get an (accurate) lower thermal efficiency (higher heat rate) as a consequence.
I'm not sure but I could imagine this having implications in the unit_id_pudl
assignment process I was working on earlier this year for heat rate calculations. All of that code is in the pudl.output.eia860
module if you want to skim it. It depends on the reported prime mover codes for CCGT components to identify connected components / units, augmenting the boiler_generator_assn_eia860
table (and probably ultimately getting incorporated into that table somehow).
What kind of graph edge are you thinking of being destroyed here?
The HRSGs sometimes (about half of them, based on the rate of non-zero CA fuel values in this table) have auxiliary heating capability, so do have their own fuel consumption independent of the CT. This low-efficiency supplementary heat seems to be used for seasonal peaking, based on the monthly fuel usage shapes I saw. It is generally much smaller than the fuel used by the CTs, like 1-2 orders of magnitude.
Also, the raw 860 Generator spreadsheet has a column Can Bypass Heat Recovery Steam Generator?
(we don't include it in pudl for some reason) that shows that only about 30% of CTs can run as standalone GTs as of 2019.
I mentioned graph edge deletion in case it impacted connected-component/subplant identification. I guess it's really graph node deletion - aggregating a CT and CA together into a single node distorts the graph. I'll take a look at the BGA and unit_id processes for interference.
The 860 Generator spreadsheet has another column duct burners
(again not in pudl) that indicate the capability of supplemental heating of HRSGs. In 2019 65% of CAs and CSs had duct burners.
This made me think of another possiblity: maybe the duplicate plants are actually mislabeled CS?
I'm seeing both duct_burners
and bypass_steam_recovery
in the field metadata, and both are part of the generators_entity_eia
table (meaning they are static generator attributes and don't change year to year). Where are you looking and not finding them? I've got about 9500 generator entity records with duct_burners==True
in my PUDL DB.
oh I was looking in generators_eia860
where a bunch of the other columns from the spreadsheet were located. I didn't realize those were normalized out.
The
boiler_fuel_eia923
should have a natural primary key of:report_date
plant_id_eia
boiler_id
fuel_type_code
However, there are 4032 rows out of 1.14 million that actually do contain data, and are duplicates on the basis of that primary key. 100% of them are natural gas, and contain no information about ash or sulfur content -- they are nothing but mmbtu. We could definitely aggregate them into unique rows with the above natural primary key.
Should we do that? Should the be summed? Might they actually be reporting errors in which just one of the records should be retained?
This came out of Issue #846