catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Modify boiler_fuel_eia923 transform to yield natural primary key #852

Closed zaneselvans closed 2 years ago

zaneselvans commented 3 years ago

The boiler_fuel_eia923 should have a natural primary key of:

However, there are 4032 rows out of 1.14 million that actually do contain data, and are duplicates on the basis of that primary key. 100% of them are natural gas, and contain no information about ash or sulfur content -- they are nothing but mmbtu. We could definitely aggregate them into unique rows with the above natural primary key.

Should we do that? Should the be summed? Might they actually be reporting errors in which just one of the records should be retained?

This came out of Issue #846

TrentonBush commented 2 years ago

Looking at the raw data shows that these 'duplicate' boiler rows refer to different prime movers (CA and CT) connected to the same boiler. This only applies to combined cycle plants, which is why they are all gas. The prime mover column is absent from the pudl.output.eia923.boiler_fuel_eia923 table, which is why it was hard to discern.

I interpret the boiler_fuel table as a denormalized list of graph edges connecting three things: fuels, prime movers, and boilers. But because most prime_mover to boiler relationships happen to be 1:1, and the EIA title is "boiler_fuels", I think we mistakenly dropped the prime mover column and have been presenting it as a table only about boiler to fuel relationships. I think prime mover must be part of the primary key, and probably this table should be split into two separate edge lists: fuels to boilers, and boilers to prime_movers (I'd have to think more about that).

We can certainly aggregate to boilers, but it changes the meaning of the table. We would lose the ability to distinguish CT energy inputs from (I believe) CA supplementary boiler energy inputs.

zaneselvans commented 2 years ago

Oh wow. Whooops. I think that would make it entirely analogous to the generation_fuel_eia923 table @cmgosnell, which would mean that we can probably use exactly the same methodologies for allocating fuel consumption to boilers that we've applied to allocating net generation to generators. Does that sound right?

Is the prime_mover_code column present in the database itself? Did we just mess this up in the output layer?

It seems pretty weird that there's only 4000 records out of 1 million where this behavior is actually observed though.

TrentonBush commented 2 years ago

Err.. false alarm!

After looking closer, I'm leaning towards duplicates being a reporting artifact in the boiler_ids, NOT that prime_mover_code should be in the key. This is for a couple reasons:

  1. The vast majority (~95%) of combined cycle plants use separate boiler IDs for the CA and CT parts.
  2. It seems like there was a big change in reporting requirements that perhaps caused errors that were never revisited. There is a step change in 2014 after which most of the duplicates occur. This coincides with a step change in the number of rows with the CT prime mover code, from 350 in 2013 to >1600 in 2014. Of the 77 plants that ever had a duplicate (2009-2019), 32 of them had a duplicate in only one or two years (seems accidental), and the 45 long term offenders almost all started in 2014.
  3. The table is called boiler_fuel!

Maybe there is a physical reason that a CA and CT could report separate fuel usage yet share a boiler/combustor, but I struggle to think of one. Somehow the CT would have to provide auxiliary heat that bypasses its own turbine. That's physically plausible but the timing seems unlikely (sudden onset across 40 pre-existing plants in 2014).

A note on scale: I found 22k duplicate keys (2%), but I think it is more useful to count plants or plant-years than plant-boiler-fuel-years. From 2014-2019, around 40 plant_ids per year (3% of plant_ids, 4.5% of combined cycle plant_ids) contain at least one duplicate. In earlier years it's <1% of all plants.

What To Do: Uniqueness, consistency, convenience - pick two. Some options are:

  1. Aggregate it away and have somewhat inconsistent data for ~40 plants. This blends CA and CT fuel info together and destroys a graph edge
  2. Preserve all the info by adding prime_mover_code to the key. That seems like a misuse of that column and could confuse people.
  3. Preserve all the info by creating a surrogate pudl_boiler_id based on the composite (boiler_id, prime_mover_code). I think the duplicate CA/CT pairs are really supposed to have different boiler IDs; the surrogate ID could proceed from that assumption. The only other tables we have with boiler_id are boiler_generator_assn_eia860 and boilers_entity_eia, both of which are derived. So we could adjust them too. This seems like the most complete but most expensive solution.

I think aggregating it is the best option. There is real data in the duplicate keys, but only a few percent of the total. Half the CA fuel values are 0s anyway, so there is even less impact than the plant_id fraction suggests.

zaneselvans commented 2 years ago

They definitely had a big shift in how combined cycle "boilers" are treated in the reporting in 2014-2015. IIRC previously there just wasn't any data for the most part because a CT isn't a "boiler" even though that's where most of the fuel gets burned (which results in ridiculous heat rates for the generators since only supplemental post-CT fuel gets reported as a fuel input).

In typical operations I think it's extremely difficult to disaggregate fuel consumption from the combustion and steam turbine portions of the CCGT. They run just the gas turbine sometimes as a peaking unit I think, but aggregating across the two prime movers wouldn't harm anything there I don't think would it? You'd just get an (accurate) lower thermal efficiency (higher heat rate) as a consequence.

I'm not sure but I could imagine this having implications in the unit_id_pudl assignment process I was working on earlier this year for heat rate calculations. All of that code is in the pudl.output.eia860 module if you want to skim it. It depends on the reported prime mover codes for CCGT components to identify connected components / units, augmenting the boiler_generator_assn_eia860 table (and probably ultimately getting incorporated into that table somehow).

What kind of graph edge are you thinking of being destroyed here?

TrentonBush commented 2 years ago

The HRSGs sometimes (about half of them, based on the rate of non-zero CA fuel values in this table) have auxiliary heating capability, so do have their own fuel consumption independent of the CT. This low-efficiency supplementary heat seems to be used for seasonal peaking, based on the monthly fuel usage shapes I saw. It is generally much smaller than the fuel used by the CTs, like 1-2 orders of magnitude.

Also, the raw 860 Generator spreadsheet has a column Can Bypass Heat Recovery Steam Generator? (we don't include it in pudl for some reason) that shows that only about 30% of CTs can run as standalone GTs as of 2019.

I mentioned graph edge deletion in case it impacted connected-component/subplant identification. I guess it's really graph node deletion - aggregating a CT and CA together into a single node distorts the graph. I'll take a look at the BGA and unit_id processes for interference.

TrentonBush commented 2 years ago

The 860 Generator spreadsheet has another column duct burners (again not in pudl) that indicate the capability of supplemental heating of HRSGs. In 2019 65% of CAs and CSs had duct burners.

This made me think of another possiblity: maybe the duplicate plants are actually mislabeled CS?

zaneselvans commented 2 years ago

I'm seeing both duct_burners and bypass_steam_recovery in the field metadata, and both are part of the generators_entity_eia table (meaning they are static generator attributes and don't change year to year). Where are you looking and not finding them? I've got about 9500 generator entity records with duct_burners==True in my PUDL DB.

TrentonBush commented 2 years ago

oh I was looking in generators_eia860 where a bunch of the other columns from the spreadsheet were located. I didn't realize those were normalized out.