catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks source link

Apply naming convention to output assets #2788

Closed bendnorman closed 7 months ago

bendnorman commented 10 months ago

Apply naming convention to output layer assets.

- [x] #2856
- [x] #2858
- [x] #2865
- [ ] #2882
- [x] MCOE assets #2904
- [x] Plant parts list assets #2904
- [x] exploded ferc1 output assets
- [ ] https://github.com/catalyst-cooperative/pudl/pull/3029
- [x] Rename ferc_to_eia asset

Questions

bendnorman commented 10 months ago

Hey @katie-lamb! I'm trying to decide which output tables should be exposed to users and which ones should be intermediate assets that don't get persisted to the database. Do you think all assets in the allocate_gen_fuel and mcoe groups should be persisted in the database? Are the assets in these groups working towards a single asset that users should interact with? For example, it looks like the assets in the mcoe group all flow towards two assets, mcoe_generators_monthly and mcoe_generators_yearly:

Image

Rows 353-369 list all of the allocate_gen_fuel and mcoe assets.

katie-lamb commented 10 months ago

@bendnorman Yep, the idea behind the mcoe_generators_monthly and mcoe_generators_yearly tables is that they're the data mart counterparts of the mcoe_monthly and mcoe_yearly tables and have all the generator attributes merged onto them. We made that decision because Christina uses the data mart version of the MCOE, but the skinny version is much cleaner to read. I think I'd make a few changes to the MCOE process in an ideal world:

The heat rate, fuel cost, and capacity factor tables do feel pretty intermediate. It was also expressed that the heat rate by generator table is a little misleading because heat rates don't really make sense at a generator level. I'm not sure if there were strong feelings about leaving this tables in the DB.

bendnorman commented 10 months ago

Got it, thanks! So it sounds like everything upstream of mcoe_generators_monthly, mcoe_generators_yearly, mcoe_monthly, mcoe_yearly in the mcoe group don't need to be persisted to the database?

@zaneselvans and @cmgosnell which allocate_gen_fuel assets do think you think need to be in the db?

cmgosnell commented 10 months ago

I'm personally pretty inclined to put all of the generator level heat rate, fuel cost, and capacity factor data into the mcoe_{freq} tables & not save any of the upstream tables to the db. This would mean that there are some nulls for some of the data fields but I don't think that is particularly untidy structurally because it is still well normalized - it'll just have more nulls.

The only table in this upstream mcoe stuff that is structurally different are the heat_rate_by_unit_{freq} tables so i think there is cause to db-ify those as well.

my 🪙🪙:

also light clarification on the heat rate by generators: the asset there is based on the unit-level heat rate which is fine! calculating heat rates from generator-level data without aggregating to units first is where things get weird. For combined cycle units in particular bc the gas turbine side is generally the side that burns the vast majority of the fuel where the steam side just uses the excess steam from the latter.

bendnorman commented 10 months ago

Got it. Would we need to change the logic of these assets or can we just persist the three assets you listed and pickle everything else?

cmgosnell commented 10 months ago

I don't think any logic would need changing... but @katie-lamb you were just in there recently and would know best.

katie-lamb commented 10 months ago

Yep, I don't think any logic would need to be changed off the top of my head.

also light clarification on the heat rate by generators: the asset there is based on the unit-level heat rate which is fine! calculating heat rates from generator-level data without aggregating to units first is where things get weird.

Right right, I forgot that heat rate by generator is downstream of heat rate by unit and makes sense again. So if you did want to persist heat rate by unit then go for it (because heat rate by generator level information is in the MCOE table). Alternatively, you could put heat rate by unit as a column in the MCOE table and not persist the heat by unit table.

bendnorman commented 10 months ago

Alternatively, you could put heat rate by unit as a column in the MCOE table and not persist the heat by unit table.

I'm on a table pruning quest so I feel intrigued by this idea :) I'm not super familiar with units and heat rates so I'm not sure if a generator-level MCOE table would be a logical place to explore unit heat rates. What do y'all think?

Also, which assets in the allocate_gen_fuel group should be user facing?

katie-lamb commented 10 months ago

I'm on a table pruning quest so I feel intrigued by this idea :) I'm not super familiar with units and heat rates so I'm not sure if a generator-level MCOE table would be a logical place to explore unit heat rates. What do y'all think?

Ya I agree, it's maybe not so logical. I'm not sure how important it is for unit level heat rates to be accessible.

Also, which assets in the allocate_gen_fuel group should be user facing?

I'm not entirely sure which ones are useful as outputs. My intuition is generation_fuel_by_generator_{agg}_eia923. The other ones feed into the MCOE and/or PPE process I believe.

zaneselvans commented 10 months ago

heat_rate_by_gen is just heat_rate_by_unit merged onto the generators using their associated unit_id_pudl so if we have an output table that includes the plant, unit, and generator IDs, then just keeping a column that indicates the heat rate by unit seems like a great idea -- it'll be more obvious on what basis the heat rate actually makes physical sense, but it'll still be in a table associated with the generator IDs, if that's the entity someone needs to work with.

I think keeping all of the "intermediate" values that we calculate on the way to cost per MWh in the output table is important -- capacity factor and heat rate are really useful basic generator metrics to track over time and are used in a lot of modeling to describe the energy system (this is the primary value that both of the Gregs depend on from us). I think we should keep any record that has any non-null derived values, and make sure we null out any dependent values that relied upon out-of-bounds values upstream (maybe we're already doing this)

Also also... I think we should probably stop referring to this table MCOE. The only cost it includes is fuel, and mostly it's a compilation of useful unit or generator level derived values. If/when we bring non-fuel variable O&M, fixed O&M, and ongoing CapEx from FERC 1 into a table alongside the fuel costs, we'll have a real MCOE table -- that was where we thought we were headed when we started with this table. It's just taken looooooooonger than expected to get there.

bendnorman commented 9 months ago

Thanks for all of the input y'all! Here are a few remaining questions I have:

arengel commented 9 months ago

Thanks Ben, we actually use all of those tables but I am guessing that if you just provided generation_fuel_by_generator_energy_source_monthly_eia923 and generation_fuel_by_generator_energy_source_owner_yearly_eia923, the remaining could be derived from those. If a simple SQL query can't faithfully reproduce the others then we'd want them all.

zaneselvans commented 9 months ago

@bendnorman I'm pretty sure that @grgmiller uses some subset of those generation_fuel_by_generator tables as well. @arengel if you want to look at the asset definitions that are used to build these tables that should give a decent idea of whether a simple SQL query can reproduce them... but if you're using that information already, and we're producing it already, and it's important that the data being used definitely reflect the data as it existed in these intermediate states for internal consistency, it seems like we should probably keep them.

We've been assuming that we would add those other cost components to the MCOE table for like 5 years, and I think that there are really two distinct sets of information that's being compiled which probably each deserve their own tables. What we have right now is derived generator or unit level characteristics that ultimately impact the attribution of fuel costs, which an independent set of interesting variables from the actual electricity production cost components (which will include fuel costs and also all the other fixed & variable operating costs and capital expenditures from FERC Form 1). So I think we should rename our current MCOE table to reflect the derived generator/unit attributes or quantities that it contains, which I think are

The only real heat rate is heat_rate_by_unit and I think that column should be in the output tables, along with both generator_id and unit_id_pudl which will make it clearer what "unit" the heat rate pertains to, and get rid of the false precision of having a "per generator" heat rate which is just per unit heat rate broadcast across all the generators in a unit (so I'm suggesting getting rid of heat_rate_by_generator entirely).

But over time I could imagine accumulating additional useful quantities that belong in a table like this. E.g. estimates of:

What should a table like that be called? out__derived_generator_operational_characteristics? Kinda long. Seems like it would eventually contain aggregated values from CEMS as well as these derived values from EIA860/923. And maybe it does make sense for the many different cost components to end up in the same table when they're compiled. But a table with all of that information definitely wouldn't be "MCOE".

bendnorman commented 9 months ago

Sounds good @arengel and @zaneselvans! We'll keep all of the assets in the allocate_gen_fuel group in the database. I generally think PUDL should provide the most granular data and useful aggregates so users don't have to expend extra time figuring out the proper way to aggregate tables.

So @zaneselvans, you're imagining having two tables, one for generator operational characteristics (what our current mcoe table is) and a future table for financial information? When we have both, we can create a proper mcoe table? How about out__{freq}_generator_operational_characteristics for the operational characteristics table? We've been pretty verbose with the names so far, which could cause issues later if we move to a database with column name length limits. We could also do something like out__{freq}_generator_ops.

zaneselvans commented 9 months ago

An issue with out__{freq}_generator_operational_characteristics is that there are lots of operational characteristics that are reported directly, and already show up in the generator annual and entity tables. Do we want all of those characteristics to be in the same table regardless of whether they're directly reported or the result of some involved calculation the depends on reported time series data? The primary keys of the tables are likely to be the same, so structurally it seems like it would be fine. But many of the derived values will be very sparse, because they only apply to some types of generators, or there's only data available in the appropriate form for a small subset of the generators.

Would it make sense to append all of the cost components into that same table as we compile them? The generators table already has more than 100 columns, but maybe 200 columns is fine? It's not hard to just select the dozen columns you actually need for whatever calculation you're doing, and if/when we put the data in a columnar data structure it would be very efficient. And I guess that's a common pattern for data warehouses. But somehow it still feels unwieldy to me. I imagine users would end up curating their own thematic lists of columns that they use in different contexts.