catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
471 stars 108 forks source link

Rename FERC Form 1 core and output assets #2992

Closed bendnorman closed 11 months ago

bendnorman commented 11 months ago

For our first pass at renaming the FERC Form 1 core and output assets, we just applied the new naming convention to the existing asset name. We later realized some of the original asset_names were not consistent and did not accurately describe the data. Our Form 1 masters @zaneselvans and @cgosnell created more descriptive and consistent names in the Naming Conventions spreadsheet.

We should make these name changes before we merge in #2818 so we don't have two rounds of widespread name changes.

I started to pluralize some of the FERC 1 core tables in #2914, but it would probably be easier to cut a new branch off of #2818 and replace the asset names in the "Asset name in rename-core-asset branch" column of the spreadsheet with the names with y'alls suggestions.

Don't forget to update the asset names in the transformation metadata csvs! I forgot to do this in the first round of renaming and ran into some unexpected errors.

- [x] Finalize new ferc1 asset names
- [x] Update the [PUDL Naming Conventions tab](https://docs.google.com/spreadsheets/d/1RBuKl_xKzRSLgRM7GIZbc5zUYieWFE20cXumWuv5njo/edit#gid=1126117325) of the Naming Conventions spreadsheet
- [x] Rename core ferc1 assets
- [x] Rename output ferc1 assets
- [x] Rename the `pudl.transform.ferc1` classes/TableId's to reflect new table names
cmgosnell commented 11 months ago

Okay I have two FERC1 naming questions that I'd love some feedback on:

This whole comment will make more sense if you go looks at the three iterations of suggestions in this tab.

its electric

There are 9 tables that either have electric in the name or we could add it in.

my proposal

Form 1 is a form for electric utilities. It seems very reasonable to assume that most of the tables are about electric utilities. We've already been systematically adding utility_type into these FERC1 tables (lots of the tables have it already, more in #2894). If we were actually consistent about the names here we would need to add electric into at least another 7 table names. My suggestion is to remove all electric from the table names but make sure we have utility_type == "electric" in those tables.

its a plant

(i'm not talking about the generation plant tables here) there are three depreciation tables + one non-deprish table that have plant in them right now.

my proposal

remove plant! in the accounting sense of the word plant, I think all depreciation tables by their very definition contain plants. i don't think this is a particularly helpful addition all of these tables have some plant id-ing columns. I feel less strongly about this one overall than about electric in the table names.

zaneselvans commented 11 months ago

Removing plant and electric is attractive in a kind of "database normalization" sense, since it would reduce duplicative information, but I don't know that that's the most important thing about how we name the tables.

To me having the explicit utility_type or plant_* columns inside the tables feels independent of the table names. Looking at the list of tables, nobody is going to know what columns are inside them.

The old DBF table names were hopelessly illegible with their inconsistent and highly abbreviated names. The XBRL table names are almost ridiculously long and descriptive, but I think that does help identify what the heck is inside them -- especially if someone is coming in with a familiarity with the "paper" FERC Form 1, table names that somewhat correspond to the schedule titles in the PDF will help them find what they're looking for, even if they aren't the 150 character long titles that they use in XBRL.

There are also some limited cases in which there's non-electric utility information being reported in the FERC 1 tables.

If we were to remove both plant and electric then instead of electric_plant_in_service_ferc1 would we just have in_service_ferc1? That seems pretty terse and hard to interpret.

Another option that could help folks navigate the PDF to database chasm is using the schedule number in the table names, as the XBRL table names do. That makes it very easy to find the corresponding information in the XBRL Taxonomy viewer or to find the pages in the PDF that correspond to the information in the table. They also seem to be pretty stable over the years.

I think having e.g. electric_plant_ in the name of a table is more about differentiating that table from others that don't pertain to plant at all. E.g. the income_statement electric_energy_sources and electric_energy_dispositions tables aren't about accounting for the capital stock, they're about the flow of income (and expenses?) or energy, and I think that's a useful kind of information to have in the list of tables. When I am exploring a new database the first thing I do looks something like...

.tables -- get a list of all the tables in the DB
SHOW ALL TABLES; -- gives more information about the tables, but often gets truncated.
DESCRIBE table_name; -- list the columns in the table and their types, nullability, etc.
SUMMARIZE table_name; -- calculate some summary statistics for the table. Min/max median, null-ness, etc.

It seems like the tension here is between the inconvenience and potentially duplicative nature of longer names, and the desire for the table names to be legible and connected to information users are likely to already be familiar with.

zaneselvans commented 11 months ago

There's also our colloquial usage of "plants" in the case of generation plant vs. FERC's use of the capital accounting "plant" and if we're going to remove "plant" then should we also remove "plants?" Or should we specify that the tables containing "plants" now are specifically generation_plant. And if we do that, then for consistency should we be specifying the other kinds of non-generation plant where the tables refer to a particular type? (are there any of those?)

cmgosnell commented 11 months ago

I mostly dislike the electricin the table names because it is inconsistent, it seems at least semi-arbitrary across the tables, its duplicative with the content of the tables themselves and if we make it consistent across the tables than only 7 of our 24 FERC1 tables would not have electric in the name.

bendnorman commented 11 months ago

Electric

Are there utilities in these tables where utility_type != "electric"? If there is a mix of utility types in these tables, I don’t think it makes sense to include “electric” in the table name because there might be data that doesn't describe an electric utility. Even if utility_type == "electric" for all records it seems duplicative to include "electric" in the table name.

However, if there are tables that don't describe electric utilities, we should probably include "electric" in the table names of tables that do describe electric utilities.

Plant

I agree with @zaneselvans that users should get a sense of what is in the table without having to look at columns or values. If the entity the tables describe is a "plant" and not all Form 1 tables describe plants, then I think we should include "plant" in the table name so they can be differentiated.

cmgosnell commented 11 months ago

Electric

Are there utilities in these tables where utility_type != "electric"? If there is a mix of utility types in these tables, I don’t think it makes sense to include “electric” in the table name because there might be data that doesn't describe an electric utility. Even if utility_type == "electric" for all records it seems duplicative to include "electric" in the table name.

I believe all of the tables with the suggestion for electric in the name are fully utility_type == "electric"

However, if there are tables that don't describe electric utilities, we should probably include "electric" in the table names of tables that do describe electric utilities.

Every respondent to FERC1 is an electric utility, but some of the tables like the income statement table include income from non-electric portions of their business. Out of our current FERC tables, 7 of them include non-electric portions of the FERC1 respondents.

Plant

I agree with @zaneselvans that users should get a sense of what is in the table without having to look at columns or values. If the entity the tables describe is a "plant" and not all Form 1 tables describe plants, then I think we should include "plant" in the table name so they can be differentiated.

I agree with this for the plant in service table. But for the three depreciation tables, it seems duplicative because depreciation is necessarily about plant assets.

cmgosnell commented 11 months ago

Desires (some of which are conflicting)

jdangerx commented 11 months ago

We had a synchronous call:

Decisions:

cmgosnell commented 11 months ago

thanks y'all for the chat yesterday to come to a good decision on these names!

Here are the new name suggestions:

Asset name in dev NEW NAMES
balance_sheet_assets_ferc1 core_ferc1__yearly_balance_sheet_assets_sched110
balance_sheet_liabilities_ferc1 core_ferc1__yearly_balance_sheet_liabilities_sched110
cash_flow_ferc1 core_ferc1__yearly_cash_flows_sched120
depreciation_amortization_summary_ferc1 core_ferc1__yearly_depreciation_summary_sched336
electric_energy_dispositions_ferc1 core_ferc1__yearly_energy_dispositions_sched401
electric_energy_sources_ferc1 core_ferc1__yearly_energy_sources_sched401
electric_operating_expenses_ferc1 core_ferc1__yearly_operating_expenses_sched320
electric_operating_revenues_ferc1 core_ferc1__yearly_operating_revenues_sched300
electric_plant_depreciation_changes_ferc1 core_ferc1__yearly_depreciation_changes_sched219
electric_plant_depreciation_functional_ferc1 core_ferc1__yearly_depreciation_by_function_sched219
electricity_sales_by_rate_schedule_ferc1 core_ferc1__yearly_sales_by_rate_schedules_sched304
fuel_ferc1 core_ferc1__yearly_steam_plants_fuel_sched402
income_statement_ferc1 core_ferc1__yearly_income_statements_sched114
other_regulatory_liabilities_ferc1 core_ferc1__yearly_other_regulatory_liabilities_sched278
plant_in_service_ferc1 core_ferc1__yearly_plant_in_service_sched204
plants_hydro_ferc1 core_ferc1__yearly_hydroelectric_plants_sched406
plants_pumped_storage_ferc1 core_ferc1__yearly_pumped_storage_plants_sched408
plants_small_ferc1 core_ferc1__yearly_small_plants_sched410
plants_steam_ferc1 core_ferc1__yearly_steam_plants_sched402
purchased_power_ferc1 core_ferc1__yearly_purchased_power_and_exchanges_sched326
retained_earnings_ferc1 core_ferc1__yearly_retained_earnings_sched118
transmission_statistics_ferc1 core_ferc1__yearly_transmission_lines_sched422
utility_plant_summary_ferc1 core_ferc1__yearly_utility_plant_summary_sched200
cmgosnell commented 11 months ago

okay two output table name questions regarding the schedule # suffixes:

bendnorman commented 11 months ago

These table names look great thank you!

I say we keep the schedule numbers around until the tables are combined with other tables from different datasets or schedules.

cmgosnell commented 11 months ago

That makes total sense to me @bendnorman and is my inclination as well.

These are the only FERC names that i think had any weirdness in regards to the schedule name and where i think this lands:

Current Asset Name Proposed Asset Name
denorm_fuel_by_plant_ferc1 out_ferc1__yearly_steam_plants_fuel_by_plant_sched402
denorm_plants_all_ferc1 out_ferc1__yearly_all_plants
denorm_plants_utilities_ferc1 _out_ferc1__yearly_plants_utilities
cmgosnell commented 11 months ago

oop i realized there is still one downstream ferc1 naming question:

Current Asset Name Proposed Asset Name
plant_parts_eia out_eia__yearly_plant_parts
denorm_plants_all_ferc1 out_ferc1__yearly_all_plants
out__yearly_plants_all_ferc1_plant_parts_eia ????

Options:

basically how do we deal w/ double datasets?

bendnorman commented 11 months ago

By double datasets, do you mean assets that come from multiple sources? We typically make source optional once there is data from multiple sources. What does the data in out__yearly_plants_all_ferc1_plant_parts_eia describe?

cmgosnell commented 11 months ago

Yes double datasets as in two or more input sources.

out__yearly_plants_all_ferc1_plant_parts_eia is the record linkage between out_ferc1__yearly_all_plants and out_eia__yearly_plant_parts. the good old pudl_out.ferc1_eia.

The two (?) analogous tables I can think of are core_epa__assn_epacamd_eia core_epa__assn_epacamd_eia_subplant_ids. But this table is an epa table that epa linked to eia and we augment that link.

How about using PUDL as the source name?

How about using FERC as the source name?

This table effectively is the out_ferc1__yearly_all_plants with the out_eia__yearly_plant_parts graphed onto it, so it is at whatever heterogeneous granulatiry of plant groupings that FERC1 reports.

because of core_pudl__assn_plants_ferc1 and core_pudl__assn_plants_eia it seems like we could:

if we end up making an EIA generators table with FERC data scaled down (re #2946) we could name that out_eia__yearly_assn_plants_all_ferc1 or ya know. I think the main downside here is that it's not really an assntable in that it's not a 1:m association. Each FERC1 plant(ish) record is linked to one EIA plant parts record.

bendnorman commented 11 months ago

We made some core_pudl__... assets because there are association tables that we manually compiled. I suppose we created this ferc to eia connection, just not manually so I guess using core_pudl is appropriate? Should this apply to future downstream assets that combine multiple datasets, though? Maybe our policy should be if the raw data was manually compiled by PUDL source should be pudl. If it's an asset that combines data from multiple data sources, source should be left blank.

I think it's important to include the names of the entities and sources that an assn table links so I'm in favor of out_pudl__yearly_assn_ferc1_eia_plant_parts and out_pudl__yearly_assn_ferc1_eia_plants.

Is each entity in these tables a plant or a plant part?

zaneselvans commented 11 months ago

I think in many of our assets there's more than one input source and at that point it's okay for the notion of the "source" to get diluted and left out of the name. But in the case of association / glue tables whose whole point is to link two tables from different data sources together it's probably good to refer to both of those data sources somehow and include assn in the name. So...

I think that whatever this table is named it should definitely include:

So maybe that's out__yearly_assn_ferc1_eia_plant_parts

If we are going to use pudl as a data source ID, I think we should be clear about what that means, since it's obviously attributing the data to ourselves somehow. I think that's clearly correct when we've manually compiled something, but less so in this case where we're algorithmically deriving something.

zaneselvans commented 11 months ago

Also for the love of god I wish we could standardize on the ORDERING of multiple data sources. Like if there's more than one always do them in alphabetical order.

bendnorman commented 11 months ago

@cmgosnell and I just chatted about how to name this asset. We settled on out_pudl__yearly_assn_ferc1_eia_plant_parts. We decided to use pudl as a source when we create the connection between two datasets (core_pudl__assn_plants_eia,core_pudl__assn_plants_ferc1, and this asset...) and have no source when we combine datasets using a shared key we did not create.

@cmgosnell can probably better describe why we chose plant_parts over plants.

cmgosnell commented 11 months ago

okay we're going with: out_pudl__yearly_assn_eia_ferc1_plant_parts!

why:

okay i love the alphabetization. that's a good call. we had ferc1_eia... in lots and lots of places! I preserved the pudl_out.ferc1_eia

eia/epa

this isn't FERC1 related but there are lots of epa_eia as well for the crosswalk:

for these, I think that epa being the source makes sense. they do publish the association, we just augment it. but for the assn_epacamd_eia part, should we reorder it to be assn_eia_epacamd*??