Closed bendnorman closed 1 year ago
Okay I have two FERC1 naming questions that I'd love some feedback on:
electric
in the tables names?plant
in the table names?This whole comment will make more sense if you go looks at the three iterations of suggestions in this tab.
electric
There are 9 tables that either have electric
in the name or we could add it in.
Form 1 is a form for electric utilities. It seems very reasonable to assume that most of the tables are about electric utilities. We've already been systematically adding utility_type
into these FERC1 tables (lots of the tables have it already, more in #2894). If we were actually consistent about the names here we would need to add electric
into at least another 7 table names. My suggestion is to remove all electric
from the table names but make sure we have utility_type == "electric"
in those tables.
depreciation_amortization_summary_ferc1
: has utility_type
electric_energy_dispositions_ferc1
: does not have utility_type
electric_energy_sources_ferc1
: does not have utility_type
electric_operating_expenses_ferc1
: has utility_type
electric_operating_revenues_ferc1
: has utility_type
electric_plant_depreciation_changes_ferc1
: has utility_type
electric_plant_depreciation_functional_ferc1
: has utility_type
electricity_sales_by_rate_schedule_ferc1
: does not have utility_type
plant_in_service_ferc1
: we're adding utility_type
plant
(i'm not talking about the generation plant tables here) there are three depreciation tables + one non-deprish table that have plant
in them right now.
remove plant! in the accounting sense of the word plant, I think all depreciation tables by their very definition contain plants. i don't think this is a particularly helpful addition all of these tables have some plant id-ing columns. I feel less strongly about this one overall than about electric
in the table names.
depreciation_amortization_summary_ferc1
: has plant_function
columnelectric_plant_depreciation_changes_ferc1
: has plant_status
columnelectric_plant_depreciation_functional_ferc1
: has plant_function
and plant_status
plant_in_service_ferc1
: we're adding plant_status
Removing plant
and electric
is attractive in a kind of "database normalization" sense, since it would reduce duplicative information, but I don't know that that's the most important thing about how we name the tables.
To me having the explicit utility_type
or plant_*
columns inside the tables feels independent of the table names. Looking at the list of tables, nobody is going to know what columns are inside them.
The old DBF table names were hopelessly illegible with their inconsistent and highly abbreviated names. The XBRL table names are almost ridiculously long and descriptive, but I think that does help identify what the heck is inside them -- especially if someone is coming in with a familiarity with the "paper" FERC Form 1, table names that somewhat correspond to the schedule titles in the PDF will help them find what they're looking for, even if they aren't the 150 character long titles that they use in XBRL.
There are also some limited cases in which there's non-electric utility information being reported in the FERC 1 tables.
If we were to remove both plant
and electric
then instead of electric_plant_in_service_ferc1
would we just have in_service_ferc1
? That seems pretty terse and hard to interpret.
Another option that could help folks navigate the PDF to database chasm is using the schedule number in the table names, as the XBRL table names do. That makes it very easy to find the corresponding information in the XBRL Taxonomy viewer or to find the pages in the PDF that correspond to the information in the table. They also seem to be pretty stable over the years.
I think having e.g. electric_plant_
in the name of a table is more about differentiating that table from others that don't pertain to plant at all. E.g. the income_statement
electric_energy_sources
and electric_energy_dispositions
tables aren't about accounting for the capital stock, they're about the flow of income (and expenses?) or energy, and I think that's a useful kind of information to have in the list of tables. When I am exploring a new database the first thing I do looks something like...
.tables -- get a list of all the tables in the DB
SHOW ALL TABLES; -- gives more information about the tables, but often gets truncated.
DESCRIBE table_name; -- list the columns in the table and their types, nullability, etc.
SUMMARIZE table_name; -- calculate some summary statistics for the table. Min/max median, null-ness, etc.
It seems like the tension here is between the inconvenience and potentially duplicative nature of longer names, and the desire for the table names to be legible and connected to information users are likely to already be familiar with.
There's also our colloquial usage of "plants" in the case of generation plant vs. FERC's use of the capital accounting "plant" and if we're going to remove "plant" then should we also remove "plants?" Or should we specify that the tables containing "plants" now are specifically generation_plant
. And if we do that, then for consistency should we be specifying the other kinds of non-generation plant where the tables refer to a particular type? (are there any of those?)
I mostly dislike the electric
in the table names because it is inconsistent, it seems at least semi-arbitrary across the tables, its duplicative with the content of the tables themselves and if we make it consistent across the tables than only 7 of our 24 FERC1 tables would not have electric in the name.
Are there utilities in these tables where utility_type != "electric"
? If there is a mix of utility types in these tables, I don’t think it makes sense to include “electric” in the table name because there might be data that doesn't describe an electric utility. Even if utility_type == "electric"
for all records it seems duplicative to include "electric" in the table name.
However, if there are tables that don't describe electric utilities, we should probably include "electric" in the table names of tables that do describe electric utilities.
I agree with @zaneselvans that users should get a sense of what is in the table without having to look at columns or values. If the entity the tables describe is a "plant" and not all Form 1 tables describe plants, then I think we should include "plant" in the table name so they can be differentiated.
Are there utilities in these tables where utility_type != "electric"? If there is a mix of utility types in these tables, I don’t think it makes sense to include “electric” in the table name because there might be data that doesn't describe an electric utility. Even if utility_type == "electric" for all records it seems duplicative to include "electric" in the table name.
I believe all of the tables with the suggestion for electric
in the name are fully utility_type == "electric"
However, if there are tables that don't describe electric utilities, we should probably include "electric" in the table names of tables that do describe electric utilities.
Every respondent to FERC1 is an electric utility, but some of the tables like the income statement table include income from non-electric portions of their business. Out of our current FERC tables, 7 of them include non-electric portions of the FERC1 respondents.
I agree with @zaneselvans that users should get a sense of what is in the table without having to look at columns or values. If the entity the tables describe is a "plant" and not all Form 1 tables describe plants, then I think we should include "plant" in the table name so they can be differentiated.
I agree with this for the plant in service table. But for the three depreciation tables, it seems duplicative because depreciation is necessarily about plant assets.
We had a synchronous call:
Decisions:
plant
? Yes - plant
(accounting asset) differentiates from plantS
(physical entities)
depreciation_*
, though - the only thing that can depreciate is plant
_plantS
tables as _generation_plants
? No - the "hydro"/"pumped storage" etc. prefixes implyschedXXX
to communicate that it is indeed a schedule number.thanks y'all for the chat yesterday to come to a good decision on these names!
Here are the new name suggestions:
Asset name in dev | NEW NAMES |
---|---|
balance_sheet_assets_ferc1 | core_ferc1__yearly_balance_sheet_assets_sched110 |
balance_sheet_liabilities_ferc1 | core_ferc1__yearly_balance_sheet_liabilities_sched110 |
cash_flow_ferc1 | core_ferc1__yearly_cash_flows_sched120 |
depreciation_amortization_summary_ferc1 | core_ferc1__yearly_depreciation_summary_sched336 |
electric_energy_dispositions_ferc1 | core_ferc1__yearly_energy_dispositions_sched401 |
electric_energy_sources_ferc1 | core_ferc1__yearly_energy_sources_sched401 |
electric_operating_expenses_ferc1 | core_ferc1__yearly_operating_expenses_sched320 |
electric_operating_revenues_ferc1 | core_ferc1__yearly_operating_revenues_sched300 |
electric_plant_depreciation_changes_ferc1 | core_ferc1__yearly_depreciation_changes_sched219 |
electric_plant_depreciation_functional_ferc1 | core_ferc1__yearly_depreciation_by_function_sched219 |
electricity_sales_by_rate_schedule_ferc1 | core_ferc1__yearly_sales_by_rate_schedules_sched304 |
fuel_ferc1 | core_ferc1__yearly_steam_plants_fuel_sched402 |
income_statement_ferc1 | core_ferc1__yearly_income_statements_sched114 |
other_regulatory_liabilities_ferc1 | core_ferc1__yearly_other_regulatory_liabilities_sched278 |
plant_in_service_ferc1 | core_ferc1__yearly_plant_in_service_sched204 |
plants_hydro_ferc1 | core_ferc1__yearly_hydroelectric_plants_sched406 |
plants_pumped_storage_ferc1 | core_ferc1__yearly_pumped_storage_plants_sched408 |
plants_small_ferc1 | core_ferc1__yearly_small_plants_sched410 |
plants_steam_ferc1 | core_ferc1__yearly_steam_plants_sched402 |
purchased_power_ferc1 | core_ferc1__yearly_purchased_power_and_exchanges_sched326 |
retained_earnings_ferc1 | core_ferc1__yearly_retained_earnings_sched118 |
transmission_statistics_ferc1 | core_ferc1__yearly_transmission_lines_sched422 |
utility_plant_summary_ferc1 | core_ferc1__yearly_utility_plant_summary_sched200 |
okay two output table name questions regarding the schedule # suffixes:
denorm_fuel_by_plant_ferc1
: should this bb be out_ferc1__yearly_steam_plants_fuel_by_plant
OR out_ferc1__yearly_steam_plants_fuel_by_plant_sched402
?? this output table comes from the og table but we've aggregated the records by plant (the original records are by plant and fuel type). basically should we keep the schedule # when we've changed the structure?denorm_plants_all_ferc1
: this should probably be out_ferc1__yearly_all_plants
. basically i'd just like to note that bc this is a concatenation of a bunch of schedules i think we should drop the _sched#s
These table names look great thank you!
_sched#s
when we've combined tables from multiple schedules.I say we keep the schedule numbers around until the tables are combined with other tables from different datasets or schedules.
That makes total sense to me @bendnorman and is my inclination as well.
These are the only FERC names that i think had any weirdness in regards to the schedule name and where i think this lands:
Current Asset Name | Proposed Asset Name |
---|---|
denorm_fuel_by_plant_ferc1 | out_ferc1__yearly_steam_plants_fuel_by_plant_sched402 |
denorm_plants_all_ferc1 | out_ferc1__yearly_all_plants |
denorm_plants_utilities_ferc1 | _out_ferc1__yearly_plants_utilities |
oop i realized there is still one downstream ferc1 naming question:
Current Asset Name | Proposed Asset Name |
---|---|
plant_parts_eia | out_eia__yearly_plant_parts |
denorm_plants_all_ferc1 | out_ferc1__yearly_all_plants |
out__yearly_plants_all_ferc1_plant_parts_eia | ???? |
Options:
basically how do we deal w/ double datasets?
By double datasets, do you mean assets that come from multiple sources? We typically make source
optional once there is data from multiple sources. What does the data in out__yearly_plants_all_ferc1_plant_parts_eia
describe?
Yes double datasets as in two or more input sources.
out__yearly_plants_all_ferc1_plant_parts_eia
is the record linkage between out_ferc1__yearly_all_plants
and out_eia__yearly_plant_parts
. the good old pudl_out.ferc1_eia
.
The two (?) analogous tables I can think of are core_epa__assn_epacamd_eia
core_epa__assn_epacamd_eia_subplant_ids
. But this table is an epa table that epa linked to eia and we augment that link.
out_pudl__yearly_assn_ferc1_eia_plant_parts
out_pudl__yearly_assn_ferc1_eia_plants
This table effectively is the out_ferc1__yearly_all_plants
with the out_eia__yearly_plant_parts
graphed onto it, so it is at whatever heterogeneous granulatiry of plant groupings that FERC1 reports.
because of core_pudl__assn_plants_ferc1
and core_pudl__assn_plants_eia
it seems like we could:
out_ferc1__yearly_assn_eia_plant_parts
out_ferc1__yearly_assn_plant_parts_eia
if we end up making an EIA generators table with FERC data scaled down (re #2946) we could name that out_eia__yearly_assn_plants_all_ferc1
or ya know. I think the main downside here is that it's not really an assn
table in that it's not a 1:m association. Each FERC1 plant(ish) record is linked to one EIA plant parts record.
We made some core_pudl__...
assets because there are association tables that we manually compiled. I suppose we created this ferc to eia connection, just not manually so I guess using core_pudl
is appropriate? Should this apply to future downstream assets that combine multiple datasets, though? Maybe our policy should be if the raw data was manually compiled by PUDL source should be pudl
. If it's an asset that combines data from multiple data sources, source
should be left blank.
I think it's important to include the names of the entities and sources that an assn
table links so I'm in favor of out_pudl__yearly_assn_ferc1_eia_plant_parts
and out_pudl__yearly_assn_ferc1_eia_plants
.
Is each entity in these tables a plant or a plant part?
I think in many of our assets there's more than one input source and at that point it's okay for the notion of the "source" to get diluted and left out of the name. But in the case of association / glue tables whose whole point is to link two tables from different data sources together it's probably good to refer to both of those data sources somehow and include assn
in the name. So...
I think that whatever this table is named it should definitely include:
out__
yearly
ferc1
eia
assn
plant*
So maybe that's out__yearly_assn_ferc1_eia_plant_parts
If we are going to use pudl
as a data source ID, I think we should be clear about what that means, since it's obviously attributing the data to ourselves somehow. I think that's clearly correct when we've manually compiled something, but less so in this case where we're algorithmically deriving something.
Also for the love of god I wish we could standardize on the ORDERING of multiple data sources. Like if there's more than one always do them in alphabetical order.
@cmgosnell and I just chatted about how to name this asset. We settled on out_pudl__yearly_assn_ferc1_eia_plant_parts
. We decided to use pudl
as a source when we create the connection between two datasets (core_pudl__assn_plants_eia
,core_pudl__assn_plants_ferc1
, and this asset...) and have no source
when we combine datasets using a shared key we did not create.
@cmgosnell can probably better describe why we chose plant_parts
over plants
.
okay we're going with: out_pudl__yearly_assn_eia_ferc1_plant_parts
!
why:
out_
: bc derivedpudl
: bc this is a derived connection that we have created. rn its lots of manual training data. but even if we move to an unsupervised model, it's still a new connection that we have birthed into the world. (it also currently relies on the manual plant_id_pudl
)yearly
: bc annual.ferc1
: duheia
: duhassn
: bc this is a one:one association table.plant_parts
: bc ferc doesn't report whole plants. they report heterogeneous clumps of plants (ex: Plant A, Plant B gas bits, Plant B coal bits, Plant C CT's, Plant C ST's, etc etc), which could otherwise be called plant parts 😎. which we link up with corresponding EIA plant parts. This table is at whatever granularity FERC1 reports so it is not a plants table like an EIA plants table where each record is one whole plant.okay i love the alphabetization. that's a good call. we had ferc1_eia... in lots and lots of places! I preserved the pudl_out.ferc1_eia
this isn't FERC1 related but there are lots of epa_eia
as well for the crosswalk:
for these, I think that epa
being the source makes sense. they do publish the association, we just augment it. but for the assn_epacamd_eia
part, should we reorder it to be assn_eia_epacamd*
??
For our first pass at renaming the FERC Form 1 core and output assets, we just applied the new naming convention to the existing asset name. We later realized some of the original asset_names were not consistent and did not accurately describe the data. Our Form 1 masters @zaneselvans and @cgosnell created more descriptive and consistent names in the Naming Conventions spreadsheet.
We should make these name changes before we merge in #2818 so we don't have two rounds of widespread name changes.
I started to pluralize some of the FERC 1 core tables in #2914, but it would probably be easier to cut a new branch off of #2818 and replace the asset names in the "Asset name in rename-core-asset branch" column of the spreadsheet with the names with y'alls suggestions.
Don't forget to update the asset names in the transformation metadata csvs! I forgot to do this in the first round of renaming and ran into some unexpected errors.