catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks source link

Apply naming convention to straggler assets #3052

Closed bendnorman closed 5 months ago

bendnorman commented 7 months ago

Oofta! I thought #3029 was the last renaming PR but it turns out there are a few straggler assets that haven't been renamed.

### Tasks
- [x] `calculation_components_xbrl_ferc1` -> `_core_xbrl_ferc1__calculation_components`
- [x] `clean_xbrl_metadata_json` -> `_core_ferc1_xbrl__metadata_json`
- [x] `metadata_xbrl_ferc1` -> `_core_ferc1_xbrl__metadata`
- [x] `table_dimensions_ferc1` -> `_core_ferc1__table_dimensions`
- [x] `exploded_balance_sheet_assets_ferc1` -> `_out_ferc1__yearly_exploded_balance_sheet_assets`
- [x] `exploded_balance_sheet_liabilities_ferc1` -> `_out_ferc1__yearly_exploded_balance_sheet_liabilities`
- [x] `exploded_income_statement_ferc1` -> `_out_ferc1__yearly_exploded_income_statement`
- [x] `censusdp1tract_to_sqlite` -> `raw_census__dp1`
- [x] `emissions_unit_ids_epacems` -> `_core_epacems__emissions_unit_ids`
- [x] `state_average_fuel_costs_eia` -> `_out_eia__monthly_state_average_fuel_costs`
- [x] `eia860_raw_dfs` -> `raw_eia860__all_dfs`
- [x] `eia861_raw_dfs` -> `raw_eia861__all_dfs`
- [x] `eia923_raw_dfs` -> `raw_eia923__all_dfs`
- [x] `raw_xbrl_metadata_json` -> `raw_ferc1_xbrl__metadata_json`
- [x] `core_ferc714__hourly_demand_pa` -> `core_ferc714__hourly_demand_by_planning_area` (???)
- [x] `out_ferc714__hourly_predicted_state_demand` -> `out_ferc714__hourly_estimated_state_demand`
- [x] Write `out_eia__yearly_generators_by_ownership` to the PUDL DB instead of just pickling. Is anything being returned by `PudlTabl.gens_mega_eia`?
- [x] Double check everything is renamed
- [x] Create migration!
- [x] Run ETL locally
- [x] Run integration tests locally
- [ ] Decide on how to describe exploded tables

None of these assets are currently saved to the database. Given no one currently expects these to be in the database, I propose we hold off on renaming these assets so we can merge #2818 and start doing data-only releases ASAP.

Questions

For @bendnorman: I did some exploded asset renaming in https://github.com/catalyst-cooperative/pudl/pull/2914

zaneselvans commented 6 months ago

I don't know if we want to deal with it now, but there are also a bunch of columns that do not conform to our column naming conventions that I think we should rename at some point. E.g. the rolling averages for CapEx etc in the FERC 1.

I think the place to start would be going through fields.py to ensure that there aren't any obvious wierdos. Identifying any fields that appear in that module which aren't part of any resource and purging them might be a good idea.

zaneselvans commented 5 months ago

The Census DP1 naming isn't great. It really contains state, county, and tract data (in 3 different tables / assets). If we're renaming things it seems like now would be a good time to shift it to just census dp1?

cmgosnell commented 5 months ago

Should the exploded assets be a part of the core of our output layer?

definitely output! they are so weird and modified.

Should the exploded assets be written to the database?

yes (probably)! but we've been kinda waiting to publish them. Right now the main remaining step in the way of publishing them is actually finalizing a fourth rate base table that is much more explainable and user facing. I want to wait to actually publish the current three exploded tables until we have this rate base table because really most users should never use the exploded tables and we can point them to the rate base tables. I believe rmi-ers will want access to the three exploded tables.

What should the naming convention be for these exploded assets?

out_ferc1__yearly_exploded_{core seed table base name}:

Notes: i think the "core seed table base name" here should not include the schedule number because these tables include multiple tables with multiple schedule numbers.

cmgosnell commented 5 months ago

hm but maybe exploded is too... idk not informative.. maybe it should be _detailed or _granular_break_down (omigosh that's long) or something? the gist is it takes a super high level data point like total assets and converts that into as much detail as is possible via the calculations/connections between records that is embedded in the FERC data. We've been calling it explosion bc it explodes a small set of info into a large, more granular view. But that might be too colloquial.

bendnorman commented 5 months ago

I like detailed or granular! IDK if those are too vague but probably more informative than exploded.

cmgosnell commented 5 months ago

I think i like detailed!

cmgosnell commented 5 months ago

also i don't love the bare dp1 for the census... mostly bc idk what that means. i can look it up but it doesn't tell me anything as is. how about raw_census__demographics or raw_census__demographics_dp1 in the same spirit of adding FERC1 schedules to the end of those asset names.

DP-1. Profiles of General Demographic Characteristics

also does it matter that this asset is not a table? the sql part of it does communicate something that i personally find helpful bc without that it looks like it is just a table like the majority of the assets.

bendnorman commented 5 months ago

I agree a more descriptive name would be helpful. How about raw_census__demographic_profiles_dp1?

I'm not sure I understand your question. Are you asking why raw_census__dp1 is not written to the database?

bendnorman commented 5 months ago

I also like detailed! One issue is that there are ~130 references to "exploded" in the code. Should we create a separate PR for renaming all instances of exploded to detailed?

cmgosnell commented 5 months ago

I like raw_census__demographic_profiles_dp1!

I'm not sure I understand your question. Are you asking why raw_census__dp1 is not written to the database?

I mean this isn't actually a table its some weird reference to a sqlite db right?

and i personally think its fine if we keep calling it explode or explosion in the code. especially as a verb i think it makes sense. but very open to converting away from the explosion language

bendnorman commented 5 months ago

Oh I see. The raw_census__demographic_profiles_dp1 asset returns a path to the database. We have some other assets that don't return dataframes. I think to be consistent we should leave the storage method out of the asset name for now.

Would you be ok with having the tables use detailed but the code use exploded for a bit?