The steam and fuel tables are both processed/concatenated, and the ferc plant-id-er works!
I'd love some overall design input. I pretty aggressively moved cleaning details into global variables so that all functions can take a df and/or a table name and/or a source of the ferc1 data (xbrl/dbf), in the hopes that this will be more dagster-friendly and standardized.
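To make that call pattern concrete for reviewers, here is a minimal sketch, not the actual code in this PR. It assumes the cleaning details live in a module-level dict keyed by table name and then by source. `TRANSFORM_PARAMS`, `rename_columns`, and all of the column names are hypothetical, made up just for this example.

```python
import pandas as pd

# Hypothetical module-level parameters, keyed by PUDL table name and then by
# data source. The column names are invented for illustration only.
TRANSFORM_PARAMS = {
    "fuel_ferc1": {
        "dbf": {"rename_columns": {"plant_name": "plant_name_ferc1"}},
        "xbrl": {"rename_columns": {"PlantName": "plant_name_ferc1"}},
    },
}


def rename_columns(df: pd.DataFrame, table_name: str, source: str) -> pd.DataFrame:
    """Rename columns using the parameters stored for this table and source."""
    if source not in ("dbf", "xbrl"):
        raise ValueError(f"source must be 'dbf' or 'xbrl', got {source!r}")
    params = TRANSFORM_PARAMS[table_name][source]
    return df.rename(columns=params["rename_columns"])


# The same function handles both sources, so the results can be concatenated.
raw_dbf = pd.DataFrame({"plant_name": ["a"], "fuel_qty": [1.0]})
raw_xbrl = pd.DataFrame({"PlantName": ["b"], "fuel_qty": [2.0]})
combined = pd.concat(
    [
        rename_columns(raw_dbf, "fuel_ferc1", "dbf"),
        rename_columns(raw_xbrl, "fuel_ferc1", "xbrl"),
    ],
    ignore_index=True,
)
```

The upside of this shape is that every function has the same signature. The downside is that the parameters are far away from the code that uses them.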
Questions:
Is that function call design even reasonable?
If so, can we/should we migrate a lot of the content in these global variables into the pudl metadata? I can see benefits of doing this, but most of these feel really, really ferc-specific (even if they are technically generalizable).
What's the best way to pass around dfs (raw and transformed)? (I think this is in part answered in Ben's dagster comment.)
I don't feel like I have a good, standard way to do this, especially because of the interdependence of some of these tables (steam and fuel, for instance!). It seems like using the dagster op decorators would enable this standardization, but right now we can either:
Always load all of the tables in the main transform function, so we can be explicit about which raw/transformed tables are fed into each table transform.
Do the above (explicit, unique table args) but with an `if table_name in ferc1_settings.tables:` check before each table call... sounds gross to me.
Do some clean, standard thing for all the tables except steam! I think we could do something like:
    ferc1_tfr_dfs = {}
    # make all the non-steam tables
    for table_name in ferc1_settings.tables:
        if table_name == "plants_steam_ferc1":
            continue
        # look up the transform function that shares the table's name
        ferc1_tfr_dfs[table_name] = globals()[table_name](
            ferc1_dbf_raw_dfs.get(table_name),
            ferc1_xbrl_raw_dfs.get(table_name),
        )
    # make the steam table using fuel
    ferc1_tfr_dfs["plants_steam_ferc1"] = plants_steam_ferc1(
        steam_dbf_raw=ferc1_dbf_raw_dfs.get("plants_steam_ferc1"),
        steam_xbrl_raw=ferc1_xbrl_raw_dfs.get("plants_steam_ferc1"),
        fuel_transformed=ferc1_tfr_dfs.get("fuel_ferc1"),
    )
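One possible way to avoid special-casing steam is to have each table transform declare which transformed tables it needs, then run everything in dependency order. The sketch below is only an illustration under stated assumptions. The `register` decorator, the `TRANSFORMS` registry, the `transform` signature, and the `plant` column are all hypothetical, and the steam logic is a stand-in for the real plant-ID-er. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) for ordering, which is roughly the job dagster would take over.

```python
from graphlib import TopologicalSorter

import pandas as pd

# Hypothetical registry: table name -> (transform function, upstream tables).
TRANSFORMS = {}


def register(table_name, depends_on=()):
    """Record a table transform and the transformed tables it needs."""
    def decorator(func):
        TRANSFORMS[table_name] = (func, tuple(depends_on))
        return func
    return decorator


@register("fuel_ferc1")
def fuel_ferc1(dbf_raw, xbrl_raw):
    return pd.concat([dbf_raw, xbrl_raw], ignore_index=True)


@register("plants_steam_ferc1", depends_on=["fuel_ferc1"])
def plants_steam_ferc1(dbf_raw, xbrl_raw, fuel_ferc1):
    steam = pd.concat([dbf_raw, xbrl_raw], ignore_index=True)
    # Stand-in for the real plant-ID logic: flag plants that have fuel records.
    return steam.assign(has_fuel=steam["plant"].isin(fuel_ferc1["plant"]))


def transform(tables, dbf_raw_dfs, xbrl_raw_dfs):
    """Run the requested transforms, plus their upstream tables, in order."""
    # Pull in upstream tables even if they were not requested explicitly.
    needed, stack = {}, list(tables)
    while stack:
        name = stack.pop()
        if name not in needed:
            needed[name] = set(TRANSFORMS[name][1])
            stack.extend(needed[name])
    transformed = {}
    # static_order() yields each table only after everything it depends on.
    for name in TopologicalSorter(needed).static_order():
        func, deps = TRANSFORMS[name]
        upstream = {dep: transformed[dep] for dep in deps}
        transformed[name] = func(dbf_raw_dfs[name], xbrl_raw_dfs[name], **upstream)
    return transformed


dbf = {
    "fuel_ferc1": pd.DataFrame({"plant": ["a"]}),
    "plants_steam_ferc1": pd.DataFrame({"plant": ["a", "b"]}),
}
xbrl = {
    "fuel_ferc1": pd.DataFrame({"plant": ["c"]}),
    "plants_steam_ferc1": pd.DataFrame({"plant": ["c"]}),
}
# Asking for steam alone still builds fuel first.
ferc1_tfr_dfs = transform(["plants_steam_ferc1"], dbf, xbrl)
```

With this shape the main function never names a specific table, and the steam/fuel dependency is stated next to the steam transform instead of inside the loop.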
remaining tasks:
[x] standardize the oob_to_nan
[x] delete _old functions
[ ] dagster-friendly pudl.transform.ferc1.transform function
[ ] test whether the dbf solutions for _multiplicative_error_correction are applicable to the xbrl data
[ ] probably more...
tasks I'm avoiding for now:
[ ] migrate the metadata stored here into the pudl table metadata (when possible/applicable)
[ ] the utility_id_ferc1
[ ] checking if the plant-ID-er is actually doing a good job
[ ] standardizing the extract step
See #1707 and #1722 for the table-specific task lists
A note about the extract step: I think it would be cleaner for the extract step to save dfs under pudl table names, with _instant and _duration suffixes when applicable, instead of the current setup, which saves them under the ferc1 raw table names. This way we could automatically grab the extracted tables with just the pudl table name (which is now the argument for many of these transform functions).
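A minimal sketch of what that lookup could look like, assuming the extract step returned a dict keyed by pudl table name plus suffix. The `get_raw_xbrl_tables` helper and the dict layout are hypothetical. The values would be DataFrames in practice, but placeholder strings are used here so the example has no dependencies.

```python
from typing import Any, Mapping

# Suffixes the extract step would append to the PUDL table name (assumed).
XBRL_SUFFIXES = ("_instant", "_duration")


def get_raw_xbrl_tables(raw_dfs: Mapping[str, Any], table_name: str) -> dict:
    """Collect the extracted XBRL tables for one PUDL table, keyed by period type."""
    found = {}
    for suffix in XBRL_SUFFIXES:
        key = f"{table_name}{suffix}"
        # Not every table has both an instant and a duration version.
        if key in raw_dfs:
            found[suffix.lstrip("_")] = raw_dfs[key]
    return found


raw_xbrl = {
    "fuel_ferc1_duration": "fuel duration df",
    "plants_steam_ferc1_instant": "steam instant df",
    "plants_steam_ferc1_duration": "steam duration df",
}
steam_raw = get_raw_xbrl_tables(raw_xbrl, "plants_steam_ferc1")
fuel_raw = get_raw_xbrl_tables(raw_xbrl, "fuel_ferc1")
```

Each transform function would then only need the pudl table name to find its own raw inputs.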
@cmgosnell commented on Mon Jun 27 2022
@cmgosnell commented on Wed Jun 29 2022
@review-notebook-app[bot] commented on Thu Jul 07 2022
Check out this pull request on ReviewNB. See visual diffs & provide feedback on Jupyter Notebooks.
@zaneselvans commented on Fri Aug 12 2022
I asked some questions about the FERC1 transform refactor design in this comment on #1739
@zaneselvans commented on Wed Aug 17 2022
Also @bendnorman did you see this list of outstanding questions on the linked issue? https://github.com/catalyst-cooperative/pudl/issues/1739#issuecomment-1213323493