catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Consolidate ferc1 outputs using Dagster asset factories #3147

Open zaneselvans opened 6 months ago

zaneselvans commented 6 months ago

The top portion of the pudl.output.ferc1 module contains a number of individual asset definitions for denormalized / output tables with very similar structures. These could be consolidated into a small number of asset factories, following the pattern adopted in e.g. pudl.extract.ferc714 (after PR #3123). See Dagster's blog post Factory Patterns in Python for more background on the factory design pattern and its application to Dagster assets.
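For anyone new to the pattern, here is a minimal sketch of what a factory for these output tables might look like. It uses only the standard library so it runs without a PUDL environment. The table names, the `utility_id_ferc1` join key, and the `denorm_` prefix are illustrative assumptions and not the actual definitions in pudl.output.ferc1. In the real module the inner function would be wrapped with Dagster's `@asset` decorator and would operate on dataframes rather than lists of dicts.

```python
from typing import Callable

Row = dict[str, object]
Table = list[Row]


def denorm_table_factory(
    table_name: str, join_key: str = "utility_id_ferc1"
) -> Callable[[Table, Table], Table]:
    """Build a function that denormalizes one table by attaching utility attributes.

    Hypothetical sketch: in a Dagster project the inner function would be
    decorated with @asset(name=f"denorm_{table_name}") instead of being
    renamed by hand.
    """

    def _denorm(core: Table, utilities: Table) -> Table:
        # Index the utility attributes by the join key for constant-time lookup.
        lookup = {u[join_key]: u for u in utilities}
        # Left join: keep every core row, add utility columns when a match exists.
        return [{**lookup.get(row[join_key], {}), **row} for row in core]

    _denorm.__name__ = f"denorm_{table_name}"
    return _denorm


# One factory call per table replaces one hand-written definition per table.
TABLE_NAMES = ["plants_steam_ferc1", "fuel_ferc1"]  # illustrative names only
denorm_assets = [denorm_table_factory(name) for name in TABLE_NAMES]
```

The point of the design is that the shared join logic lives in one place, and adding another output table becomes a one-line change to the list of names.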

Note that the calls to pudl.helpers.organize_cols() found in the current FERC 1 output asset definitions are no longer required, since the ordering of columns in the database is now determined by the resource definitions / database schema. These calls are left over from when we were producing dataframes for users on request rather than writing these tables to the database.

Also note that some of these assets currently create new columns containing derived values, and those would need to be preserved, either with their own asset definitions, or with some way of keeping track inside the asset factory of which calculations should be done for which tables.
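One possible way to keep track of per-table calculations inside a factory is a registry that maps each table name to the list of derived-column functions it needs. The sketch below is standard-library only and entirely hypothetical: the registry name, the `out_` prefix, the column names, and the capacity factor calculation are assumptions chosen for illustration, not the derived columns that pudl.output.ferc1 actually computes.

```python
from typing import Callable

Row = dict[str, float]
Transform = Callable[[Row], Row]

HOURS_PER_YEAR = 8760


def add_capacity_factor(row: Row) -> Row:
    """Derive a capacity factor from annual generation and capacity (hypothetical columns)."""
    max_generation_mwh = row["capacity_mw"] * HOURS_PER_YEAR
    return {**row, "capacity_factor": row["net_generation_mwh"] / max_generation_mwh}


# Hypothetical registry: which derived-column functions apply to which table.
DERIVED_COLUMNS: dict[str, list[Transform]] = {
    "plants_steam_ferc1": [add_capacity_factor],
    "fuel_ferc1": [],  # no derived columns for this table
}


def output_table_factory(table_name: str) -> Callable[[list[Row]], list[Row]]:
    """Build an output function that applies only the transforms registered for the table."""
    transforms = DERIVED_COLUMNS.get(table_name, [])

    def _output(rows: list[Row]) -> list[Row]:
        result = []
        for row in rows:
            # Apply each registered transform in order; unregistered tables pass through.
            for transform in transforms:
                row = transform(row)
            result.append(row)
        return result

    _output.__name__ = f"out_{table_name}"
    return _output
```

With this layout, tables that need no derived values share the same factory as tables that do, and each calculation stays a small, separately testable function.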

hfireborn commented 2 months ago

@catalyst-cooperative/com-dev Is this still open? I'd like to work on this as a first-time contributor.

zaneselvans commented 2 months ago

Hey there! Yes, this is still open. I thought of this as a good one for you after getting your office hours signup. There are lots of other examples of asset factories floating around the codebase that you could use as a guide. If you have a chance to get the PUDL / Dagster local development environment running, this should be pretty easy to test out locally.