catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Dagster refactor design #1960

Closed by bendnorman 1 year ago

bendnorman commented 1 year ago

Our current dagster refactor plan doesn't describe how to persist interim tables or how to apply dagster concepts to the output and analysis layers.

Data Persistence questions

See above

Output layer questions

bendnorman commented 1 year ago

The Design

To reiterate some of our goals with dagster:

My initial iterations of using dagster for PUDL mostly used ops and graphs. Wrapping pudl functions in these abstractions felt like adding an extra layer of complexity (see the discussion in #1835). Wrapping functions in ops also didn't provide a clear way to persist interim, output, and analysis tables to the database. These issues led me to dagster's other paradigm, software-defined assets.

Pros of software-defined assets:

My plan is to convert each pudl function that extracts or cleans a dataframe into an asset or multi_asset. For example, extraction functions will be wrapped in multi_assets and produce one asset for each raw dataframe. Most individual table transform functions will become assets, because each can depend on multiple tables but produces a single one. There will be no need for loading functions because IO Managers will handle loading to storage (SQLite and Parquet).

The Plan

This refactor will touch everything and it will take a while! To keep the new branch and dev from getting super out of sync, I propose we apply these changes in two phases. The first phase will convert the ETL processes to assets, and the second phase will convert the output and analysis tables. This will make an already monster PR a little less monster. I believe converting just the ETL initially won't be an issue because the PudlTable class is pretty distinct from the ETL.

Here are some rough steps for the first phase:

  1. Convert EIA ETL. Implement and test a SQLite IO Manager.
  2. Convert FERC ETL.
  3. Convert EPA CEMS ETL. Implement and test a Parquet IO Manager.
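
To make the SQLite IO Manager in step 1 a bit more concrete, here is a standard-library-only sketch of the storage logic such a manager might wrap. The class name `SQLiteTableStore` is hypothetical, and the method names only mirror dagster's `IOManager` interface: a real IO manager would subclass `dagster.IOManager` and take a context object (`handle_output(context, obj)`, `load_input(context)`) rather than a table name. Rows are lists of dicts standing in for dataframes.

```python
import sqlite3


class SQLiteTableStore:
    """Hypothetical helper shaped like an IO manager, without importing dagster."""

    def __init__(self, db_path=":memory:"):
        # One shared connection, so an in-memory database survives between calls.
        self.conn = sqlite3.connect(db_path)

    def handle_output(self, table_name, rows):
        """Replace `table_name` with `rows`, a non-empty list of dicts with identical keys."""
        columns = list(rows[0])
        col_sql = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(f'DROP TABLE IF EXISTS "{table_name}"')
            self.conn.execute(f'CREATE TABLE "{table_name}" ({col_sql})')
            self.conn.executemany(
                f'INSERT INTO "{table_name}" ({col_sql}) VALUES ({placeholders})',
                [tuple(row[c] for c in columns) for row in rows],
            )

    def load_input(self, table_name):
        """Read the whole table back as a list of dicts."""
        cur = self.conn.execute(f'SELECT * FROM "{table_name}"')
        names = [d[0] for d in cur.description]
        return [dict(zip(names, row)) for row in cur.fetchall()]
```

Usage would look like `store.handle_output("plants", rows)` followed by `store.load_input("plants")`. Replacing the table on every write is one possible design choice, picked here so that re-running a step is idempotent; the issue itself doesn't say which behavior the real IO Manager should have.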

At each step, I'd love to get feedback to make sure I'm building the right stuff! Once people approve the EIA ETL changes, I'll start to update the test suite to accommodate the dagster changes.

Clarifying Questions