The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
We are converting our ETL to use Dagster software-defined assets so that tables can be cleaned and persisted in parallel. Here are the main goals of this epic:
Parallelize our ETL. Currently, all PUDL steps run in series, even though most portions of the ETL could run in parallel. For example, the EIA ETL doesn't depend on the FERC extraction steps. Running independent steps concurrently should improve the performance of the ETL.
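As a minimal sketch of what this looks like with software-defined assets (the asset names and dummy data below are hypothetical, not PUDL's actual extraction code), two extraction assets with no dependency between them can be materialized concurrently:

```python
import pandas as pd
from dagster import asset


@asset
def raw_eia_generators() -> pd.DataFrame:
    """Hypothetical EIA extraction step."""
    return pd.DataFrame({"plant_id": [1, 2], "capacity_mw": [50.0, 75.0]})


@asset
def raw_ferc_plants() -> pd.DataFrame:
    """Hypothetical FERC extraction step; shares no inputs with the EIA asset."""
    return pd.DataFrame({"respondent_id": [10, 11]})
```

Because neither asset depends on the other, Dagster's multiprocess executor is free to run them at the same time.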
Run portions of the ETL from caches of upstream data. Currently, when we are cleaning tables in development, we need to rerun time-intensive extraction steps. For example, the EIA `read_excel` calls are very slow (see #1172). Dagster allows us to rerun the code that generates a specific table using cached upstream data as inputs. This will enable faster table cleaning.
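A rough sketch of this workflow, reusing the hypothetical `raw_eia_generators` asset from the sketch above and assuming it has already been materialized so its output can be loaded from the IO manager's persisted copy rather than re-extracted:

```python
import pandas as pd
from dagster import asset, materialize


@asset
def clean_eia_generators(raw_eia_generators: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step that depends on the raw EIA asset."""
    return raw_eia_generators.dropna()


# Only the cleaning asset is selected for execution; Dagster loads
# raw_eia_generators from its previously persisted output instead of
# rerunning the slow extraction step.
materialize(
    [raw_eia_generators, clean_eia_generators],
    selection="clean_eia_generators",
)
```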
Visual documentation of our complex ETL. PUDL is complex, and we almost never have an up-to-date diagram that describes how the system works! Dagster displays the dependencies between data tables and the code that generates them. This will improve contributor onboarding.
Uniform interface for persisting data to storage. PUDL currently lacks a uniform interface for persisting data to storage (Parquet files and SQLite databases). For example, `pudl.load.dfs_to_sqlite` and `pudl.convert.ferc_to_sqlite` write data frames to SQLite databases using similar, duplicated code. Dagster IO managers provide a common interface for persisting data frames to storage. This will make integrating new datasets easier because we can reuse the persistence code; people integrating new datasets will just have to write a function that returns a data frame.
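As an illustration of what that common interface could look like (this is a sketch, not PUDL's actual implementation; the class and resource names are hypothetical), a Dagster IO manager that writes each asset's data frame to a table in a SQLite database might be written as:

```python
import pandas as pd
import sqlalchemy as sa
from dagster import IOManager, io_manager


class PandasSQLiteIOManager(IOManager):
    """Persist each asset's data frame as a table in a SQLite database."""

    def __init__(self, db_path: str):
        self.engine = sa.create_engine(f"sqlite:///{db_path}")

    def handle_output(self, context, obj: pd.DataFrame) -> None:
        # The final component of the asset key becomes the table name.
        table_name = context.asset_key.path[-1]
        obj.to_sql(table_name, self.engine, if_exists="replace", index=False)

    def load_input(self, context) -> pd.DataFrame:
        # Read the upstream asset's table back out of the database.
        table_name = context.asset_key.path[-1]
        return pd.read_sql_table(table_name, self.engine)


@io_manager(config_schema={"db_path": str})
def pandas_sqlite_io_manager(init_context):
    return PandasSQLiteIOManager(init_context.resource_config["db_path"])
```

With an IO manager like this attached as a resource, any asset that returns a data frame gets persisted automatically, which is what reduces integrating a new dataset to writing a function that returns a data frame.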
Persist interim tables. Currently, PUDL only persists the clean, normalized tables at the end of the ETL. We'd like to persist the raw and partially cleaned data for a couple of reasons: some users might want access to the raw data so they can apply different cleaning methods, and being able to integrate partially cleaned data will allow us to release draft data from new data sources. Assets will represent valuable checkpoints of the tables we clean. This talk discusses the benefits of the "asset mindset": assets allow you to deliver incremental improvements to data.
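Concretely, each stage of the cleaning pipeline becomes its own asset, so the raw and partially cleaned tables are persisted checkpoints rather than intermediate variables inside one big function (again, the asset names and toy data here are hypothetical):

```python
import pandas as pd
from dagster import asset


@asset
def raw_fuel_receipts() -> pd.DataFrame:
    """Raw extracted table, persisted as its own checkpoint."""
    return pd.DataFrame({"fuel": ["coal", None], "cost": [30.0, 25.0]})


@asset
def partially_clean_fuel_receipts(raw_fuel_receipts: pd.DataFrame) -> pd.DataFrame:
    """Interim checkpoint: rows with missing fuel types dropped."""
    return raw_fuel_receipts.dropna(subset=["fuel"])


@asset
def clean_fuel_receipts(partially_clean_fuel_receipts: pd.DataFrame) -> pd.DataFrame:
    """Fully cleaned table at the end of the chain."""
    return partially_clean_fuel_receipts.reset_index(drop=True)
```

Because every asset in the chain is materialized to storage, users who want the raw or interim data can load those tables directly, and draft data from a new source can ship before the full cleaning chain is finished.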