catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Catalog and categorize PUDL data processing tasks #1401

Closed · zaneselvans closed this issue 2 months ago

zaneselvans commented 2 years ago

We started building PUDL before we understood much about ETL tools or software engineering. As a result, the data transformations PUDL performs aren't particularly well organized in terms of when and where they happen, and they don't have a standard API. This is affecting our ability to refactor the code and integrate new data and functionality.

To better understand what all the moving parts are and how we might assemble them more appropriately, this issue attempts to catalog and then categorize them. This should feed into the Prefect refactoring discussion #840.

This is a work in progress.

Extraction

Inputs

Outputs

Issues

Existing Operations:

Potential Changes / Improvements:

Transformation & Tidying

Basic cleaning and reshaping of the data, bringing columns to the point of having uniform data types, physical units, codes, etc.
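As a toy illustration of what this stage covers (the column names, unit conversion, and code mapping below are hypothetical, not PUDL's actual schema):

```python
import pandas as pd

# Hypothetical mapping for collapsing free-form fuel strings to a vocabulary.
FUEL_CODE_MAP = {"BIT": "coal", "COAL": "coal", "NG": "gas", "GAS": "gas"}

def tidy_fuel_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Sketch of a tidying transform: uniform dtypes, units, and codes."""
    df = raw.copy()
    # Uniform data types: report_year arrives as a string in some partitions.
    df["report_year"] = pd.to_numeric(df["report_year"], errors="coerce").astype("Int64")
    # Uniform physical units: some partitions report kWh, others MWh.
    kwh = df["unit"] == "kWh"
    df.loc[kwh, "net_generation"] /= 1000
    df.loc[kwh, "unit"] = "MWh"
    # Standardized codes: unmapped values become NA for later repair.
    df["fuel_type"] = df["fuel_type"].str.strip().str.upper().map(FUEL_CODE_MAP)
    return df
```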

Applying transformations at different stages / scopes

Examples:

Normalization / Entity Resolution / Harvesting

Reconciliation of data that is reported in multiple places inconsistently. Removal of duplicated information so that the output DB has a single source of truth.
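A minimal sketch of the harvesting idea (not PUDL's actual harvesting code): pick one value per entity from many inconsistent reports, and only trust it when most reports agree.

```python
import pandas as pd

def harvest_attribute(df: pd.DataFrame, entity_key: str, attr: str,
                      min_agreement: float = 0.7) -> pd.Series:
    """Resolve one value of `attr` per entity from inconsistent reports."""
    def consensus(values: pd.Series):
        counts = values.dropna().value_counts()
        if counts.empty:
            return pd.NA
        # Keep the most common value only if enough reports agree on it.
        if counts.iloc[0] / counts.sum() >= min_agreement:
            return counts.index[0]
        return pd.NA
    return df.groupby(entity_key)[attr].agg(consensus)
```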

Data Repair

Replacement of obvious outliers and missing values with best estimates to provide as complete a dataset as possible.
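For example, one such repair might look like the sketch below (the rolling-median outlier test and linear interpolation are illustrative choices, not PUDL's actual method):

```python
import pandas as pd

def repair_series(s: pd.Series, window: int = 12, n_sigmas: float = 4.0) -> pd.Series:
    """Null out gross outliers, then fill all gaps with interpolated estimates."""
    # Flag values that deviate wildly from a local (rolling) median.
    local_median = s.rolling(window, center=True, min_periods=1).median()
    deviation = (s - local_median).abs()
    cleaned = s.mask(deviation > n_sigmas * deviation.std())
    # Fill both the original and newly created gaps with best estimates.
    return cleaned.interpolate(limit_direction="both")
```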

Integration

Integration tasks link cleaned datasets together so that they can be used in combination more effectively. This can also involve linking records within a single dataset.
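A sketch of the simple case, assuming the hard record-linkage work has already produced a crosswalk of matched IDs (column names are illustrative):

```python
import pandas as pd

def integrate_plants(ferc_plants: pd.DataFrame, eia_plants: pd.DataFrame,
                     crosswalk: pd.DataFrame) -> pd.DataFrame:
    """Link cleaned FERC and EIA plant records through a crosswalk table."""
    return (
        ferc_plants
        .merge(crosswalk, on="plant_id_ferc1", how="inner", validate="m:1")
        .merge(eia_plants, on="plant_id_eia", how="left", validate="m:1")
    )
```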

Output / Display

Assembly of existing data for easy use or interpretation by humans. This type of operation can typically be accomplished using straightforward SQL.
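For instance, a denormalized, human-friendly view can come out of a single query (the table and column names below are made up, not the real PUDL schema):

```python
import sqlite3
import pandas as pd

# Re-attach human-readable names to a normalized generation table.
QUERY = """
SELECT g.report_year,
       p.plant_name,
       u.utility_name,
       SUM(g.net_generation_mwh) AS net_generation_mwh
FROM generation AS g
JOIN plants AS p ON p.plant_id = g.plant_id
JOIN utilities AS u ON u.utility_id = p.utility_id
GROUP BY g.report_year, p.plant_name, u.utility_name
"""

with sqlite3.connect("pudl.sqlite") as conn:
    wide_output = pd.read_sql(QUERY, conn)
```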

Analysis

More complex analysis that generates new knowledge not directly available elsewhere. These operations benefit from inputs that are as clean and complete as possible.
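One hypothetical example of such derived knowledge: a capacity factor, which isn't reported directly anywhere but falls out of cleaned generation and capacity data (column names assumed):

```python
import pandas as pd

def capacity_factor(gen: pd.DataFrame, cap: pd.DataFrame) -> pd.DataFrame:
    """Derive annual capacity factors from generation and capacity tables."""
    df = gen.merge(cap, on=["plant_id", "report_year"], how="inner")
    hours_per_year = 8760  # ignoring leap years in this sketch
    df["capacity_factor"] = df["net_generation_mwh"] / (df["capacity_mw"] * hours_per_year)
    return df
```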

bendnorman commented 2 years ago

This is super helpful, thank you! Identifying the parts of our process that aren't uniform seems like a good place to start.

Questions

Potential Changes

Based on your comments above, this is what I’m currently envisioning at a high level:

Make our extraction steps more uniform by extracting our data from Zenodo and loading each partition into a database (possibly using Airbyte).
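A minimal sketch of that extract-and-load step, assuming the raw partitions have already been downloaded from the Zenodo archive as CSVs (file and table names here are hypothetical; Airbyte would replace this hand-rolled loop):

```python
import sqlite3
import pandas as pd

# One raw table per partition, loaded verbatim (all strings, no cleaning yet).
PARTITIONS = {
    "raw_eia923_2019": "eia923-2019.csv",
    "raw_eia923_2020": "eia923-2020.csv",
}

with sqlite3.connect("pudl_raw.sqlite") as conn:
    for table, path in PARTITIONS.items():
        pd.read_csv(path, dtype=str).to_sql(table, conn, if_exists="replace", index=False)
```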

Our transformation steps will then read the raw data from the db into Python and perform some transformations:

Each transformation step could load interim tables with appropriate names back to the db. This database will contain:

Limitations:

These thoughts might be out of scope for this issue but I wanted to jot them down somewhere.

bendnorman commented 2 months ago

This issue was fodder for our transition to an orchestration tool. Most of these issues/improvements were addressed by our migration to Dagster.
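For anyone landing here later, the post-migration shape looks roughly like this: each extract/transform step becomes a Dagster software-defined asset, and Dagster infers the dependency graph from function signatures. (Asset names and logic below are illustrative, not PUDL's actual asset definitions.)

```python
import pandas as pd
from dagster import Definitions, asset

@asset
def raw_fuel() -> pd.DataFrame:
    # In PUDL this would read a raw partition archived on Zenodo.
    return pd.DataFrame({"fuel_type": ["BIT", "NG"], "cost": ["1.0", "2.0"]})

@asset
def clean_fuel(raw_fuel: pd.DataFrame) -> pd.DataFrame:
    # Declaring raw_fuel as a parameter makes it an upstream dependency.
    out = raw_fuel.copy()
    out["cost"] = pd.to_numeric(out["cost"])
    return out

defs = Definitions(assets=[raw_fuel, clean_fuel])
```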