zaneselvans closed this issue 2 months ago
This is super helpful, thank you! Identifying parts of our process that aren't uniform seems like a good place to start.
Based on your comments above, this is what I’m currently envisioning at a high level:
Make our extraction steps more uniform by extracting our data from Zenodo and loading each partition into a database (possibly using Airbyte).
Our transformation steps will then read the raw data from the db into Python and perform some transformations:
Each transformation step could load interim tables with appropriate names back to the db. This database will contain:
Limitations:
These thoughts might be out of scope for this issue but I wanted to jot them down somewhere.
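A minimal sketch of the extract-and-load step described above, assuming pandas and SQLite; the function name, table name, and partition data are all hypothetical, not PUDL's actual API:

```python
import sqlite3

import pandas as pd

def load_raw_partition(df: pd.DataFrame, table: str, conn: sqlite3.Connection) -> None:
    """Load one raw partition into the database without any transformation."""
    df.to_sql(table, conn, if_exists="append", index=False)

conn = sqlite3.connect(":memory:")
for year in (2019, 2020):
    # In reality each partition would be extracted from a Zenodo archive.
    raw = pd.DataFrame({"report_year": [year], "value": [1.0]})
    load_raw_partition(raw, "raw_eia923__generation", conn)
```

A tool like Airbyte would replace the hand-written loop here, but the shape of the step is the same: each raw partition lands in the database untouched, and all cleaning happens downstream.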
This issue was fodder for our transition to an orchestration tool. Most of these issues/improvements were addressed by our migration to dagster.
We started building PUDL before we really understood much about ETL tools or software engineering, and as a result the data transformations that PUDL does are not particularly well organized in terms of when / where they happen, nor do they have a standard API. This is affecting our ability to refactor the code and integrate new data and functionality.
To better understand what all the moving parts are and how we might assemble them more appropriately, this issue attempts to catalog and then categorize them. This should feed into the Prefect refactoring discussion #840.
This is a work in progress.
Extraction
Inputs
ferc1, eia923, epacems
Outputs
Issues
Currently every raw table is assumed to be concatenatable along a report_date column. This need not be true but thus far generally has been. But maybe this is just something that's solved by construction -- i.e., any raw tables that can't be trivially made concatenatable get different names, and the more complex reshaping is relegated to the Transform step, which can combine multiple raw tables into a single output or split individual tables into more than one if needed.
Existing Operations:
Potential Changes / Improvements:
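To illustrate the concatenation assumption discussed under Issues, here is a sketch with hypothetical yearly partitions that share a schema and are distinguished by their report_date column (pandas assumed):

```python
import pandas as pd

# Two hypothetical yearly partitions of the same raw table.
part_2019 = pd.DataFrame({"report_date": ["2019-01-01"], "value": [1.0]})
part_2020 = pd.DataFrame({"report_date": ["2020-01-01"], "value": [2.0]})

# When the schemas match, partitions can simply be stacked.
combined = pd.concat([part_2019, part_2020], ignore_index=True)
```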
Transformation & Tidying
Basic cleaning and reshaping of the data, getting columns to the point of having uniform data types, physical units, codes, etc.
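A sketch of what this kind of tidying can look like, with hypothetical column names and an illustrative kWh-to-MWh conversion (pandas assumed; this is not PUDL's actual transform code):

```python
import numpy as np
import pandas as pd

# Hypothetical raw column where "." marks a missing value.
raw = pd.DataFrame({"net_generation_kwh": ["1000", ".", "2000"]})

tidy = pd.DataFrame({
    "net_generation_mwh": (
        raw["net_generation_kwh"]
        .replace(".", np.nan)      # lone periods become real NA values
        .astype(float) / 1000.0    # uniform numeric dtype and MWh units
    )
})
```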
Applying transformations at different stages / scopes
Examples:
pudl.helpers.fix_eia_na() applies to pretty much any spreadsheet-based EIA data, across all columns and tables, and replaces any lone-period or entirely-whitespace cell with a real NA value. This could be automatically applied to any EIA data when it's first extracted, or to all the dataframes after they've been concatenated, rather than within individual transformation functions.
Normalization / Entity Resolution / Harvesting
Reconciliation of data that is reported in multiple places inconsistently. Removal of duplicated information so that the output DB has a single source of truth.
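One simple form of such reconciliation is a majority vote across reports of the same attribute. This sketch uses hypothetical data and is not PUDL's actual harvesting logic:

```python
import pandas as pd

# The same plant attribute is reported in several places, inconsistently.
reported = pd.DataFrame({
    "plant_id": [1, 1, 1, 2, 2],
    "capacity_mw": [100.0, 100.0, 110.0, 50.0, 50.0],
})

# "Harvest" a single value per plant by taking the most common report,
# yielding one source of truth for the output DB.
harvested = (
    reported.groupby("plant_id")["capacity_mw"]
    .agg(lambda s: s.mode().iloc[0])
    .reset_index()
)
```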
Data Repair
Replacement of obvious outliers and missing values with best estimates to provide as complete a dataset as possible.
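As an illustration only (hypothetical series and threshold; not PUDL's actual repair methods), one might flag gross outliers as missing and then interpolate:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series with a missing value and an obvious outlier.
s = pd.Series([10.0, 11.0, np.nan, 13.0, 500.0, 15.0])

# Treat values more than 10x the median as outliers, mask them as missing,
# then fill all gaps by linear interpolation.
median = s.median()
repaired = s.mask(s > 10 * median).interpolate()
```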
Integration
Integration tasks link together cleaned data so that they can be used together more effectively. This can also involve linking together records within a single dataset.
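One way such linking might look, sketched with hypothetical IDs and table shapes rather than PUDL's actual tables:

```python
import pandas as pd

# Hypothetical crosswalk: plant records from two datasets that have been
# assigned a shared plant_id_pudl value.
ferc1_plants = pd.DataFrame({"ferc1_plant_id": [10, 11], "plant_id_pudl": [1, 2]})
eia_plants = pd.DataFrame({"eia_plant_id": [900, 901], "plant_id_pudl": [2, 1]})

# The shared ID lets records from the two datasets be used together.
linked = ferc1_plants.merge(eia_plants, on="plant_id_pudl")
```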
For example, associating plant_id_pudl and utility_id_pudl values with individual plants and utilities found in the FERC 1 and EIA datasets.
Output / Display
Assembly of existing data for easy use or interpretation by humans. This type of operation can typically be accomplished using straightforward SQL.
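For example, a denormalized display table can be assembled with a plain SQL join; the schema and data here are purely illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plants (plant_id INTEGER, plant_name TEXT, utility_id INTEGER)")
conn.execute("CREATE TABLE utilities (utility_id INTEGER, utility_name TEXT)")
conn.execute("INSERT INTO plants VALUES (1, 'Comanche', 100)")
conn.execute("INSERT INTO utilities VALUES (100, 'Xcel Energy')")

# Join the normalized tables into a human-friendly view.
rows = conn.execute(
    """
    SELECT p.plant_name, u.utility_name
    FROM plants p
    JOIN utilities u USING (utility_id)
    """
).fetchall()
```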
Analysis
More complex analysis that generates new knowledge unavailable directly elsewhere. These operations benefit from having as clean and complete inputs as possible.