catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Reduce coupling in ETL and shift toward EtLT #1315

Closed by zaneselvans 2 months ago

zaneselvans commented 2 years ago

Description

Coupling between different datasets and between different phases of the data processing has often been a limiting factor in our ability to publish new data quickly, or to work on several different aspects of data integration in parallel. This epic is focused on refactoring the pipeline to reduce coupling so we can integrate new data, and more datasets, more quickly and less monolithically.
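As a rough illustration of the EtLT shape named in the title, here is a minimal sketch, not PUDL's actual code. Everything in it is hypothetical: the dataset names, the columns, the helper functions, and the choice of an in-memory SQLite database as the interim warehouse. It assumes each dataset only needs a light, dataset-local cleanup before loading, and that heavier transformations can run later against the loaded tables.

```python
import sqlite3

# Hypothetical raw records for two independent datasets (illustrative only).
RAW = {
    "dataset_a": [{"Plant ID": "1", "Capacity MW": " 10.5 "}],
    "dataset_b": [{"Utility ID": "7", "Name": " Example Co "}],
}


def extract(name):
    """E: pull raw records for a single dataset."""
    return RAW[name]


def light_transform(rows):
    """t: dataset-local cleanup only (normalize column names, strip whitespace)."""
    return [
        {k.strip().lower().replace(" ", "_"): v.strip() for k, v in row.items()}
        for row in rows
    ]


def load(conn, name, rows):
    """L: write the lightly cleaned rows into an interim warehouse table."""
    cols = list(rows[0])
    conn.execute(f"CREATE TABLE {name} ({', '.join(c + ' TEXT' for c in cols)})")
    conn.executemany(
        f"INSERT INTO {name} VALUES ({', '.join('?' for _ in cols)})",
        [tuple(r[c] for c in cols) for r in rows],
    )


def run_dataset(conn, name):
    """Run E, t, and L for one dataset without touching any other dataset."""
    load(conn, name, light_transform(extract(name)))


conn = sqlite3.connect(":memory:")
for name in RAW:
    run_dataset(conn, name)

# T: heavier transformation deferred until after loading, run against the warehouse.
capacity = conn.execute(
    "SELECT plant_id, CAST(capacity_mw AS REAL) FROM dataset_a"
).fetchall()
```

Because `run_dataset` depends only on its own dataset, each dataset can be added or reworked independently, and the deferred step reads from the warehouse rather than from another dataset's in-memory output.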

Motivation

Scope

Processing that could probably happen before loading into an interim data warehouse:

Data processing steps that could be deferred until after a simplified data warehouse has been loaded:

Logistics

Billing

A new time-tracking project should be created for this Epic, taking time from the Sloan Maintenance & Refactoring category.

Priority/Timing

This should probably happen in conjunction with, or shortly after, getting the Prefect-based pipeline up and running nightly.

bendnorman commented 2 years ago

Should we add our Airbyte and Dagster research issues to this?

zaneselvans commented 2 years ago

Hmm. Yeah that does seem right. And maybe my "enumerate all the transformations" discussion / design issue too.

bendnorman commented 2 months ago

This issue is out of date. The design issues it mentions were mostly addressed by our migration to Dagster.