Closed by zaneselvans 2 months ago
Should we add our airbyte and dagster research issues to this?
Hmm. Yeah that does seem right. And maybe my "enumerate all the transformations" discussion / design issue too.
This issue is out of date. The design issues mentioned in this issue were mostly addressed by our migration to dagster.
Description
Coupling between different datasets and between different phases of the data processing has often limited our ability to publish new data quickly, or to work on several different aspects of data integration in parallel. This epic is focused on refactoring the pipeline to reduce that coupling so we can integrate new data and additional datasets more quickly, in smaller and more independent pieces.
Motivation
Scope
Processing that could probably happen before loading into an interim data warehouse:
Data processing steps that could be deferred until after a simplified data warehouse has been loaded:
Logistics
Billing
A new time-tracking project should be created for this Epic, taking time from the Sloan Maintenance & Refactoring category.
Priority/Timing
This should probably happen in conjunction with, or shortly after, getting the Prefect-based pipeline up and running nightly.