Open TrentonBush opened 6 months ago
I think the two main goals of retaining multiple versions of data are:
Is it a cop-out to say we should do all three: version inputs, version outputs, and directly integrate historical data :)? Saving versioned input data requires integrating historical data into our data warehouse. Versioning our outputs will make it easier to debug data quality changes between code updates and allow users to pin to specific versions.
Versioning input and output data seems like low effort and high reward to me (we kind of already store all historical versions of our input data, but it could use some cleanup). Directly integrating historical data into slowly changing dimension tables would make it easy to calculate trends in our core datasets.
I'm imagining we could use Type 2 Slowly Changing Dimensions to model our key datasets:
- project_id, data_version
- queue_id, iso_region, data_version (or date_asof)
- race_id, data_version (or date_asof)

Some random considerations:
To make it easier to compare changes between dataset versions, we will save a directory of parquet files for each data warehouse and mart table to a cloud bucket for each "release" we do. A release happens when we merge dev into main: a GitHub tag is created and data is pushed to the prod BigQuery datasets.
Create a template notebook that compares two tables from different versions.
Create a Type 2 slowly changing dimension (SCD) table for the queue data. This will allow users to monitor changes in projects over time. We'll need to figure out how to integrate multiple years of data while minimizing code duplication and bespoke cleaning steps.
Create SCDs for our other key datasets: EIP, Ballot Ready, Ordinances.
Create SCDs for all of our other datasets and develop a system for creating a versioned collection of raw datasets, like we do in PUDL.
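To make the Type 2 SCD idea above concrete, here is a minimal sketch of building one from successive data versions. All column names, values, and the `build_scd2` helper are invented for illustration; the real queue schema and cleaning steps would differ.

```python
import pandas as pd

# Hypothetical queue snapshots from two data versions; columns are
# illustrative, not the real warehouse schema.
v1 = pd.DataFrame({
    "queue_id": ["q1", "q2"],
    "iso_region": ["MISO", "PJM"],
    "capacity_mw": [100.0, 50.0],
    "data_version": ["2023-01", "2023-01"],
})
v2 = pd.DataFrame({
    "queue_id": ["q1", "q2"],
    "iso_region": ["MISO", "PJM"],
    "capacity_mw": [100.0, 75.0],  # q2 changed between versions
    "data_version": ["2023-07", "2023-07"],
})

def build_scd2(snapshots: list[pd.DataFrame], key: list[str]) -> pd.DataFrame:
    """Stack snapshots and keep one row per distinct entity state, tagged
    with the first data_version in which that state appeared. (Simplified:
    a state that reverts to an earlier value would be collapsed.)"""
    stacked = pd.concat(snapshots, ignore_index=True).sort_values("data_version")
    attrs = [c for c in stacked.columns if c not in key + ["data_version"]]
    deduped = stacked.drop_duplicates(subset=key + attrs, keep="first")
    return deduped.rename(columns={"data_version": "valid_from"})

scd = build_scd2([v1, v2], key=["queue_id", "iso_region"])
# q1 keeps one row (unchanged); q2 gets one row per distinct capacity state
```

A production version would also want a valid_to (or is_current) column, which can be derived by shifting valid_from within each key group.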
Spun off from the analysis-focused #301. How should we redesign this system to store previous versions of the data? And not just for ISO queues.
Versioning data inputs vs outputs
This decision is a tradeoff between engineering effort and analytical flexibility. An example of an analysis task that would be impacted by this decision is the following:
Output-oriented versioning
Versioning data outputs means we simply save the outputs of our ETL. We can then compare those outputs to analyze changes between versions. But this system constrains some analyses. In the example analysis task above, we need to re-process historical data with updated code -- this is impossible with only saved outputs.
Input-oriented versioning
Versioning data inputs allows greater analytical flexibility because we can re-interpret old data. We already version our data inputs, with the exception of data linked via Coupler.
Thanks to our input data archives, it is possible, though not easy, to re-process historical data. For the example analysis task above, we would have to:
This is a lot of work! But with more engineering effort, we can streamline this process.
Hybrid: directly integrating historical data
The highest capability option is to directly integrate historical data into our data model. This would allow us to re-interpret old data without the hassle of fetching and modifying old code. This is, however, a much larger scope of engineering work -- ETL code would need to run not just on the latest data, but on all historical data as well.
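A sketch of that larger scope: run the same transform over every archived input version and stack the results. `extract` and `transform` are stand-ins for the real ETL steps, and the archived raw data is invented for illustration.

```python
import pandas as pd

# Stand-in for reading an archived raw input for a given version.
_ARCHIVE = {
    "2023-01": pd.DataFrame({"id": [1], "mw": [100]}),
    "2023-07": pd.DataFrame({"id": [1, 2], "mw": [100, 50]}),
}

def extract(version: str) -> pd.DataFrame:
    return _ARCHIVE[version].copy()

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the real cleaning/standardization logic.
    return df.rename(columns={"mw": "capacity_mw"})

def run_all_versions(versions: list[str]) -> pd.DataFrame:
    """Run the same transform over every archived version and stack the
    results, tagging each row with its data_version."""
    frames = []
    for v in versions:
        out = transform(extract(v))
        out["data_version"] = v
        frames.append(out)
    return pd.concat(frames, ignore_index=True)

history = run_all_versions(["2023-01", "2023-07"])
```

The hard part this sketch glosses over is that `transform` must handle every historical schema and quirk, not just the latest one.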
Due to the engineering effort required, this option is best reserved for only the highest impact or most frequently changing datasets. What are those?
What temporal change do we want to capture?
High actionability, high frequency (sub-annual) changes:
High actionability, low frequency (super-annual) changes:
Low actionability, high frequency (sub-annual) changes:
Low actionability, low frequency (super-annual) changes:
Drawing a line between "local ordinances" and "existing generation", I count ~9 datasets that we want to track longitudinally and ~7 that are less important to track.
Value considerations:
Is this worth doing or should we just snapshot some outputs?
Data engineering Qs: