deployment-gap-model-education-fund / deployment-gap-model

ETL code for the Deployment Gap Model Education Fund
https://www.deploymentgap.fund/
MIT License

Scope out monthly ISO queue integration #300

Open TrentonBush opened 6 months ago

TrentonBush commented 6 months ago

Spun off from the analysis-focused #301. How should we redesign this system to store previous versions of the data? And not just for ISO queues.

Versioning data inputs vs outputs

This decision is a tradeoff between engineering effort and analytical flexibility. An example of an analysis task this decision would affect:

Oh no! We just realized that our ISO queue deduplication was too aggressive, and we have been discarding lots of important projects. Did that mistake have a big impact on the trends we observed in renewable deployment?

Output-oriented versioning

Versioning data outputs means we simply save the outputs of our ETL. We can then compare those outputs to analyze changes between versions. But this system constrains some analyses. In the example analysis task above, we need to re-process historical data with updated code -- this is impossible with only saved outputs.
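
As a rough sketch of what that comparison looks like in practice (the file paths and the `project_id` key column are placeholders, not the repo's actual layout):

```python
# Sketch: diff two saved versions of an ETL output table.
# File paths and the project_id key column are placeholders.
import pandas as pd

old = pd.read_parquet("outputs/2023-10/iso_projects.parquet")
new = pd.read_parquet("outputs/2023-11/iso_projects.parquet")

old_ids = set(old["project_id"])
new_ids = set(new["project_id"])
print(f"projects added:   {len(new_ids - old_ids)}")
print(f"projects dropped: {len(old_ids - new_ids)}")
print(f"projects in both: {len(old_ids & new_ids)}")
```

This kind of diff tells us what changed between published versions, but it cannot answer the deduplication question above: it can only compare what the code produced at the time, not what updated code would have produced.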

Input-oriented versioning

Versioning data inputs allows greater analytical flexibility because we can re-interpret old data. We already version our data inputs, with the exception of data linked via Coupler.

Thanks to our input data archives, it is possible, though not easy, to re-process historical data. For the example analysis task above, we would have to:

This is a lot of work! But with more engineering effort, we can streamline this process.
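
At its core, the re-interpretation step is just "read the archived raw data, run it through the current cleaning code" -- a minimal sketch, assuming date-stamped raw archives and a stand-in `clean_iso_queue()` function (both hypothetical):

```python
# Sketch: re-interpret one archived raw snapshot with current cleaning code.
# The archive path and clean_iso_queue() are hypothetical placeholders.
import pandas as pd

def clean_iso_queue(raw: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the current (less aggressive) deduplication logic."""
    return raw.drop_duplicates(subset=["project_id"])

# Raw queue data archived when it was originally downloaded.
raw_2022 = pd.read_parquet("data/raw/iso_queues/2022-06-01/iso_queue.parquet")

# Re-process it with today's code and compare against the output published then.
reprocessed = clean_iso_queue(raw_2022)
published = pd.read_parquet("outputs/2022-06/iso_projects.parquet")
print(f"projects recovered by the fix: {len(reprocessed) - len(published)}")
```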

Hybrid: directly integrating historical data

The highest capability option is to directly integrate historical data into our data model. This would allow us to re-interpret old data without the hassle of fetching and modifying old code. This is, however, a much larger scope of engineering work -- ETL code would need to run not just on the latest data, but on all historical data as well.
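
A minimal sketch of that integration, assuming the same hypothetical archive layout and cleaning function as above, with every vintage tagged by its snapshot date:

```python
# Sketch: integrate every archived snapshot into the data model, tagged by
# snapshot date. The archive layout and clean_iso_queue() are hypothetical.
from pathlib import Path

import pandas as pd

def clean_iso_queue(raw: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the shared cleaning logic applied to every vintage."""
    return raw.drop_duplicates(subset=["project_id"])

archive_root = Path("data/raw/iso_queues")  # one subdirectory per snapshot date
snapshots = []
for snapshot_dir in sorted(archive_root.iterdir()):
    cleaned = clean_iso_queue(pd.read_parquet(snapshot_dir / "iso_queue.parquet"))
    cleaned["snapshot_date"] = pd.Timestamp(snapshot_dir.name)
    snapshots.append(cleaned)

# One long table of all vintages; downstream models (e.g. an SCD) build on this.
iso_queue_history = pd.concat(snapshots, ignore_index=True)
```

The loop itself is simple; the cost is in making the shared cleaning code handle every historical vintage.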

Due to the engineering effort required, this option is best reserved for only the highest impact or most frequently changing datasets. What are those?

What temporal change do we want to capture?

High actionability, high frequency (sub-annual) changes:

High actionability, low frequency (super-annual) changes:

Low actionability, high frequency (sub-annual) changes:

Low actionability, low frequency (super-annual) changes:

Drawing a line between "local ordinances" and "existing generation", I count ~9 datasets that we want to track longitudinally and ~7 that are less important to track.

Value considerations:

Is this worth doing or should we just snapshot some outputs?

data engineering Qs:

* how to represent this data? Slowly Changing Dimensions (SCD)? (see the sketch after this list)
* do we have consistent IDs or do we need to do record linkage?
* which columns to integrate?
* what other considerations?
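
On the representation question, a Type 2 SCD keeps one row per project per period during which its tracked attributes were stable. A minimal illustrative schema (table and column names are assumptions, not existing tables in this repo):

```python
# Sketch: what Type 2 SCD rows for a queue project could look like.
# Table and column names are illustrative only.
import pandas as pd

iso_projects_scd = pd.DataFrame(
    {
        "project_id": ["Q123", "Q123", "Q456"],
        "queue_status": ["active", "withdrawn", "active"],
        "capacity_mw": [100.0, 100.0, 250.0],
        "valid_from": pd.to_datetime(["2022-01-01", "2023-03-01", "2022-06-01"]),
        "valid_to": pd.to_datetime(["2023-03-01", None, None]),
        "is_current": [False, True, True],
    }
)
# Each attribute change closes the old row (valid_to set) and opens a new one,
# so point-in-time questions become a filter on valid_from/valid_to.
```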
bendnorman commented 5 months ago

I think the two main goals of retaining multiple versions of data are:

  1. be able to say "X,Y,Z changed between update A and B"
  2. calculate metrics for certain entities. For example: "% change in actionable projects in each county" (see the sketch below)
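
Goal 2 mostly reduces to running the same groupby against two saved versions; a sketch assuming hypothetical per-release parquet outputs with `county` and `is_actionable` columns:

```python
# Sketch: "% change in actionable projects in each county" between two releases.
# File paths and column names ("county", "is_actionable") are assumptions.
import pandas as pd

def actionable_by_county(path: str) -> pd.Series:
    df = pd.read_parquet(path)
    return df.loc[df["is_actionable"]].groupby("county").size()

old = actionable_by_county("outputs/v2023.10/iso_projects.parquet")
new = actionable_by_county("outputs/v2023.11/iso_projects.parquet")

# Counties new in this release drop out (no baseline); counties that lost all
# actionable projects show up as -100%.
pct_change = (new.sub(old, fill_value=0) / old * 100).dropna()
print(pct_change.sort_values(ascending=False).head())
```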

Is it a cop-out to say we should do all three: version inputs, version outputs, and directly integrate historical data :)? Saving versioned input data requires integrating historical data into our data warehouse. Versioning our outputs will make it easier to debug data quality changes between code updates and allow users to pin to specific versions.

Versioning input and output data seems like low effort and high reward to me (we kind of already store all historic versions of our input data but it could use some clean up). Directly integrating historical data into slowly changing dimension tables would make it easy to calculate trends in our core datasets.

I'm imagining we could use Type 2 Slowly Changing Dimensions to model our key datasets:

Some random considerations:

bendnorman commented 5 months ago

Phase 1

To make it easier to compare changes in datasets, we will save a directory of parquet files for each data warehouse and mart table to a cloud bucket for each "release" we do. A release happens when we merge dev into main: a GitHub tag is created and the data is pushed to the prod BigQuery datasets.

Create a template notebook that compares two tables from different versions.
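
A sketch of the Phase 1 mechanics, assuming a GCS-style bucket keyed by the git tag (bucket name, tag format, and table names are placeholders):

```python
# Sketch: save each warehouse/mart table as parquet under a per-release prefix,
# keyed by the git tag. Bucket name, tag, and table names are placeholders.
import pandas as pd

RELEASE_TAG = "v2023.11.01"   # the GitHub tag created when dev merges into main
BUCKET = "gs://dgm-archive"   # hypothetical cloud bucket

def publish_release(tables: dict[str, pd.DataFrame]) -> None:
    for name, df in tables.items():
        # pandas can write directly to GCS paths if gcsfs is installed
        df.to_parquet(f"{BUCKET}/{RELEASE_TAG}/{name}.parquet")

def compare_releases(table: str, tag_a: str, tag_b: str) -> pd.DataFrame:
    """Skeleton of the template-notebook comparison: row counts per release."""
    a = pd.read_parquet(f"{BUCKET}/{tag_a}/{table}.parquet")
    b = pd.read_parquet(f"{BUCKET}/{tag_b}/{table}.parquet")
    return pd.DataFrame({"release": [tag_a, tag_b], "n_rows": [len(a), len(b)]})
```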

Phase 2

Create a Type 2 slowly changing dimension (SCD) table for the queue data. This will allow users to monitor changes in projects over time. We'll need to figure out how to integrate multiple years of data while minimizing code duplication and bespoke cleaning steps.
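
A rough sketch of the SCD build step, assuming cleaned queue snapshots with a stable `project_id`, a `snapshot_date` column, and a small set of tracked attributes (all names illustrative):

```python
# Sketch: fold successive queue snapshots into a Type 2 SCD table.
# project_id, the tracked columns, and the snapshot format are assumptions.
import pandas as pd

TRACKED = ["queue_status", "capacity_mw"]  # attributes whose changes we record

def build_scd(snapshots: list[pd.DataFrame]) -> pd.DataFrame:
    """Each snapshot needs project_id, the TRACKED columns, and snapshot_date."""
    long = pd.concat(snapshots, ignore_index=True).sort_values(
        ["project_id", "snapshot_date"]
    )
    # Keep a row only when the tracked attributes differ from the project's
    # previous row (a project's first snapshot always counts as a change).
    changed = long.groupby("project_id")[TRACKED].shift().ne(long[TRACKED]).any(axis=1)
    scd = long.loc[changed].rename(columns={"snapshot_date": "valid_from"})
    # A row is valid until the project's next change; the latest row stays open.
    scd["valid_to"] = scd.groupby("project_id")["valid_from"].shift(-1)
    scd["is_current"] = scd["valid_to"].isna()
    return scd
```

Point-in-time queries then become a simple filter on `valid_from`/`valid_to`, and "what changed for project X over time" is just its rows sorted by `valid_from`.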

Phase 3

Create SCDs for our other key datasets: EIP, Ballot Ready, Ordinances.

Phase 4

Create SCDs for all of our other datasets and develop a system for creating a versioned collection of raw datasets, like we do in PUDL.