Open TrentonBush opened 6 months ago
I think the two main goals of retaining multiple versions of data are:
Is it a cop-out to say we should do all three: version inputs, version outputs, and directly integrate historical data :)? Saving versioned input data requires integrating historical data into our data warehouse. Versioning our outputs will make it easier to debug data quality changes between code updates and allow users to pin to specific versions.
Versioning input and output data seems like low effort and high reward to me (we kind of already store all historical versions of our input data, but it could use some cleanup). Directly integrating historical data into slowly changing dimension tables would make it easy to calculate trends in our core datasets.
I'm imagining we could use Type 2 Slowly Changing Dimensions to model our key datasets:
- project_id, data_version
- queue_id, iso_region, data_version (or date_asof)
- race_id, data_version (or date_asof)

Some random considerations:
To make it easier to compare changes between dataset versions, we will save a directory of parquet files for each data warehouse and mart table to a cloud bucket for each "release" we do. A release happens when we merge dev into main: a GitHub tag is created and data is pushed to the prod BigQuery datasets.
Create a template notebook that compares two tables from different versions.
Create a Type 2 slowly changing dimension (SCD) table for the queue data. This will allow users to monitor changes in projects over time. We'll need to figure out how to integrate multiple years of data while minimizing code duplication and bespoke cleaning steps.
Create SCDs for our other key datasets: EIP, Ballot Ready, Ordinances.
Create SCDs for all of our other datasets and develop a system for creating a versioned collection of raw datasets, like we do in PUDL.
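To make the Type 2 SCD idea above concrete, here is a minimal sketch of building one from successive data versions. All column names, values, and the `build_scd2` helper are invented for illustration; the real queue schema and cleaning steps would differ.

```python
import pandas as pd

# Hypothetical queue snapshots from two data versions; columns are
# illustrative, not the real warehouse schema.
v1 = pd.DataFrame({
    "queue_id": ["q1", "q2"],
    "iso_region": ["MISO", "PJM"],
    "capacity_mw": [100.0, 50.0],
    "data_version": ["2023-01", "2023-01"],
})
v2 = pd.DataFrame({
    "queue_id": ["q1", "q2"],
    "iso_region": ["MISO", "PJM"],
    "capacity_mw": [100.0, 75.0],  # q2 changed between versions
    "data_version": ["2023-07", "2023-07"],
})

def build_scd2(snapshots: list[pd.DataFrame], key: list[str]) -> pd.DataFrame:
    """Stack snapshots and keep one row per distinct entity state, tagged
    with the first data_version in which that state appeared. (Simplified:
    a state that reverts to an earlier value would be collapsed.)"""
    stacked = pd.concat(snapshots, ignore_index=True).sort_values("data_version")
    attrs = [c for c in stacked.columns if c not in key + ["data_version"]]
    deduped = stacked.drop_duplicates(subset=key + attrs, keep="first")
    return deduped.rename(columns={"data_version": "valid_from"})

scd = build_scd2([v1, v2], key=["queue_id", "iso_region"])
# q1 keeps one row (unchanged); q2 gets one row per distinct capacity state
```

A production version would also want a valid_to (or is_current) column, which can be derived by shifting valid_from within each key group.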
Spun off from the analysis-focused #301. How should we redesign this system to store previous versions of the data? And not just for ISO queues.
Versioning data inputs vs outputs
This decision is a tradeoff between engineering effort and analytical flexibility. An example of an analysis task that would be impacted by this decision is the following:
Output-oriented versioning
Versioning data outputs means we simply save the outputs of our ETL. We can then compare those outputs to analyze changes between versions. But this system constrains some analyses. In the example analysis task above, we need to re-process historical data with updated code -- this is impossible with only saved outputs.
Input-oriented versioning
Versioning data inputs allows greater analytical flexibility because we can re-interpret old data. We already version our data inputs, with the exception of data linked via Coupler.
Thanks to our input data archives, it is possible, though not easy, to re-process historical data. For the example analysis task above, we would have to:
This is a lot of work! But with more engineering effort, we can streamline this process.
Hybrid: directly integrating historical data
The highest capability option is to directly integrate historical data into our data model. This would allow us to re-interpret old data without the hassle of fetching and modifying old code. This is, however, a much larger scope of engineering work -- ETL code would need to run not just on the latest data, but on all historical data as well.
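A sketch of that larger scope: run the same transform over every archived input version and stack the results. `extract` and `transform` are stand-ins for the real ETL steps, and the archived raw data is invented for illustration.

```python
import pandas as pd

# Stand-in for reading an archived raw input for a given version.
_ARCHIVE = {
    "2023-01": pd.DataFrame({"id": [1], "mw": [100]}),
    "2023-07": pd.DataFrame({"id": [1, 2], "mw": [100, 50]}),
}

def extract(version: str) -> pd.DataFrame:
    return _ARCHIVE[version].copy()

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the real cleaning/standardization logic.
    return df.rename(columns={"mw": "capacity_mw"})

def run_all_versions(versions: list[str]) -> pd.DataFrame:
    """Run the same transform over every archived version and stack the
    results, tagging each row with its data_version."""
    frames = []
    for v in versions:
        out = transform(extract(v))
        out["data_version"] = v
        frames.append(out)
    return pd.concat(frames, ignore_index=True)

history = run_all_versions(["2023-01", "2023-07"])
```

The hard part this sketch glosses over is that `transform` must handle every historical schema and quirk, not just the latest one.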
Due to the engineering effort required, this option is best reserved for only the highest impact or most frequently changing datasets. What are those?
What temporal change do we want to capture?
High actionability, high frequency (sub-annual) changes:
High actionability, low frequency (super-annual) changes:
Low actionability, high frequency (sub-annual) changes:
Low actionability, low frequency (super-annual) changes:
Drawing a line between "local ordinances" and "existing generation", I count ~9 datasets that we want to track longitudinally and ~7 that are less important to track.
Value considerations:
Is this worth doing or should we just snapshot some outputs?
Data engineering Qs: