bendnorman opened this issue 1 year ago
The scope and work breakdown sounds right to me!
I think if Catalyst has a bigger interest in archiving, then it makes sense for Catalyst to host the archiving code and runner. That way Catalyst can extend the archive to data beyond the ISO queues or update at higher frequencies if desired. The DBCP code can just pull from Catalyst archives as if they were any other public source.
I imagine we'd want to archive different datasets separately for the availability reasons you outlined above (like if NYISO fails one particular day). I don't think there is anything tying the ISO vintages together, right? Like we could pull today's CAISO data and last week's NYISO data if we thought there was something wrong with the latest NYISO release.
I think the ETL code should be in the DBCP repo so it can focus on the specific needs of this project. At first the DBCP code can pull from pinned vintages of ISO data that is manually validated. Then we can add auto-update logic to fetch the latest and greatest versions.
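A minimal sketch of what pinned vintages could look like on the DBCP side, assuming a hypothetical Catalyst archive bucket with per-ISO, date-stamped parquet files (the bucket name, key layout, ISO list, and dates are all made up for illustration):

```python
import pandas as pd

ARCHIVE_ROOT = "gs://catalyst-iso-queue-archives"  # hypothetical bucket

# Pin each ISO to a manually validated vintage; they don't have to match,
# so we could roll NYISO back a week without touching CAISO.
PINNED_VINTAGES = {
    "caiso": "2023-06-01",
    "nyiso": "2023-05-25",
    "miso": "2023-06-01",
    "pjm": "2023-06-01",
}

def read_pinned_queue(iso: str) -> pd.DataFrame:
    """Read one ISO's queue at its pinned vintage."""
    vintage = PINNED_VINTAGES[iso]
    return pd.read_parquet(f"{ARCHIVE_ROOT}/{iso}/{vintage}.parquet")

iso_queues = pd.concat(
    [read_pinned_queue(iso) for iso in PINNED_VINTAGES], ignore_index=True
)
```

Swapping to "latest and greatest" later would just mean replacing the pinned dates with a lookup of the newest validated vintage per ISO.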
Considering some (most?) ISO queues only update monthly, I would guess this repo doesn't need daily updates. But if we can safely update daily, then sure, let's do it. My only concern is that the higher the update frequency, the more automated our data validation needs to be, or we risk breaking stuff downstream. I'm not sure yet how big a lift that is. We'd need very strict checks and would auto-update only if they pass, else revert and flag for manual review. I'd guess we could start with manual updates ~monthly until we build up a library of automated checks, then transition to higher frequency.
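Rough sketch of the "auto-update only if checks pass, else revert and flag" idea; the specific checks, column names, and thresholds below are placeholders, not a proposed validation suite:

```python
import pandas as pd

def validate_queue(new: pd.DataFrame, old: pd.DataFrame) -> list[str]:
    """Return failure messages; an empty list means the update looks safe."""
    failures = []
    if new.empty:
        failures.append("new queue snapshot is empty")
    # Row counts shouldn't swing wildly between monthly snapshots.
    elif len(new) < 0.5 * len(old) or len(new) > 2 * len(old):
        failures.append(f"row count changed from {len(old)} to {len(new)}")
    # Required columns must survive upstream schema changes.
    required = {"queue_id", "capacity_mw", "queue_date"}
    missing = required - set(new.columns)
    if missing:
        failures.append(f"missing columns: {sorted(missing)}")
    return failures

def maybe_update(new: pd.DataFrame, old: pd.DataFrame) -> pd.DataFrame:
    """Auto-update only if every check passes; otherwise keep the old vintage."""
    failures = validate_queue(new, old)
    if failures:
        # In practice this would open an issue or ping someone for manual review.
        print("auto-update rejected:", "; ".join(failures))
        return old
    return new
```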
I agree that we need to compare the Grid Status data to the existing LBNL version. I think the ISO queues include both a queue entry and exit date, so hopefully we can filter based on those and reproduce the LBNL version. If not, we'll have to do some error analysis and figure out how/why they differ and whether we can live with it.
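For that filtering idea, something like the sketch below could approximate an LBNL-style snapshot from the Grid Status queues, assuming datetime entry/exit columns; the column names and cutoff date are placeholders since I haven't checked the exact gridstatus schema:

```python
import pandas as pd

# Date the LBNL release reflects; placeholder value.
LBNL_AS_OF = pd.Timestamp("2022-12-31")

def as_of_snapshot(gs_queue: pd.DataFrame) -> pd.DataFrame:
    """Keep projects that had entered the queue but not yet exited as of LBNL_AS_OF."""
    entered = gs_queue["queue_date"] <= LBNL_AS_OF
    # A project has "exited" if it was withdrawn or completed on or before the cutoff;
    # NaT (still active) compares as False, which is what we want here.
    exited = (
        gs_queue[["withdrawn_date", "actual_completion_date"]]
        .min(axis=1)
        .le(LBNL_AS_OF)
    )
    return gs_queue[entered & ~exited]
```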
The LBNL data includes a column with the source ISO (or non-ISO), so I expect we can simply filter for only non-ISO data and combine it with the latest ISO data from Grid Status.
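And the combine step could be as simple as this, assuming the LBNL table has a region-type column distinguishing ISO from non-ISO rows (the column name and values here are guesses):

```python
import pandas as pd

def combine_queues(lbnl: pd.DataFrame, gridstatus_iso: pd.DataFrame) -> pd.DataFrame:
    """Keep LBNL's non-ISO rows and append the fresher ISO rows from Grid Status."""
    non_iso = lbnl[lbnl["region_type"] != "iso"]  # placeholder column/value
    return pd.concat([non_iso, gridstatus_iso], ignore_index=True)
```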
Oh are the Catalyst archives not public or "requester pays"? I'd expect this data to be < 5MB uncompressed.
I think we'll want to keep these archives private for now.
Tasks
- [ ] `iso_projects_long_format` table
- [ ] `is_actionable` projects in CAISO in GS than in LBNL
- [ ] `is_nearly_certain` projects in MISO
- [ ] `is_nearly_certain` projects in LBNL NYISO. There are 47 in GS.
- [ ] `interconnection_status` column for ISOs that have status spread across multiple columns.
- [ ] `test_iso_projects_data_mart_aggregates_are_close` test to accommodate for gridstatus data. @TrentonBush

Questions
Scope
Minimum viable scope:
Things to consider:
Integration
Validation