bendnorman opened this issue 1 year ago
The scope and work breakdown sounds right to me!
I think if Catalyst has a bigger interest in archiving, then it makes sense for Catalyst to host the archiving code and runner. That way Catalyst can extend the archive to data beyond the ISO queues or update at higher frequencies if desired. The DBCP code can just pull from Catalyst archives as if they were any other public source.
I imagine we'd want to archive different datasets separately for the availability reasons you outlined above (like if NYISO fails one particular day). I don't think there is anything tying the ISO vintages together, right? Like we could pull today's CAISO data and last week's NYISO data if we thought there was something wrong with the latest NYISO release.
I think the ETL code should be in the DBCP repo so it can focus on the specific needs of this project. At first the DBCP code can pull from pinned vintages of ISO data that is manually validated. Then we can add auto-update logic to fetch the latest and greatest versions.
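A minimal sketch of what pinned vintages could look like on the DBCP side, assuming a hypothetical Catalyst archive bucket with per-ISO, date-stamped parquet files (the bucket name, key layout, ISO list, and dates are all made up for illustration):

```python
import pandas as pd

ARCHIVE_ROOT = "gs://catalyst-iso-queue-archives"  # hypothetical bucket

# Pin each ISO to a manually validated vintage; they don't have to match,
# so we could roll NYISO back a week without touching CAISO.
PINNED_VINTAGES = {
    "caiso": "2023-06-01",
    "nyiso": "2023-05-25",
    "miso": "2023-06-01",
    "pjm": "2023-06-01",
}

def read_pinned_queue(iso: str) -> pd.DataFrame:
    """Read one ISO's queue at its pinned vintage."""
    vintage = PINNED_VINTAGES[iso]
    return pd.read_parquet(f"{ARCHIVE_ROOT}/{iso}/{vintage}.parquet")

iso_queues = pd.concat(
    [read_pinned_queue(iso) for iso in PINNED_VINTAGES], ignore_index=True
)
```

Swapping to "latest and greatest" later would just mean replacing the pinned dates with a lookup of the newest validated vintage per ISO.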
Considering some (most?) ISO queues only update monthly, I would guess this repo doesn't need daily updates. But if we can safely update daily, then sure, let's do it. My only concern is that the higher the update frequency, the more automated our data validation needs to be, or we risk breaking stuff downstream. I'm not sure yet how big a lift that is. We'd need very strict checks and would auto-update only if they pass, else revert and flag for manual review. I'd guess we could start with manual updates ~monthly until we build up a library of automated checks, then transition to higher frequency.
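Rough sketch of the "auto-update only if checks pass, else revert and flag" idea; the specific checks, column names, and thresholds below are placeholders, not a proposed validation suite:

```python
import pandas as pd

def validate_queue(new: pd.DataFrame, old: pd.DataFrame) -> list[str]:
    """Return failure messages; an empty list means the update looks safe."""
    failures = []
    if new.empty:
        failures.append("new queue snapshot is empty")
    # Row counts shouldn't swing wildly between monthly snapshots.
    elif len(new) < 0.5 * len(old) or len(new) > 2 * len(old):
        failures.append(f"row count changed from {len(old)} to {len(new)}")
    # Required columns must survive upstream schema changes.
    required = {"queue_id", "capacity_mw", "queue_date"}
    missing = required - set(new.columns)
    if missing:
        failures.append(f"missing columns: {sorted(missing)}")
    return failures

def maybe_update(new: pd.DataFrame, old: pd.DataFrame) -> pd.DataFrame:
    """Auto-update only if every check passes; otherwise keep the old vintage."""
    failures = validate_queue(new, old)
    if failures:
        # In practice this would open an issue or ping someone for manual review.
        print("auto-update rejected:", "; ".join(failures))
        return old
    return new
```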
I agree that we need to compare the Grid Status data to the existing LBNL version. I think the ISO queues include both a queue entry and exit date, so hopefully we can filter based on those and reproduce the LBNL version. If not, we'll have to do some error analysis and figure out how/why they differ and whether we can live with it.
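For that filtering idea, something like the sketch below could approximate an LBNL-style snapshot from the Grid Status queues, assuming datetime entry/exit columns; the column names and cutoff date are placeholders since I haven't checked the exact gridstatus schema:

```python
import pandas as pd

# Date the LBNL release reflects; placeholder value.
LBNL_AS_OF = pd.Timestamp("2022-12-31")

def as_of_snapshot(gs_queue: pd.DataFrame) -> pd.DataFrame:
    """Keep projects that had entered the queue but not yet exited as of LBNL_AS_OF."""
    entered = gs_queue["queue_date"] <= LBNL_AS_OF
    # A project has "exited" if it was withdrawn or completed on or before the cutoff;
    # NaT (still active) compares as False, which is what we want here.
    exited = (
        gs_queue[["withdrawn_date", "actual_completion_date"]]
        .min(axis=1)
        .le(LBNL_AS_OF)
    )
    return gs_queue[entered & ~exited]
```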
The LBNL data includes a column with the source ISO (or non-ISO), so I expect we can simply filter for only non-ISO data and combine it with the latest ISO data from Grid Status.
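And the combine step could be as simple as this, assuming the LBNL table has a region-type column distinguishing ISO from non-ISO rows (the column name and values here are guesses):

```python
import pandas as pd

def combine_queues(lbnl: pd.DataFrame, gridstatus_iso: pd.DataFrame) -> pd.DataFrame:
    """Keep LBNL's non-ISO rows and append the fresher ISO rows from Grid Status."""
    non_iso = lbnl[lbnl["region_type"] != "iso"]  # placeholder column/value
    return pd.concat([non_iso, gridstatus_iso], ignore_index=True)
```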
Oh are the Catalyst archives not public or "requester pays"? I'd expect this data to be < 5MB uncompressed.
I think we'll want to keep these archives private for now.
Tasks
- [ ] `iso_projects_long_format` table
- [ ] `is_actionable` projects in CAISO in GS than in LBNL
- [ ] `is_nearly_certain` projects in MISO
- [ ] `is_nearly_certain` projects in LBNL NYISO. There are 47 in GS.
- [ ] `interconnection_status` column for ISOs that have status spread across multiple columns.
- [ ] `test_iso_projects_data_mart_aggregates_are_close` test to accommodate for gridstatus data. @TrentonBush

Questions
Scope
Minimum viable scope:
Things to consider:
Integration
Validation