Create an action that archives the gridstatus ISO queue data in a cloud bucket daily. Bonus points: archive all gridstatus data.
## Steps
- Create a script that grabs the ISO queue data for each ISO.
- The script should be designed so we can pass in different bucket destinations.
- It should also be designed so we can easily archive additional gridstatus datasets (a sketch follows this list).
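A minimal sketch of what the script could look like. It assumes gridstatus's `get_interconnection_queue()` method and the ISO class names listed below, and that pyarrow and gcsfs are installed so pandas can write parquet straight to GCS:

```python
import argparse
import logging

import gridstatus

logger = logging.getLogger(__name__)

# The ISOs to archive; extend this list (or swap in other gridstatus
# methods) to archive additional datasets.
ISO_CLASSES = [
    gridstatus.CAISO,
    gridstatus.Ercot,
    gridstatus.ISONE,
    gridstatus.MISO,
    gridstatus.NYISO,
    gridstatus.PJM,
    gridstatus.SPP,
]


def archive_iso_queues(bucket_uri: str) -> None:
    """Fetch each ISO's interconnection queue and write it to bucket_uri."""
    for iso_cls in ISO_CLASSES:
        try:
            queue = iso_cls().get_interconnection_queue()
        except Exception:
            # Log the failure but keep going so one flaky ISO doesn't
            # kill the whole run.
            logger.exception("Failed to fetch the %s queue", iso_cls.__name__)
            continue
        queue.to_parquet(f"{bucket_uri}/{iso_cls.__name__.lower()}.parquet")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--bucket", required=True, help="e.g. gs://my-bucket/interconnection_queues"
    )
    logging.basicConfig(level=logging.INFO)
    archive_iso_queues(parser.parse_args().bucket)
```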
Archive the data using GCP Object Versioning. This feature assigns a unique generation ID to each previous version of an object, and in the ETL we pin a specific generation of each ISO queue. The bucket structure will end up looking something like this (hypothetical names; each object's older generations are retained invisibly alongside the live version):
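```
gs://gridstatus-archive/interconnection_queues/
├── caiso.parquet
├── ercot.parquet
├── isone.parquet
├── miso.parquet
├── nyiso.parquet
├── pjm.parquet
└── spp.parquet
```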
You can view the previous versions of the files using the GCS UI or by running this gsutil command:
```
gsutil ls -a gs://path/to/gcs_object
```
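In the ETL, pinning could look something like this sketch using the google-cloud-storage client; the bucket name, object path, and generation number below are placeholders:

```python
import io

import pandas as pd
from google.cloud import storage

# Placeholder: swap in the generation number of the snapshot to pin.
PINNED_GENERATION = 1658246370179414

client = storage.Client()
bucket = client.bucket("gridstatus-archive")
# Passing generation= addresses that specific archived version of the object.
blob = bucket.blob("interconnection_queues/nyiso.parquet", generation=PINNED_GENERATION)

# Download the pinned generation and load it as a dataframe.
nyiso_queue = pd.read_parquet(io.BytesIO(blob.download_as_bytes()))
```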
- The dataframes should be stored as parquet files.
- Create a GitHub Action that runs daily or weekly (see the workflow sketch after this list).
- We'll also need logic that catches any gridstatus failure and logs the error without killing the whole process (as in the script sketch above).
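A minimal sketch of the scheduled workflow. The script name, secret name, and cron schedule are assumptions, and GCP authentication (e.g. via google-github-actions/auth) is omitted for brevity:

```yaml
name: archive-gridstatus

on:
  schedule:
    - cron: "0 6 * * *" # daily at 06:00 UTC; use "0 6 * * 1" for weekly
  workflow_dispatch: # allow manual runs while testing

jobs:
  archive:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install gridstatus pandas pyarrow gcsfs
      - name: Archive ISO queues
        run: python archive_iso_queues.py --bucket "${{ secrets.ARCHIVE_BUCKET_URI }}"
```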
## Things to consider
Should we only be archiving successful extractions of the ISO queue data from gridstatus? The API occasionally fails because the ISOs change their data formats without notice. For example, what should we do if gridstatus returns all ISO queues except for NYISO's? Should we save the non-NYISO data or skip the run entirely? If we save failed extractions from gridstatus, we'll need some logic in the ETL to find the latest successful archive, as in the sketch below.
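A sketch of that "find the latest successful archive" logic, assuming we archive every extraction and rely on GCS generations; bucket and object names are placeholders:

```python
import io

import pandas as pd
from google.cloud import storage


def latest_readable_queue(bucket_name: str, blob_name: str) -> pd.DataFrame:
    """Return the newest archived generation that still parses as parquet."""
    client = storage.Client()
    # versions=True lists every generation, not just the live object;
    # sort newest-first by generation number.
    blobs = sorted(
        (
            b
            for b in client.list_blobs(bucket_name, prefix=blob_name, versions=True)
            if b.name == blob_name
        ),
        key=lambda b: b.generation,
        reverse=True,
    )
    for blob in blobs:
        try:
            return pd.read_parquet(io.BytesIO(blob.download_as_bytes()))
        except Exception:
            continue  # unreadable archive; fall back to the next-newest
    raise FileNotFoundError(f"No readable archive found for {blob_name}")
```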
We should point the ETL code at a specific snapshot of the data so our ETL won't randomly break. The tradeoff is that the data won't be as fresh as possible. To update the data, we'd have to create a new branch, point the ETL at the most recent snapshot, and make sure nothing breaks. If the new snapshot breaks the ETL, we'd have to adjust the ETL code or revert to a previous data snapshot.