The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
# Description
Preparations for our 7-15 minute recurring demonstration of PUDL at OpenMod US 2023. We want to show folks how easy it is to access and work with the data we publish using Jupyter notebooks, Datasette, nightly build outputs, etc.
# Motivation
- Make people aware of, and excited about, working with the open data we publish.
- Give people enough of an intro that they feel able to play with the data on their own after the conference.
- The target audience is folks who already have some domain knowledge (OpenMod attendees) but may have a variety of technical backgrounds and familiarity with different sets of tools.
# Scope
- The PUDL Dataset on Kaggle is well documented (a Kaggle usability score of 9+ out of 10?).
- The PUDL Dataset on Kaggle is being automatically updated based on nightly builds.
- The notebooks associated with the PUDL Dataset on Kaggle are being automatically tested as the data evolves.
- Our Datasette deployment is working and can handle a bit of a spike in new usage.
- We are able to capture and analyze PUDL usage that results from this outreach.
- We have a 7-15 minute demonstration that we can run through with a new user which covers:
  - Interactive access & computation via Jupyter notebooks on Kaggle
  - Browsing and querying of data on Datasette
  - Bulk data download from the AWS Open Data Registry for local usage (see the sketch after this list)
  - Bulk data download from a versioned Zenodo archive
  - Data Dictionaries that annotate the data on Read the Docs
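For the bulk AWS download step, a minimal sketch using `s3fs`; the bucket name and object path below are assumptions about how the nightly build outputs might be laid out, not confirmed locations:

```python
# Sketch: pull a nightly build output from the AWS Open Data Registry bucket.
# The bucket name and object path are assumptions and may not match reality.
import s3fs

fs = s3fs.S3FileSystem(anon=True)  # public open-data buckets allow anonymous access
print(fs.ls("pudl.catalyst.coop"))  # browse what is actually available first

# Hypothetical path to a nightly pudl.sqlite; adjust to whatever ls() shows.
fs.get("pudl.catalyst.coop/nightly/pudl.sqlite", "pudl.sqlite")
```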
# Out of Scope
- Introducing users to PUDL development environment setup.
- Introducing users to running the back-end / Dagster.
# Comanche Notebook Outline
- Given narrative context around the plant, how do we find it in the data?
- Create a table with some basic summary information about CO coal-fired generators.
- Make a map of CO coal plants in 2010 vs. 2022.
- Group generators by plant and primary fuel type and sum capacity (see the first sketch after this outline).
- Now that we know the EIA plant ID is 470 and the generators are 1, 2, and 3, dig in there.
- Using monthly EIA-923 data (second sketch after this outline), show:
  - total net generation in MWh
  - total fuel consumption in MMBTU
  - heat rate (thermal efficiency) in MMBTU / MWh
  - fuel costs in $/MWh
  - capacity factor
- Using annual FERC Form 1 data (third sketch after this outline), show:
  - annually averaged non-fuel operating costs in $/MWh
  - annually averaged CapEx in $/MW of capacity
  - Note that fuel consumption, fuel cost, and net generation are also available in FERC 1, but are not as granular or reliable as the EIA-923 data.
- Highlight the existence of multiple ownership slices and complicated reporting if it shows up.
- Using hourly EPA CEMS data (fourth sketch after this outline):
  - Compare CEMS-derived monthly net generation, fuel consumption, capacity factors, and implied heat rates with those we got from EIA-923.
  - Using the hourly data, look at the structure of outages / operational loads.
  - Highlight the frequent outages for unit 3: the low capacity factor isn't because of ramping; the unit is either on or off.
  - Calculate emissions.
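A minimal sketch of the capacity summary step, assuming a local copy of `pudl.sqlite`; the table and column names (`generators_eia860`, `plants_eia860`, `capacity_mw`, `energy_source_code_1`, ...) are illustrative and may not match the current schema, especially after the table renames:

```python
# Sketch: summarize Colorado coal-fired generators from a local pudl.sqlite.
# Table and column names are assumptions and may differ from the real schema.
import sqlite3

import pandas as pd

conn = sqlite3.connect("pudl.sqlite")
gens = pd.read_sql(
    """
    SELECT g.report_date, g.plant_id_eia, g.generator_id,
           g.capacity_mw, g.energy_source_code_1,
           p.plant_name_eia, p.state
    FROM generators_eia860 AS g
    JOIN plants_eia860 AS p USING (plant_id_eia, report_date)
    WHERE p.state = 'CO'
    """,
    conn,
    parse_dates=["report_date"],
)

# Most recent reporting year only, coal primary fuel codes only.
latest = gens[gens.report_date == gens.report_date.max()]
coal = latest[latest.energy_source_code_1.isin(["BIT", "SUB", "LIG", "WC"])]

# Sum capacity by plant and primary fuel type.
capacity_mw = (
    coal.groupby(["plant_id_eia", "plant_name_eia", "energy_source_code_1"])
    .capacity_mw.sum()
    .sort_values(ascending=False)
)
print(capacity_mw.head(10))  # Comanche (plant_id_eia 470) should show up near the top
```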
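For the EIA-923 metrics, a sketch of the arithmetic, again with assumed table and column names (`generation_fuel_eia923`, `net_generation_mwh`, `fuel_consumed_mmbtu`) and nameplate capacity pulled from the assumed EIA-860 generators table:

```python
# Sketch: monthly heat rate and capacity factor for Comanche (plant_id_eia 470).
# Table and column names are assumptions; check the data dictionary for the real ones.
import sqlite3

import pandas as pd

conn = sqlite3.connect("pudl.sqlite")

gf = pd.read_sql(
    "SELECT * FROM generation_fuel_eia923 WHERE plant_id_eia = 470",
    conn,
    parse_dates=["report_date"],
)
monthly = gf.groupby(pd.Grouper(key="report_date", freq="MS"))[
    ["net_generation_mwh", "fuel_consumed_mmbtu"]
].sum()

# Heat rate: MMBTU of fuel burned per MWh generated (lower = more efficient).
monthly["heat_rate_mmbtu_per_mwh"] = (
    monthly.fuel_consumed_mmbtu / monthly.net_generation_mwh
)

# Capacity factor: actual generation relative to running flat out all month,
# using total plant capacity from the (assumed) EIA-860 generators table.
cap = pd.read_sql(
    "SELECT report_date, capacity_mw FROM generators_eia860 WHERE plant_id_eia = 470",
    conn,
    parse_dates=["report_date"],
)
plant_capacity_mw = cap[cap.report_date == cap.report_date.max()].capacity_mw.sum()
hours_in_month = monthly.index.days_in_month * 24
monthly["capacity_factor"] = monthly.net_generation_mwh / (
    plant_capacity_mw * hours_in_month
)

# Fuel cost in $/MWh is the delivered fuel cost ($/MMBTU, from the fuel
# receipts and costs data) multiplied by the heat rate computed above.
print(monthly.tail(12))
```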
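For the FERC Form 1 cost metrics, a sketch assuming a steam plants table; the table name (`plants_steam_ferc1`) and cost column names are assumptions, and the FERC records are matched to Comanche crudely by plant name here rather than via the PUDL plant ID linkage:

```python
# Sketch: annual non-fuel O&M ($/MWh) and capital cost ($/MW) from FERC Form 1.
# Table and column names are assumptions and may not match the real schema.
import sqlite3

import pandas as pd

conn = sqlite3.connect("pudl.sqlite")
steam = pd.read_sql("SELECT * FROM plants_steam_ferc1", conn)

# Crude match on plant name; a real notebook would use the FERC-EIA plant linkage.
comanche = steam[steam.plant_name_ferc1.str.contains("comanche", case=False, na=False)]

annual = comanche.groupby("report_year").agg(
    opex_production_total=("opex_production_total", "sum"),
    opex_fuel=("opex_fuel", "sum"),
    capex_total=("capex_total", "sum"),
    net_generation_mwh=("net_generation_mwh", "sum"),
    capacity_mw=("capacity_mw", "sum"),
)
annual["nonfuel_opex_per_mwh"] = (
    annual.opex_production_total - annual.opex_fuel
) / annual.net_generation_mwh
annual["capex_per_mw"] = annual.capex_total / annual.capacity_mw
print(annual[["nonfuel_opex_per_mwh", "capex_per_mw"]])
```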
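For the EPA CEMS piece, a sketch of reading the hourly Parquet data with dask and rolling it up for comparison with EIA-923. The file name, column names, and string unit ID are all assumptions, and note that CEMS reports gross load rather than net generation, so some offset from EIA-923 is expected:

```python
# Sketch: hourly EPA CEMS data for Comanche via dask, rolled up to monthly.
# File and column names are assumptions and may differ from the real outputs.
import dask.dataframe as dd
import pandas as pd

cems = dd.read_parquet(
    "hourly_emissions_epacems.parquet",
    filters=[("plant_id_eia", "==", 470)],  # push the filter down to pyarrow
    columns=[
        "plant_id_eia",
        "emissions_unit_id_epa",
        "operating_datetime_utc",
        "gross_load_mw",
        "heat_content_mmbtu",
        "co2_mass_tons",
    ],
).compute()  # small enough to pull into pandas once filtered to one plant

# Monthly totals by unit; summing hourly MW values gives MWh of gross generation.
monthly = cems.groupby(
    ["emissions_unit_id_epa", pd.Grouper(key="operating_datetime_utc", freq="MS")]
)[["gross_load_mw", "heat_content_mmbtu", "co2_mass_tons"]].sum()
print(monthly.tail(12))

# Hourly gross load for unit 3 makes the on/off outage structure visible:
# the unit runs near full load or not at all, rather than ramping.
unit3 = cems[cems.emissions_unit_id_epa == "3"].set_index("operating_datetime_utc")
unit3.gross_load_mw.plot(figsize=(12, 3), title="Comanche unit 3 hourly gross load")
```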
# Minimum Requirements
- [x] Transfer ownership of PUDL dataset to Catalyst Cooperative on Kaggle (or create a new dataset if we can't transfer)
- [x] Schedule the Catalyst-owned PUDL dataset to update weekly.
- [ ] Merge the rename PR so that users see what the DB is going to look like going forward.
- [ ] Update example notebooks to work in the Kaggle python environment.
- [ ] Update example notebooks to work with the data-only outputs.
- [ ] Manually fill in dataset and file-level metadata.
- [ ] Develop a ~10 minute demonstration script.
- [ ] Ensure that table & column level previews for `pudl.sqlite` are working
# Example Notebooks
- [x] Get [PUDL Example notebooks](https://github.com/catalyst-cooperative/pudl-examples) linked to the PUDL dataset.
- [x] Schedule example notebooks to run automatically when PUDL dataset is updated to verify that they still work.
- [x] Load data from SQLite (see the sketch after this list)
- [x] Load CEMS data from Parquet efficiently using dask
- [ ] Plot some energy system operational data
- [ ] Plot some utility financial data (maybe FERC 1 large plant expenses over time?)
- [ ] Make a service territory map
- [ ] Plot state-level electricity demand estimates.
- [ ] Demonstrate the link between FERC and EIA data.
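As a complement to the checked items above, a minimal sketch of the SQLite loading step as it might look on Kaggle; the mount path and the table name queried are assumptions:

```python
# Sketch: open pudl.sqlite from the Kaggle dataset and pull a table into pandas.
# The input path and table name are assumptions, not confirmed values.
from pathlib import Path

import pandas as pd
import sqlalchemy as sa

pudl_db = Path("/kaggle/input/pudl-project/pudl.sqlite")  # hypothetical mount point
engine = sa.create_engine(f"sqlite:///{pudl_db}")

# See what tables are available, then grab a small sample of one of them.
print(sa.inspect(engine).get_table_names()[:20])
sample = pd.read_sql("SELECT * FROM plants_eia860 LIMIT 5", engine)
print(sample)
```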
# Stretch Goals
- [ ] Do a versioned data release on AWS & Zenodo
- [ ] Update `ferc-xbrl-extractor` to Frictionless v5 so we can correctly annotate the XBRL derived SQLite DBs.
- [ ] Update `pudl` to Frictionless v5 so we can correctly annotate the PUDL SQLite DB
- [ ] Create a valid `dataset-metadata.json` annotating all nightly build outputs for easier use on Kaggle.
- [ ] Create a Kaggle Organization to manage our datasets and competitions going forward