catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Integrate EIA-923 Annual Environmental Information (Schedule 8) #2448

Open zaneselvans opened 1 year ago

zaneselvans commented 1 year ago

Schedule 8 of EIA-923 contains valuable annual information about the operation, cost, and status of environmental equipment. Let's bring it in!

### Extract raw data
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3383
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3372
- [ ] #2447
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3384
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3385
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3388

After the raw tables are extracted, each table should be cleaned to prepare it for harvesting into the relevant entity (e.g. plant, boiler, SO2 control unit). This means transforming each dataset so that it has defined datatypes, a well-defined primary key, and standardized NAs.

### Clean tables to prepare for harvesting
- [ ] Clean EIA-923 Schedule 8A Annual Byproduct Disposition
- [ ] Clean EIA-923 Schedule 8B Financial Information
- [ ] Clean EIA-923 Schedule 8C Air Emissions Control Info
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3392
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3393
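
As a rough illustration of what "cleaned" means here (defined datatypes, a well-defined primary key, standardized NAs), here is a minimal pandas sketch. The column names, NA markers, and primary key below are hypothetical placeholders, not the actual Schedule 8 columns or PUDL's transform code.

```python
import pandas as pd

# Hypothetical raw extract: every value arrives as a string, and missing
# values are spelled several different ways. Names are illustrative only.
raw = pd.DataFrame(
    {
        "plant_id_eia": ["3", "3", "7", "7"],
        "report_year": ["2019", "2020", "2019", "2019"],
        "byproduct_description": ["Fly ash", "Fly ash", "Gypsum", "Gypsum"],
        "disposal_landfill_units": ["120", ".", "N/A", "N/A"],
    }
)

# Assumed NA spellings and primary key for this toy table.
NA_MARKERS = [".", "", "N/A", "NA"]
PRIMARY_KEY = ["plant_id_eia", "report_year", "byproduct_description"]


def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize NAs, assign dtypes, and verify the primary key."""
    # Standardize NAs: any cell matching a known marker becomes missing.
    out = df.mask(df.isin(NA_MARKERS))
    # Drop rows that are exact duplicates of one another.
    out = out.drop_duplicates().reset_index(drop=True)
    # Assign explicit, nullable dtypes.
    out = out.assign(
        plant_id_eia=pd.to_numeric(out["plant_id_eia"]).astype("Int64"),
        report_year=pd.to_numeric(out["report_year"]).astype("Int64"),
        byproduct_description=out["byproduct_description"].astype("string"),
        disposal_landfill_units=pd.to_numeric(
            out["disposal_landfill_units"]
        ).astype("Float64"),
    )
    # A well-defined primary key has no missing values and no repeats.
    if out[PRIMARY_KEY].isna().any().any():
        raise ValueError("Primary key columns contain missing values")
    if out.duplicated(subset=PRIMARY_KEY).any():
        raise ValueError("Primary key is not unique")
    return out


cleaned = clean_table(raw)
print(cleaned.dtypes)
```

Failing loudly on a bad primary key, rather than silently dropping rows, is a design choice in this sketch: it surfaces the kind of data oddities that would otherwise show up later during harvesting.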

Once these `_core_eia923__{x}` tables are cleaned, we will harvest them into tables for the relevant entities. See #3365.

Known data oddities:

zaneselvans commented 1 year ago

Made a little progress on Schedule 8C in #2447, but ran into some weirdness. The mappings seem good, but the extraction kept failing on the 2018 data for no reason that I could understand. Listing the files in the zip archive showed the Schedule 8 spreadsheet there with the right name. But when I unzipped the archive… the spreadsheet was missing! We download the zipfiles as-is from EIA when we archive them, and the checksums match between my local file and the Zenodo archive, so it seems that in October 2022 EIA somehow published a zipfile that listed all the files in its index but didn’t contain the data for one of them. Not sure how that happened!

The current version of the zipfile on the EIA website has all of the schedules, and there are no notes on the data page about fixing this issue. Our more recent Zenodo archive from February 2023 has all the files, but it also has other structural changes, so switching to it would be additional work right now. For the moment I have commented out the `emissions_control` page ID in `pudl.extract.eia923` so it doesn't run into this problem. We can uncomment it and work with a newer archive next time we update the EIA-923 data.
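
For anyone who wants to check an archive for this kind of problem, here is a minimal sketch using only the Python standard library. A matching checksum only shows that two copies of the zipfile are identical, not that every member listed in the index can actually be read, so the second function tries to read each member in full. The function names are hypothetical, and catching this particular set of exceptions is an assumption, since I don't know exactly which error the 2018 archive raises.

```python
import hashlib
import zipfile


def file_digest(path, algorithm="md5"):
    """Return the hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unreadable_members(path):
    """Return names listed in the zip index whose data cannot be read."""
    bad = []
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            try:
                with zf.open(info) as member:
                    # Read to the end so the CRC check at EOF is triggered.
                    while member.read(1 << 20):
                        pass
            except (zipfile.BadZipFile, EOFError, OSError, ValueError):
                # Assumption: a missing or damaged member surfaces as one
                # of these exceptions when its data is read.
                bad.append(info.filename)
    return bad
```

Running `unreadable_members()` on a downloaded archive before mapping its pages would flag a member that appears in `namelist()` but has no usable data behind it.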

Note that there's also Schedule 8 data for 2008-2011, but none of it has been mapped yet.

zaneselvans commented 1 year ago

See @grgmiller's comment about incomplete control IDs