Open zaneselvans opened 1 year ago
Made a little bit of progress on Schedule 8C in #2447, but ran into some weirdness. The mappings seem good, but the extraction kept failing on the 2018 data for no reason that I could understand. Listing the files in the zip archive, the Schedule 8 spreadsheet seemed to be there with the right name. But then when I unzipped the archive… the Schedule 8 spreadsheet was missing! We download the zipfiles as-is from EIA when we archive them, and the checksums match between my local file and the Zenodo archive, so it seems that EIA somehow managed to publish a zipfile in October 2022 that listed all the files in its index but didn't contain the data for one of them. Not sure how that happened!
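For anyone who wants to check an archive for this kind of mismatch, here is a minimal sketch using only the standard library. It assumes the problem shows up as a member that is listed in the zip index but is either unreadable or empty; I don't know exactly how the EIA file was broken, so treat this as a starting point. `find_bad_members` and the file names in the demo are hypothetical, not part of PUDL.

```python
import io
import zipfile
import zlib


def find_bad_members(archive) -> dict[str, str]:
    """Map each listed-but-unusable member name to a short problem description.

    ``archive`` can be a path or a binary file-like object.
    """
    problems = {}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            try:
                # Reading the member forces decompression and a CRC check.
                data = zf.read(info.filename)
            except (zipfile.BadZipFile, zlib.error, EOFError, OSError) as err:
                problems[info.filename] = f"unreadable: {err}"
                continue
            if len(data) == 0:
                problems[info.filename] = "listed in index but empty"
    return problems


# Demo on an in-memory archive with made-up file names.
demo_buffer = io.BytesIO()
with zipfile.ZipFile(demo_buffer, "w", zipfile.ZIP_STORED) as demo_zip:
    demo_zip.writestr("schedule_2_3.xlsx", b"fuel receipts")
    demo_zip.writestr("schedule_8.xlsx", b"")  # present in the index, no data
demo_problems = find_bad_members(demo_buffer)
```

The standard library's `ZipFile.testzip()` does a similar CRC pass but only returns the first bad member, which is why the sketch loops over `infolist()` instead.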
The current version of the zipfile from the EIA website has all of the schedules, and there are no notes on the data page about fixing this issue. Our more recent Zenodo archive from February 2023 has all the files, but there are other structural changes that make switching to it additional work right now. For the moment I have commented out the `emissions_control` page ID in `pudl.extract.eia923` so it doesn't run into this problem. We can uncomment it and work with a newer archive next time we update the EIA-923 data.
Note that there's also Schedule 8 data for 2008-2011 but none of it has been mapped yet.
See @grgmiller's comment about incomplete control IDs
Schedule 8 data in EIA-923 contains valuable monthly information about the operation, cost and status of environmental equipment. Let's bring it in!
After raw tables are extracted, each table should be cleaned to prepare for harvesting into the relevant entity (e.g. plant, boiler, SO2 control unit). This includes transforming the dataset to have defined datatypes, well-defined primary keys, and standardized NAs.
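A rough sketch of what that cleaning step could look like in pandas is below. This is not the actual PUDL transform: the function name, the column names (`report_date`, `plant_id_eia`, `so2_control_id_eia`, `operating_cost`) and the list of NA placeholder strings are all assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumed placeholder strings that should become real NA values.
NA_STRINGS = [".", "", "N/A"]


def clean_control_table(raw: pd.DataFrame, primary_key: list[str]) -> pd.DataFrame:
    """Standardize NAs, apply dtypes, and validate the primary key."""
    df = raw.copy()

    # Standardized NAs: turn placeholder strings into NaN before casting.
    df = df.replace(NA_STRINGS, np.nan)

    # Defined datatypes (hypothetical columns, nullable pandas dtypes).
    df["report_date"] = pd.to_datetime(df["report_date"])
    df["plant_id_eia"] = pd.to_numeric(df["plant_id_eia"]).astype("Int64")
    df["so2_control_id_eia"] = df["so2_control_id_eia"].astype("string")
    df["operating_cost"] = pd.to_numeric(df["operating_cost"]).astype("Float64")

    # Well-defined primary key: no missing values and no duplicates.
    if df[primary_key].isna().any().any():
        raise ValueError(f"Null values found in primary key columns {primary_key}")
    if df.duplicated(subset=primary_key).any():
        raise ValueError(f"Duplicate primary key values in {primary_key}")
    return df


# Demo with made-up records.
raw_demo = pd.DataFrame(
    {
        "report_date": ["2018-01-01", "2018-01-01"],
        "plant_id_eia": [3, 3],
        "so2_control_id_eia": ["S1", "S2"],
        "operating_cost": ["100.5", "."],
    }
)
cleaned_demo = clean_control_table(
    raw_demo, primary_key=["report_date", "plant_id_eia", "so2_control_id_eia"]
)
```

Failing loudly on a bad primary key seems preferable to silently dropping rows here, since incomplete control IDs are one of the things we need to find out about.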
Once these `_core_eia923__{x}` tables are cleaned, we will harvest them into tables for the relevant entities. See #3365.

Known data oddities: