catalyst-cooperative / pudl-scrapers

Scrapers used to acquire snapshots of raw data inputs for versioned archiving and replicable analysis.
MIT License

make eia923 scrapper w/ 2021 ER data #33

Closed cmgosnell closed 1 year ago

cmgosnell commented 1 year ago

Augment the spider to grab the ER data and make a new Zenodo archive.

zaneselvans commented 1 year ago

Sometimes I feel like scrapping all this data too, man.

cmgosnell commented 1 year ago

I did not need to edit the scrapers to make this work. I'm not sure why our past scrapers were not grabbing the ER files. I vaguely recall the ER data being in a separate box on the page, which is definitely not the case now. Both the 2021 ER and the 2022 partial-year data were scraped and archived.
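For anyone wondering how page layout could affect which files get picked up: below is a minimal sketch, not the project's actual spider, that collects download links from anywhere on a page rather than from one specific container. It uses only the standard library's `html.parser`. The filename pattern, the sample HTML, and the names `ZIP_LINK`, `ZipLinkCollector`, and `find_zip_links` are all hypothetical and chosen for illustration; the real EIA page and file names may differ.

```python
import re
from html.parser import HTMLParser

# Hypothetical filename pattern; the real EIA-923 file names may differ.
ZIP_LINK = re.compile(r"f923_(\d{4})(er)?\.zip$", re.IGNORECASE)


class ZipLinkCollector(HTMLParser):
    """Collect every matching <a href>, regardless of which element contains it."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # An href attribute with no value is reported as None, so fall back to "".
        href = dict(attrs).get("href") or ""
        if ZIP_LINK.search(href):
            self.links.append(href)


# Made-up page fragment: one link in the main area, one in a separate box.
SAMPLE_PAGE = """
<div class="main"><a href="xls/f923_2022.zip">2022</a></div>
<div class="side-box"><a href="xls/f923_2021er.zip">2021</a></div>
<a href="about.html">About</a>
"""


def find_zip_links(html):
    """Return matching hrefs in document order."""
    parser = ZipLinkCollector()
    parser.feed(html)
    return parser.links
```

Because the match is on the link itself and not on its parent element, a link that moves into or out of a separate box would still be found. Whether that is what changed on the real page is only a guess.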

Sandbox version: https://sandbox.zenodo.org/record/1090056
Big kid version: https://zenodo.org/record/6953766 (DOI: 10.5281/zenodo.6953766)

zaneselvans commented 1 year ago

So as the page is currently formatted, it identifies the ER data as the 2021 data (without any indication that it's ER?) and the 2022 partial / monthly updates as "the 2022" data even though it's not complete? Do we foresee that causing any issues? I guess we can just adjust which years are "working" in the data source metadata, and when we bring in a new archive, if the spreadsheet formatting has changed we'll have to update e.g. the skiprows.
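A rough sketch of that idea, assuming a set of "working" years plus a per-year count of header rows to skip. Everything here is hypothetical: `WORKING_YEARS`, `SKIPROWS_BY_YEAR`, `read_year`, and the sample data are made up for illustration and are not the actual PUDL metadata. Plain CSV text stands in for the spreadsheets so the example has no dependencies.

```python
import csv

# Hypothetical values for illustration only, not the real PUDL metadata.
WORKING_YEARS = {2021}                # years considered usable downstream
SKIPROWS_BY_YEAR = {2021: 2, 2022: 3}  # preamble rows before the header row


def read_year(text, year):
    """Parse one year's table, skipping that year's preamble rows."""
    if year not in WORKING_YEARS:
        raise ValueError(f"{year} is archived but not marked as a working year")
    skip = SKIPROWS_BY_YEAR[year]
    lines = text.splitlines()[skip:]
    # The first remaining line is treated as the column header.
    return list(csv.DictReader(lines))


# Made-up sample: two preamble lines, then a header and two records.
SAMPLE_2021 = (
    "Report title\n"
    "Notes\n"
    "plant_id,net_generation_mwh\n"
    "1,100\n"
    "2,250\n"
)
```

With this layout, an incomplete year can sit in the archive without being read until someone adds it to the working set, and a formatting change in a new archive only means updating that year's skiprows entry.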