catalyst-cooperative / pudl-scrapers

Scrapers used to acquire snapshots of raw data inputs for versioned archiving and replicable analysis.
MIT License

Get all filings from FERC RSS feed, not just most recent 650 #45

Closed: zschira closed this issue 2 years ago

zschira commented 2 years ago

FERC recently released a notice that only the most recent 650 XBRL filings will be available through their RSS feed at any given time:

The eForms Accepted Filings RSS feed has been modified to support the performance of the eForms system. RSS feed subscriptions are limited to 650 current filings. To retrieve the feed for previous months, please modify and use the following example URLs:

https://ecollection.ferc.gov/api/rssfeed?month=5&year=2022 – May 2022
https://ecollection.ferc.gov/api/rssfeed?month=4&year=2022 – April 2022

This type of URL will work for all previous months.

Our scraper will need to be able to access all filings including those only available through these alternate URLs.
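For reference, here is a minimal sketch of what walking the monthly feeds could look like, using the URL pattern from the notice above. The use of `feedparser` and the function name are illustrative assumptions, not anything that exists in the scraper today:

```python
import feedparser  # assumption: feedparser is one reasonable way to parse the RSS feed

MONTHLY_FEED_URL = "https://ecollection.ferc.gov/api/rssfeed?month={month}&year={year}"

def get_filing_entries(year: int) -> list:
    """Collect filing entries from all twelve monthly feeds for one year.

    Each element is a feedparser entry; downstream code would pull the XBRL
    download link out of the entry's links.
    """
    entries = []
    for month in range(1, 13):
        feed = feedparser.parse(MONTHLY_FEED_URL.format(month=month, year=year))
        entries.extend(feed.entries)
    return entries
```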

zaneselvans commented 2 years ago

It seems like this might belong in the https://github.com/catalyst-cooperative/pudl-scrapers repo?

zschira commented 2 years ago

Moved

zschira commented 2 years ago

Proposed Solution

This development complicates the scraping of XBRL data; however, I think it's also an opportunity to start adding more automation to our scraping process, starting with the XBRL data.

Handling the RSS feed

Now that the feed is going to be segmented into smaller month-specific feeds, we will need to access at least a subset of those feeds every time we create a yearly archive for one of the forms. I propose we maintain our own archive of the monthly feeds. That archive would remove the need to make an ever-growing list of requests to FERC for each of those sub-feeds, and it could be automatically updated monthly (or more frequently if desired) to add the previous month's feed as well as the current feed with the latest 650 filings.

There are some downsides to relying on an archive of the feeds. For one, it requires the scraper to know how to access that archive, which adds some complexity. It also implicitly assumes that each month-specific feed is static; it should be, but I don't know if we can actually rely on FERC to guarantee that.
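As a rough sketch of what that feed archive could look like, assuming a simple directory of raw feed XML keyed by year and month, and assuming past months only ever need to be fetched once (both assumptions, not settled design):

```python
from datetime import date
from pathlib import Path

import requests

FEED_URL = "https://ecollection.ferc.gov/api/rssfeed?month={month}&year={year}"

def update_feed_archive(archive_dir: Path) -> None:
    """Fetch the current and previous month's feeds and store the raw XML.

    Past months are assumed to be static, so they are only fetched once; the
    current month is always re-fetched because it is still growing.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    today = date.today()
    prev = (today.year, today.month - 1) if today.month > 1 else (today.year - 1, 12)
    for year, month in [(today.year, today.month), prev]:
        out = archive_dir / f"ferc-rssfeed-{year}-{month:02d}.xml"
        if out.exists() and (year, month) != (today.year, today.month):
            continue  # assume a month-specific feed doesn't change once the month is over
        resp = requests.get(FEED_URL.format(month=month, year=year), timeout=60)
        resp.raise_for_status()
        out.write_text(resp.text)
```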

Creating and updating archives

Actual data archives could be updated monthly along with the feed archive. Doing monthly updates like this does complicate our current workflow for scraping/archiving. As it stands now, we create a zipfile with a year's worth of data, upload it to Zenodo, and that is basically the end of it. Now we would be creating a yearly archive that needs to be continually updated. What we don't want is to create a new archive that only contains the latest filings rather than all filings for the year, and then overwrite the old yearly archive on Zenodo with it. I see a couple of possible solutions for avoiding this problem:

Option 1

Every time we create a new archive, parse all available feeds and download all filings, even those that had previously been downloaded.

Pros
Cons
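A minimal sketch of Option 1, assuming the feed entries expose a download link and a usable title (those details, and the form filter, are guesses for illustration):

```python
from pathlib import Path
import zipfile

import feedparser
import requests

def archive_form_year(form: str, year: int, out_path: Path) -> None:
    """Re-download every filing for a form/year and build a fresh yearly zipfile.

    Walks all monthly feeds for the year and downloads each filing it finds,
    even ones that were archived previously, so the new archive is complete
    on its own and can safely replace the old one on Zenodo.
    """
    with zipfile.ZipFile(out_path, "w") as archive:
        for month in range(1, 13):
            feed = feedparser.parse(
                f"https://ecollection.ferc.gov/api/rssfeed?month={month}&year={year}"
            )
            for entry in feed.entries:
                # Assumption: the form name appears in the entry title and the
                # entry link points at the filing itself; the real feed metadata
                # may need different handling.
                if form not in entry.title:
                    continue
                resp = requests.get(entry.link, timeout=60)
                resp.raise_for_status()
                archive.writestr(f"{entry.title}.xbrl", resp.content)
```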

Option 2

Allow the scraper to access archives from Zenodo so it can truly update the existing archives. If the scraper has access to an existing yearly archive, it can simply append new filings to it, and then the updated archive can be uploaded.

Pros
Cons
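A sketch of how Option 2 might work, assuming the scraper knows a download URL for the existing yearly archive on Zenodo and that filings can be keyed by filename (both assumptions for illustration; the upload of the updated archive back to Zenodo is left out):

```python
from pathlib import Path
import zipfile

import requests

def append_new_filings(existing_archive_url: str, new_filings: dict, out_path: Path) -> None:
    """Download the current yearly archive and append any filings it lacks.

    ``new_filings`` maps a filename to raw filing content pulled from the
    latest feed; only names not already in the archive are added.
    """
    resp = requests.get(existing_archive_url, timeout=300)
    resp.raise_for_status()
    out_path.write_bytes(resp.content)

    with zipfile.ZipFile(out_path, "a") as archive:
        already_there = set(archive.namelist())
        for name, content in new_filings.items():
            if name not in already_there:
                archive.writestr(name, content)
```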

Option 3

Change the archive partitioning strategy to create monthly partitions containing the new filings submitted each month. The Datastore would then need to be able to assemble yearly partitions of filings to be used by PUDL.

Pros
Cons
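A sketch of the yearly assembly Option 3 would push onto the Datastore side, assuming monthly zip partitions named by form, year, and month (all names here are hypothetical, not the actual Datastore API):

```python
from pathlib import Path
import zipfile

def assemble_year(partition_dir: Path, form: str, year: int, out_path: Path) -> None:
    """Merge monthly zip partitions into a single yearly archive.

    Iterates newest month first so that if the same filing shows up twice,
    the most recent copy wins; that is one way a resubmission could be handled.
    """
    with zipfile.ZipFile(out_path, "w") as yearly:
        seen = set()
        for month in range(12, 0, -1):
            monthly_path = partition_dir / f"{form}-{year}-{month:02d}.zip"
            if not monthly_path.exists():
                continue
            with zipfile.ZipFile(monthly_path) as monthly:
                for name in monthly.namelist():
                    if name not in seen:
                        yearly.writestr(name, monthly.read(name))
                        seen.add(name)
```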

I think I personally lean towards starting with option 1, with perhaps a goal of moving towards option 2 in the future, but I'd like some feedback from others before getting too far into this.

zaneselvans commented 2 years ago

I'm pretty strongly in support of Option 1. If we don't download all the data, we can't know if any of it has changed retroactively. If we only wanted to update incrementally when things have changed, we'd still have to download everything and calculate the hash to know if it's actually the same as what's already been archived, or whether an updated version is needed. All of the agencies seem prone to updating the data without notice long after its original publication.

For example, when I created the new EIA-861 2021ER archive yesterday, it detected that all the years of data from 2013-2020 had also been changed! So it replaced those files in the archive too. On the back end Zenodo only stores new files when they have a different hash -- otherwise new versions of the archives refer to previous versions of the files that have already been stored.
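For what it's worth, the comparison this implies is just a hash check against what's already stored; a minimal sketch (not part of the archiver), assuming we compare against the md5 checksum Zenodo reports for each file:

```python
import hashlib

def file_changed(new_content: bytes, archived_md5: str) -> bool:
    """Return True if freshly downloaded content differs from the archived copy.

    You still have to download the file to compute its hash, which is why
    incremental updates don't actually save the download.
    """
    return hashlib.md5(new_content).hexdigest() != archived_md5
```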

I think any kind of incremental updating would be a significant change to the meaning of these archives, and would be making assumptions about the stability of agency published data that we have seen violated over and over again. I've been thinking of them as snapshots -- what you would get from the agency if you went and downloaded all of the data on the day that the archive was created.

This understanding could be problematic at some point if an agency (say) decided to start deleting old data after 10 years... at which point we would need to cobble together new archives relying both on newly gathered data and on no-longer-available data that we already have archived. Thankfully we haven't gotten to that point (yet), though it was close there with FERC deleting all the old DBF data!