catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Archive SEC Ex. 21 Documents #3308

Closed katie-lamb closed 6 months ago

katie-lamb commented 8 months ago

We are interested in archiving Ex. 21 of the SEC 10-K filings, a PDF attachment that is not published as part of the 10-K XBRL filings.

CorpWatch has scraped data describing parent-subsidiary relationships from the Ex. 21 attachments of SEC 10-K filings. They make the data freely available for bulk download or via a RESTful API, and it appears to have been updated recently. However, the CorpWatch bulk download doesn't report ownership percentage, which this project needs. This means we need to get the original Ex. 21 PDFs and extract data from them. The CorpWatch bulk download is still helpful because we can use it as validation data for the OCR outputs, but that's for a different day.

Let's start with some exploratory work to figure out the best way to scrape these PDFs and how to organize them. Some questions: What's the best way to access the data? What challenges might we run into if we make a bunch of calls to these sites? Who else has tried to do this besides CorpWatch? Where should we put all these PDFs?

Here's the CorpWatch source code for getting the 10-Ks (need to do more poking in that repo). Most of the codebase seems to be 13+ years old and written in Perl, but it might be a helpful starting point.

The data can be accessed through the SEC EDGAR portal and it seems like you can open individual PDFs. You can filter by 10K filings there (I got this result).

Other links:

### Tasks
- [x] Investigate methods for accessing SEC 10K Ex. 21 PDFs and other projects that scraped SEC data
- [x] Estimate the size of all the downloaded PDFs
- [x] Figure out how to bulk download these PDFs and make a bucket to dump them in
- [x] Set up way to also collect metadata on the PDFs we're downloading
- [ ] Figure out the best way to structure this pipeline and where the PDFs should live. Set up database for these PDFs and metadata - decouple from Zenodo and write straight to bucket in GCS

zschira commented 8 months ago

Exploratory findings

Accessing filings

After some research, it seems accessing the Exhibit 21 docs might be easier than expected, but there are still some oddities we'll need to deal with. This page on accessing SEC EDGAR data points to index files the SEC distributes, which list all available filings, partitioned by year and quarter. From these indexes, we can filter down to only 10-K filings and then access the index specific to each filing. Here is one example of such an index. These per-filing indexes list the filing itself as well as all available exhibits. From some manual exploration, it appears that indexes for older filings only contain a link to the entire filing (10-K plus all exhibits), while for newer filings you can access exhibits individually in a text (HTML) format. Here is an example of an index for one of these older filings.
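A minimal sketch of the index filtering step described above. It assumes the quarterly `master.idx` files are pipe-delimited with the columns CIK, company name, form type, date filed, and file name, and that they live under `edgar/full-index/{year}/QTR{n}/`; both are assumptions to check against the real files. The helper names and the sample rows (company names, accession numbers) are made up for illustration.

```python
from typing import Iterator, NamedTuple

# Assumed base URL for EDGAR archives; verify before relying on it.
EDGAR_ARCHIVES = "https://www.sec.gov/Archives/"


class FilingRecord(NamedTuple):
    cik: str
    company_name: str
    form_type: str
    date_filed: str
    file_name: str


def quarterly_index_url(year: int, quarter: int) -> str:
    """Build the (assumed) URL of the master index for one year and quarter."""
    return f"{EDGAR_ARCHIVES}edgar/full-index/{year}/QTR{quarter}/master.idx"


def parse_master_index(text: str, form_types=("10-K",)) -> Iterator[FilingRecord]:
    """Yield index rows whose form type is in form_types.

    Header and separator lines are skipped because they do not have
    five pipe-delimited fields starting with a numeric CIK.
    """
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 5 or not parts[0].isdigit():
            continue
        record = FilingRecord(*parts)
        if record.form_type in form_types:
            yield record


# Hypothetical sample in the assumed layout (not real filings).
SAMPLE_INDEX = """\
Description:           Master Index of EDGAR Dissemination Feed
CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000045|EXAMPLE CO A|10-K|2023-06-29|edgar/data/1000045/0000000000-23-000001.txt
1000045|EXAMPLE CO A|8-K|2023-05-01|edgar/data/1000045/0000000000-23-000002.txt
1000097|EXAMPLE CO B|10-K/A|2023-04-14|edgar/data/1000097/0000000000-23-000003.txt
"""

ten_k_filings = list(parse_master_index(SAMPLE_INDEX))
for filing in ten_k_filings:
    print(filing.cik, filing.date_filed, EDGAR_ARCHIVES + filing.file_name)
```

Whether amended filings such as `10-K/A` should be included is an open choice; passing `form_types=("10-K", "10-K/A")` would keep them.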

As for the scale of the raw filings, I think it will be well within the range we are used to working with in other datasets. I'll need to parse all of the indexes to see how many 10-K filings there are and how many of those contain an Exhibit 21, but it seems likely to be on the order of tens of thousands to a few hundred thousand filings. Some example Exhibit 21s I've downloaded are ~1 KB, while the entire 10-K file is a couple hundred KB. So, for a high estimate, if we downloaded all of the 10-Ks we might be looking at 250 KB x 250,000 filings ~= 62.5 GB. We can also probably just use the indexes distributed by the SEC for metadata. These contain the company name, Central Index Key (CIK, a unique SEC identifier for filers), form type, date filed, and file name.
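As a quick back-of-the-envelope check, here is the size estimate worked through with the rough figures quoted above (~250 KB per full 10-K, ~1 KB per Exhibit 21, and somewhere between 10,000 and 250,000 filings). The function name is hypothetical and decimal units (1 GB = 10^9 bytes) are assumed. With these inputs the high case comes out at about 62.5 GB.

```python
def estimate_archive_size_gb(n_filings: int, avg_kb: float) -> float:
    """Estimate total archive size in GB, using decimal units (1 KB = 1000 bytes)."""
    total_bytes = n_filings * avg_kb * 1000
    return total_bytes / 1e9


# Rough inputs taken from the comment above; none of these are measured values.
low_full_filings = estimate_archive_size_gb(10_000, 250)
high_full_filings = estimate_archive_size_gb(250_000, 250)
high_exhibits_only = estimate_archive_size_gb(250_000, 1)

print(f"Full 10-Ks, low case:   {low_full_filings:.2f} GB")
print(f"Full 10-Ks, high case:  {high_full_filings:.2f} GB")
print(f"Exhibit 21s only, high: {high_exhibits_only:.2f} GB")
```

Even the high case stays in the tens of gigabytes, and storing only the exhibits would be far smaller where they can be fetched individually.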

Format inconsistencies

Assuming we can access all of the Exhibit 21 documents in a text format, there are still some format inconsistencies we will need to address. It seems that all of the Exhibit 21 documents contain an HTML table of some sort, but some contain ownership percentage while others don't, some have a hierarchical structure displaying subsidiaries of subsidiaries, etc. There may be a couple of common patterns that will hopefully apply to all or most filings, but we'll need a more comprehensive look at filings to have any certainty on this.
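To make the format question concrete, here is a rough sketch of pulling table rows out of an HTML exhibit with the standard-library `html.parser` and checking whether any cell looks like an ownership percentage. The sample markup, class name, and helper names are all hypothetical, and real exhibits will likely be messier (nested tables, indentation used to show hierarchy, percentages written out in words), so treat this as a starting point for exploration rather than a working extractor.

```python
import re
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if any(self._row):  # skip rows whose cells are all empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None


PERCENT_RE = re.compile(r"\d{1,3}(?:\.\d+)?\s*%")


def extract_rows(html: str) -> list:
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def has_ownership_percentage(rows) -> bool:
    """True if any cell contains something like '51%' or '99.5 %'."""
    return any(PERCENT_RE.search(cell) for row in rows for cell in row)


# Hypothetical exhibit fragment, not taken from a real filing.
SAMPLE_EXHIBIT = """
<table>
<tr><th>Name of Subsidiary</th><th>Jurisdiction</th><th>Ownership</th></tr>
<tr><td>Example Holdings LLC</td><td>Delaware</td><td>100%</td></tr>
<tr><td>Example Sub Inc.</td><td>Texas</td><td>51%</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_EXHIBIT)
print(rows)
print("has ownership percentage:", has_ownership_percentage(rows))
```

Running a check like this across a sample of filings would give a first count of how many exhibits report ownership percentage at all.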

katie-lamb commented 8 months ago

Adding a link to the current Zenodo archive

katie-lamb commented 7 months ago

3/7 check-in update: working on decoupling the archiver from Zenodo so that we can write directly to GCS. Dazhong and Zach are going to run through the Terraform setup in this PR to get the cloud bucket in place and get the database created.