catalyst-cooperative / mozilla-sec-eia

Exploratory development for SEC to EIA linkage
MIT License

Extract basic company info from SEC 10K filings #10

Closed katie-lamb closed 2 months ago

katie-lamb commented 6 months ago

The branch I'm working off is main-10k-extraction and this notebook contains a kind of janky extraction function that utilizes regexes. Does this feel like it's at the point where it can go out of notebook land?

The basic company information (name, CIK, address, etc.) is contained at the top of the SEC 10K HTML files and is much easier to extract than the ownership information contained in the Ex. 21 files. I think we'd like to extract this information for all companies (not just those that file an Ex. 21) if possible. Having every company might help down the line, and it shouldn't be much more work to create a database with all of them. In the HTML, it looks like this:

COMPANY DATA:   
    COMPANY CONFORMED NAME:         TRANS LUX CORP
    CENTRAL INDEX KEY:          0000099106
    STANDARD INDUSTRIAL CLASSIFICATION: MISCELLANEOUS MANUFACTURING INDUSTRIES [3990]
    IRS NUMBER:             131394750
    STATE OF INCORPORATION:         DE
    FISCAL YEAR END:            1231

FILING VALUES:
    FORM TYPE:      10-K
    SEC ACT:        1934 Act
    SEC FILE NUMBER:    001-02257
    FILM NUMBER:        13949612

BUSINESS ADDRESS:   
    STREET 1:       26 PEARL STREET
    CITY:           NORWALK
    STATE:          CT
    ZIP:            06850-1647
    BUSINESS PHONE:     2038534321

MAIL ADDRESS:   
    STREET 1:       26 PEARL STREET
    CITY:           NORWALK
    STATE:          CT
    ZIP:            06850-1647
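
A minimal sketch of a section-aware `key: value` extractor for the header format above. This is not the function from the notebook: `parse_header`, the two regexes, and the snake_case key names are hypothetical, and it assumes every header follows this exact layout (an unindented or indented `SECTION:` line followed by `KEY: value` lines).

```python
import re

# A section heading is a label followed by a colon and nothing else,
# e.g. "COMPANY DATA:". (Assumed layout, based on the sample above.)
SECTION_RE = re.compile(r"^\s*([A-Z][A-Z0-9 ]*?):\s*$")
# A field is "KEY: value", e.g. "CENTRAL INDEX KEY:   0000099106".
FIELD_RE = re.compile(r"^\s*([A-Z][A-Z0-9 ]*?):\s*(\S.*?)\s*$")


def _snake(label):
    """Turn 'CENTRAL INDEX KEY' into 'central_index_key'."""
    return label.lower().replace(" ", "_")


def parse_header(text):
    """Parse a filing header into {section: {key: value}}."""
    parsed = {}
    section = None
    for line in text.splitlines():
        match = SECTION_RE.match(line)
        if match:
            section = _snake(match.group(1))
            parsed.setdefault(section, {})
            continue
        match = FIELD_RE.match(line)
        if match and section is not None:
            # Values stay as strings so leading zeros (CIK, ZIP) survive.
            parsed[section][_snake(match.group(1))] = match.group(2)
    return parsed


SAMPLE = """\
COMPANY DATA:
    COMPANY CONFORMED NAME:         TRANS LUX CORP
    CENTRAL INDEX KEY:          0000099106
    STATE OF INCORPORATION:         DE

BUSINESS ADDRESS:
    STREET 1:       26 PEARL STREET
    CITY:           NORWALK
    ZIP:            06850-1647

MAIL ADDRESS:
    STREET 1:       26 PEARL STREET
    CITY:           NORWALK
"""

header = parse_header(SAMPLE)
print(header["company_data"]["central_index_key"])
```

Keeping the section name matters because `STREET 1`, `CITY`, `STATE`, and `ZIP` appear under both `BUSINESS ADDRESS` and `MAIL ADDRESS`; a flat dict would silently overwrite one with the other.
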
### Modeling Tasks
- [x] Build a simple regex-based `key: value` template to extract field values that fit this exact format. Add extraction regexes that match the value fields.
- [x] Extract basic company information into a database where CIK is the primary key (need to address #20)
- [ ] Check which docs aren't captured by this template and catalog differing formats
- [ ] Validate with GEM data
- [x] Report a recall rate of the filing companies
- [ ] Look at the PyMuPDF function for this: https://pymupdf.readthedocs.io/en/latest/recipes-text.html#how-to-extract-key-value-pairs-from-a-page
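
A rough sketch of how the recall and "which docs aren't captured" tasks could be computed together. It assumes one possible definition of recall: the share of filings for which every required field was extracted. `recall_report`, `REQUIRED_FIELDS`, and the filenames are hypothetical and not taken from the branch.

```python
# Fields a filing must yield to count as "captured" (an assumption).
REQUIRED_FIELDS = ("company_conformed_name", "central_index_key")


def recall_report(records, required=REQUIRED_FIELDS):
    """Return (recall, missed) for a mapping of filename -> extracted fields.

    A filing counts as missed when any required field is absent or empty.
    """
    missed = sorted(
        name
        for name, fields in records.items()
        if not all(fields.get(field) for field in required)
    )
    total = len(records)
    recall = (total - len(missed)) / total if total else 0.0
    return recall, missed


# Made-up records, purely for illustration.
records = {
    "a.txt": {"company_conformed_name": "TRANS LUX CORP", "central_index_key": "0000099106"},
    "b.txt": {"company_conformed_name": "EXAMPLE CO", "central_index_key": "0000000001"},
    "c.txt": {"company_conformed_name": "OTHER CO", "central_index_key": "0000000002"},
    "d.txt": {"company_conformed_name": "NO CIK INC"},
}
recall, missed = recall_report(records)
print(f"recall={recall:.2%}, missed={missed}")
```

The `missed` list doubles as the starting point for cataloging headers that don't fit the template.
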
### Infrastructure Tasks
- [x] Process HTML docs in batches
- [x] Write extracted data to a database
- [x] Track progress and the percentage of docs that have been processed
- [x] Track recall rate / metrics with experiment tracking
- [ ] Functions to validate against GEM data
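
A minimal sketch of the batch / database / progress items using only the standard library. Everything here is an assumption made for illustration: the `companies` table and its columns, `write_batches`, and the use of SQLite are not what the project necessarily uses. `INSERT OR REPLACE` keeps the last row written per CIK, which may not be the right behavior for companies with several filings.

```python
import sqlite3
from itertools import islice


def batched(items, size):
    """Yield lists of up to `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def write_batches(conn, records, batch_size=100):
    """Write records to a table keyed on CIK and print progress per batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS companies ("
        "cik TEXT PRIMARY KEY, name TEXT, state_of_incorporation TEXT)"
    )
    total = len(records)
    done = 0
    for batch in batched(records, batch_size):
        conn.executemany(
            "INSERT OR REPLACE INTO companies "
            "VALUES (:cik, :name, :state_of_incorporation)",
            batch,
        )
        conn.commit()
        done += len(batch)
        print(f"processed {done}/{total} ({done / total:.0%})")
    return done


# Made-up rows; the repeated CIK shows the primary-key behavior.
rows = [
    {"cik": "0000099106", "name": "TRANS LUX CORP", "state_of_incorporation": "DE"},
    {"cik": "0000000001", "name": "EXAMPLE CO", "state_of_incorporation": "NY"},
    {"cik": "0000099106", "name": "TRANS LUX CORP", "state_of_incorporation": "DE"},
]
conn = sqlite3.connect(":memory:")
processed = write_batches(conn, rows, batch_size=2)
stored = conn.execute("SELECT COUNT(*) FROM companies").fetchone()[0]
```

CIK is stored as text so the leading zeros are preserved.
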
jdangerx commented 5 months ago

@katie-lamb this is actually "in-progress" right? My understanding is that it's in a good spot to hand off to @zschira for the infrastructure part - maybe it makes sense to split up the R&D piece and the productionizing piece into two tickets?

katie-lamb commented 5 months ago

@jdangerx yep, this has been started but I haven't done anything on it in a week or so. I think it's in a good place to hand off to Zach, but if he's at capacity then I can start chipping away at the infrastructure part. I agree this could be split into smaller tickets.

jdangerx commented 4 months ago

@zschira last known status is "this seems to mostly work but we haven't tried running it on a VM on a sizeable subset of the real data yet" - is that still the case?

jdangerx commented 2 months ago

Looks like this was closed by #48, but the closing keywords didn't trigger. @zschira lmk if that's wrong.