katie-lamb closed this 2 months ago
@katie-lamb this is actually "in-progress" right? My understanding is that it's in a good spot to hand off to @zschira for the infrastructure part - maybe it makes sense to split up the R&D piece and the productionizing piece into two tickets?
@jdangerx yep, this has been started but I haven't done anything on it in a week or so. I think it's in a good place to hand off to Zach, but if he's at capacity then I can start chipping away at the infrastructure part. I agree this could be split into smaller tickets.
@zschira last known status is "this seems to mostly work but we haven't tried running it on a VM on a sizeable subset of the real data yet" - is that still the case?
Looks like this was closed by #48, but the closing keywords didn't trigger. @zschira lmk if that's wrong.
The branch I'm working off is `main-10k-extraction`, and this notebook contains a kind of janky extraction function that uses regexes. Does this feel like it's at the point where it can go out of notebook land?

The basic company information (name, CIK, address, etc.) is contained at the top of the SEC 10-K HTML files and is much easier to extract than the ownership information contained in the Ex. 21 files. I think we'd like to extract this information for all companies (not just those that file an Ex. 21) if possible. It might help down the line to have all the companies, and it shouldn't be much more work to create a database with all of them. In the HTML, it looks like this: