catalyst-cooperative / mozilla-sec-eia

Exploratory development for SEC to EIA linkage
MIT License

Second Pass Ex. 21 Model Improvements #78

Open katie-lamb opened 1 month ago


### Overview

Spinning off the unresolved tasks from #68 into a separate issue. Post-Dagsterization, some of these changes are also easier to implement in a PR separate from the original, so it made sense to break them out into a second-pass issue.

Example of filing that should fail and not create a PDF:

Example of filing with "footnotes": 103872-0001193125-13-444053

### Next steps
- [ ] Use Corpwatch dataset for further validation
- [x] More comparison of validation and extracted dataframes, with more flexible, ergonomic improvements to `GCSArchive`
- [x] Add an x and y threshold for distinct subsidiary cutoffs
- [x] Don't create PDFs for empty tables or Ex. 21 tables that can't be rendered; skip them instead. Also handle "crossed out" filings
- [x] Zach: Look into what's happening with HTMLs that are rendered without table structure or in "paragraph" form
- [x] Look at the layouts that are most commonly "missed"
- [x] Try removing "paragraph" layout docs and replacing each with a filing from the same filer in a different year
- [ ] Nice to have: break out `layoutlm-finetune` into ops
- [ ] Nice to have: Try clustering the final hidden states instead of using the heuristic-based table extractor
- [ ] Nice to have: Exclude anything below "Footnotes" or similar keywords
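
For the "Footnotes" item above, a minimal sketch of one possible approach, assuming the Ex. 21 text is already available as a list of lines. The function name `drop_footnotes` and the keyword pattern are hypothetical and not part of the existing codebase; only "Footnotes" is named in this issue, and "Notes" is added here as an example of a "similar keyword".

```python
import re

# Hypothetical heading pattern: a line consisting only of "Footnote(s)" or
# "Note(s)", optionally followed by a colon. Only "Footnotes" comes from the
# issue; the rest is an assumption for illustration.
FOOTNOTE_HEADING = re.compile(r"^\s*(footnotes?|notes?)\s*:?\s*$", re.IGNORECASE)


def drop_footnotes(lines):
    """Return the lines that appear before the first footnote heading.

    If no heading is found, all lines are returned unchanged.
    """
    kept = []
    for line in lines:
        if FOOTNOTE_HEADING.match(line):
            break  # everything from the heading onward is excluded
        kept.append(line)
    return kept


example = [
    "Subsidiary A    Delaware",
    "Subsidiary B (1)    Texas",
    "Footnotes:",
    "(1) 50% owned joint venture",
]
print(drop_footnotes(example))
```

Matching only whole-line headings keeps rows that merely mention a note marker, such as "Subsidiary B (1)", in the table. Whether real filings label footnotes this consistently would need to be checked against the validation set.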