Spinning off the unresolved tasks from #68 into a separate issue. Post-Dagsterization, some of these changes are also easier to implement in a PR separate from the original, so it made sense to split them out into a second-pass issue.
Example of filing that should fail and not create a PDF:
92416-0000892569-94-000102
Example of filing that isn't rendered well with HTML:
9342-0000009342-95-000008
Example of filing with "footnotes":
103872-0001193125-13-444053
### Next steps
- [ ] Use Corpwatch dataset for further validation
- [x] Compare the validation and extracted dataframes further, with more flexible, ergonomic improvements to `GCSArchive`
- [x] Add an x and y threshold for distinct subsidiary cutoffs
- [x] Don't create PDFs for empty tables or Ex. 21 tables that can't be rendered; skip them instead. Also handle "crossed out" filings
- [x] Zach: Look into what's happening with HTMLs that are rendered without table structure or in "paragraph" form
- [x] Look at the layouts that are most commonly "missed"
- [x] Try removing "paragraph" layout docs and replacing each one with a filing from the same filer in a different year
- [ ] Nice to have: break out `layoutlm-finetune` into ops
- [ ] Nice to have: Try clustering the final hidden states instead of using the heuristic-based table extractor
- [ ] Nice to have: Exclude anything below "Footnotes" or similar keywords
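
For the "Footnotes" item above, a minimal sketch of one possible approach, assuming the filing text is already split into lines. The function name `drop_footnotes` and the keyword pattern are hypothetical and not part of the existing codebase; only "Footnotes" is named in this issue, and "Notes" is a guess at a "similar keyword".

```python
import re

# Hypothetical keyword pattern: matches a line that is only a footnote heading,
# e.g. "Footnotes", "FOOTNOTE:", or "Notes". Anchoring on the whole line avoids
# cutting at company names that merely contain the word.
FOOTNOTE_HEADING = re.compile(r"^\s*(footnotes?|notes?)\s*:?\s*$", re.IGNORECASE)


def drop_footnotes(lines):
    """Return the lines that come before the first footnote heading."""
    kept = []
    for line in lines:
        if FOOTNOTE_HEADING.match(line):
            break  # everything from the heading onward is discarded
        kept.append(line)
    return kept
```

If no heading is found, the input is returned unchanged, so this would be safe to apply to filings without footnotes.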