This is a running list of changes we could try in order to improve the performance of the Ex. 21 extraction model. I moved the "nice to have" straggler items from #78 into this issue. These items can be explored after record linkage, once we have a better idea of the remaining budget and performance needs.
Example of a filing with a "footnotes" section that can be excluded:
103872-0001193125-13-444053
### Next steps
- [ ] Use Corpwatch dataset for further validation
- [ ] Nice to have: break out `layoutlm-finetune` into ops
- [ ] Try clustering the final hidden states instead of using the heuristic-based table extractor
- [ ] Exclude anything below "Footnotes" or similar keywords
- [ ] Create threshold for entity classification failure based on logits returned by LayoutLM
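
As a starting point for the last two items, here is a rough sketch of what they might look like. Everything in it is an assumption for illustration: the function names, the keyword pattern, and the `0.5` threshold are hypothetical, and the logits are assumed to arrive as one list of per-class scores per token. It is not tied to the current pipeline code.

```python
import math
import re

# Hypothetical pattern: the real list of "similar keywords" is still undecided.
FOOTNOTE_HEADING = re.compile(r"^\s*(footnotes?|notes?)\s*:?\s*$", re.IGNORECASE)


def drop_footnotes(lines):
    """Return only the lines that come before the first footnote-style heading."""
    for i, line in enumerate(lines):
        if FOOTNOTE_HEADING.match(line):
            return lines[:i]
    return lines


def softmax(logits):
    """Numerically stable softmax over one token's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def low_confidence_tokens(token_logits, threshold=0.5):
    """Return indices of tokens whose top class probability is below `threshold`.

    `threshold` is a placeholder value; it would need tuning on validation data.
    """
    return [
        i
        for i, logits in enumerate(token_logits)
        if max(softmax(logits)) < threshold
    ]
```

A filing could then be flagged as a classification failure when the share of low-confidence tokens exceeds some cutoff, which would also need to be chosen empirically.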