catalyst-cooperative / mozilla-sec-eia

Exploratory development for SEC to EIA linkage
MIT License

Running list of Ex. 21 extraction model improvement ideas #88

Open katie-lamb opened 1 month ago

katie-lamb commented 1 month ago

### Overview

This is a running list of changes we could try in order to improve the performance of the Ex. 21 extraction model. I moved the remaining "nice to have" items from #78 into this issue. We can experiment with these items after record linkage, when we have a better idea of the remaining budget and performance needs.

### Next steps
- [ ] Use Corpwatch dataset for further validation
- [ ] Nice to have: break out `layoutlm-finetune` into ops
- [ ] Try clustering the final hidden states instead of using the heuristic-based table extractor
- [ ] Exclude anything below "Footnotes" or similar keywords
- [ ] Create a threshold for entity classification failure based on the logits returned by LayoutLM
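
For the logit threshold item, here is a rough sketch of one way it could work, not something that exists in the repo. It assumes we already have per-token logits from LayoutLM as nested lists (one row of class scores per token). The function names, `min_confidence`, and `max_flagged_fraction` are hypothetical, and the default values are placeholders that would need tuning against the validation set.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert one token's raw logits into probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def flag_low_confidence(
    token_logits: Sequence[Sequence[float]],
    min_confidence: float = 0.8,  # hypothetical placeholder, tune on validation data
) -> List[bool]:
    """Return True for each token whose top-class probability is below the threshold."""
    return [max(softmax(row)) < min_confidence for row in token_logits]


def document_failed(
    token_logits: Sequence[Sequence[float]],
    min_confidence: float = 0.8,
    max_flagged_fraction: float = 0.25,  # hypothetical placeholder
) -> bool:
    """Treat a filing as a classification failure if too many tokens are uncertain."""
    flags = flag_low_confidence(token_logits, min_confidence)
    if not flags:
        return True  # no tokens at all means nothing was classified
    return sum(flags) / len(flags) > max_flagged_fraction
```

Filings where `document_failed` returns `True` could be routed to manual review or excluded from the extracted output. Whether to threshold on the max softmax probability, the margin between the top two classes, or something else is an open question.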