Removed paragraph layout docs from validation data
Handled subsidiaries that spanned multiple rows and were grouped into one entity
Removed duplicate rows from extracted tables
Testing
How did you make sure this worked? How can a reviewer verify this?
To-do list
- [x] Review the PR yourself and call out any questions or issues you have
- [x] Try to use Corpwatch dataset as validation - moved to separate issue
- [x] Maybe: Create threshold for failures based on logits - moved to separate issue
- [x] Create plan for paragraph layout document classifier
Overview
Closes #78.
What problem does this address?