allenai / mmda

multimodal document analysis
Apache License 2.0
158 stars 18 forks

wip - fix dict word predictor #173

Closed kyleclo closed 1 year ago

kyleclo commented 1 year ago
  1. define a new whitespace tokenization predictor
  2. add IDs to rows and pages in PDFPlumber
  3. add tokenizers to setup
  4. add new DictWordPredictor logic
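
As a rough illustration of steps 1 and 4, here is a minimal sketch of whitespace tokenization with character offsets, followed by a dictionary check that rejoins words hyphenated across line breaks. The names `Token`, `whitespace_tokenize`, and `merge_hyphenated`, as well as the toy vocabulary, are hypothetical and are not taken from the mmda codebase; the actual predictors in this PR may work differently.

```python
import re
from typing import List, NamedTuple, Set


class Token(NamedTuple):
    """Hypothetical token record; not an mmda type."""
    id: int     # position of the token in the sequence
    start: int  # inclusive character offset into the text
    end: int    # exclusive character offset into the text
    text: str


def whitespace_tokenize(text: str) -> List[Token]:
    """Split on runs of whitespace, keeping an ID and offsets per token."""
    return [
        Token(i, m.start(), m.end(), m.group())
        for i, m in enumerate(re.finditer(r"\S+", text))
    ]


def merge_hyphenated(tokens: List[Token], vocab: Set[str]) -> List[str]:
    """Join 'exam-' + 'ple' into 'example' only if the joined form is in vocab."""
    words: List[str] = []
    i = 0
    while i < len(tokens):
        current = tokens[i].text
        if current.endswith("-") and len(current) > 1 and i + 1 < len(tokens):
            joined = current[:-1] + tokens[i + 1].text
            if joined.lower() in vocab:
                words.append(joined)
                i += 2  # consume both halves of the hyphenated word
                continue
        words.append(current)
        i += 1
    return words


tokens = whitespace_tokenize("a simple exam-\nple of well-\nknown text")
words = merge_hyphenated(tokens, vocab={"example"})
```

In this sketch, "exam-" and "ple" are merged because "example" is in the vocabulary, while "well-" and "known" stay separate because "wellknown" is not. Checking the joined form against a dictionary is what keeps genuine hyphenated compounds from being collapsed.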
kyleclo commented 1 year ago

Waiting on @geli-gel's approval to make sure this doesn't break anything in the orchestration code.