allenai / mmda

multimodal document analysis
Apache License 2.0

Kyle/2022 09/pdfplumber symbols #138

Closed kyleclo closed 2 years ago

kyleclo commented 2 years ago

this PR handles 2 things:

  1. make the default settings of the Pdfplumber parser our currently recommended ones (the settings most conducive to citation mention detection). we want everyone developing their models on PDF token data that is as similar as possible.

  2. slightly modify the citation mention detection model so it imports its split-punctuation variable from the pdfplumber parser itself; this minimizes the risk that the two deviate over time.
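
a rough sketch of the idea in item 2, not the actual mmda code: the parser side owns the split-punctuation default, and the downstream model reuses that object instead of keeping its own copy. `DEFAULT_SPLIT_CHARS`, `split_token` and `CitationMentionModel` are hypothetical names, and the character set is made up for illustration.

```python
import re
from typing import List

# Hypothetical stand-in for the default owned by the parser module.
# The real mmda value may differ; this set is only for illustration.
DEFAULT_SPLIT_CHARS = ".,;:()[]"


def split_token(text: str, split_chars: str = DEFAULT_SPLIT_CHARS) -> List[str]:
    """Split a word so that every split character becomes its own token."""
    if not split_chars:
        return [text] if text else []
    # The capturing group makes re.split keep the separators in the result.
    pattern = "([" + re.escape(split_chars) + "])"
    return [piece for piece in re.split(pattern, text) if piece]


class CitationMentionModel:
    """Hypothetical downstream model that must tokenize like the parser."""

    # Reuse the parser's constant rather than re-typing the string here,
    # so a change to the default reaches both places at once.
    split_chars = DEFAULT_SPLIT_CHARS

    def tokenize(self, word: str) -> List[str]:
        return split_token(word, self.split_chars)


print(split_token("(Smith,2020)."))
```

because the model references the same object, there is only one place to edit when the recommended punctuation set changes.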

kyleclo commented 2 years ago

thx; made the bump in setup

that's right @cmwilhelm, we'd generally be asking MMDA users to use our default setting & to specify their own splitting strategy at their own risk.

the downside is that changes to this default are less visible, but it's up to users of the library to decide whether they want to maintain a magic string on their end
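
for illustration only, here is one way a caller could pin their own splitting strategy while still being told when it differs from the library default. this is a sketch under assumptions: `LIBRARY_DEFAULT_SPLIT_CHARS` and `resolve_split_chars` are hypothetical names, not part of mmda.

```python
import warnings
from typing import Optional

# Hypothetical library-side default; the real value lives in the parser.
LIBRARY_DEFAULT_SPLIT_CHARS = ".,;:()[]"


def resolve_split_chars(
    user_split_chars: Optional[str] = None,
    library_default: str = LIBRARY_DEFAULT_SPLIT_CHARS,
) -> str:
    """Return the split characters to use, preferring an explicit user pin."""
    if user_split_chars is None:
        # No pin: follow whatever the library currently recommends.
        return library_default
    if set(user_split_chars) != set(library_default):
        # Pinned value differs from the default: allowed, but flagged.
        warnings.warn(
            "custom split characters differ from the library default; "
            "tokens may not match those other models were developed on",
            UserWarning,
        )
    return user_split_chars
```

calling `resolve_split_chars()` follows the default, while `resolve_split_chars(".,")` keeps the caller's pin and emits a warning.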