Closed kyleclo closed 2 years ago
thx; made the bump in setup
that's right @cmwilhelm, we'd generally be asking MMDA users to use our default setting and to specify their own splitting strategy at their own risk.
there are downsides to this, in that changes to this default are less visible, but it's up to the users of the library to decide whether they want to maintain a magic string on their end.
this PR handles 2 things:
- make the default usage of the pdfplumber parser our current most recommended setting (the one most conducive to citation mention detection). we want everyone to be developing their models off of PDF token data that is as similar as possible.
- slightly modify the citation mention detection model to import its split-punctuation variable from the pdfplumber parser itself, minimizing the risk that the two deviate over time.
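
a rough sketch of the idea behind both changes, for anyone skimming: the parser owns a single default for the split characters, and the downstream model reads that same constant instead of keeping its own copy. every name here (`DEFAULT_SPLIT_PUNCTUATION`, `SketchParser`, `SketchMentionPredictor`) is hypothetical and made up for illustration; the actual MMDA classes, argument names, and character set may differ.

```python
import re
import string
from typing import List

# Hypothetical shared default. The real constant in MMDA may have a
# different name and a different set of characters.
DEFAULT_SPLIT_PUNCTUATION = string.punctuation


class SketchParser:
    """Stand-in for a PDF parser that owns the default splitting strategy."""

    def __init__(self, split_at_punctuation: str = DEFAULT_SPLIT_PUNCTUATION):
        # Callers may override the default, at their own risk.
        self.split_at_punctuation = split_at_punctuation

    def tokenize(self, text: str) -> List[str]:
        # With no split characters, fall back to plain whitespace splitting.
        if not self.split_at_punctuation:
            return text.split()
        # Build a character class from the split characters; the capture
        # group keeps each matched character as its own token.
        pattern = "([" + re.escape(self.split_at_punctuation) + "])"
        tokens: List[str] = []
        for word in text.split():
            tokens.extend(t for t in re.split(pattern, word) if t)
        return tokens


class SketchMentionPredictor:
    """Stand-in for a model that expects tokens split the default way."""

    def __init__(self) -> None:
        # Read the parser's constant instead of hard-coding a second copy,
        # so the two cannot silently drift apart.
        self.split_chars = DEFAULT_SPLIT_PUNCTUATION

    def is_compatible_with(self, parser: SketchParser) -> bool:
        return parser.split_at_punctuation == self.split_chars


default_parser = SketchParser()
custom_parser = SketchParser(split_at_punctuation="")
predictor = SketchMentionPredictor()
```

with this layout, `default_parser` yields the tokens the model was developed against, while `custom_parser` still works but is flagged as a mismatch by `is_compatible_with`. that is the "at your own risk" path.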