allenai / mmda

multimodal document analysis
Apache License 2.0

Fix citation mentions #226

Closed cmwilhelm closed 1 year ago

cmwilhelm commented 1 year ago

RE: https://github.com/allenai/scholar/issues/36386#issuecomment-1516825407

Changes in PDFPlumberParser behavior since the citation mentions model was last updated caused its integration tests to fail.

This changeset does two things:

1) Changes the test assertions for citation mentions' TIMO integration tests to be text value based, rather than span position based

2) Adds a new test to MMDA's suite that verifies PDFPlumberParser stability -- will alert when our extracted text, tokenization, or bboxes change at this layer, as a signal to reevaluate the rest of the DAG or revert changes.
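For illustration, the two ideas above can be sketched roughly as follows. This is not the code in this changeset: `find_span`, `mention_texts`, and `parser_fingerprint` are hypothetical helpers, and tokens are modeled as plain dicts rather than MMDA's real types. The first part shows why comparing text values survives a parser change that shifts character offsets; the second shows one way to detect a change in extracted text, tokenization, or bboxes by hashing a canonical form of the parser output.

```python
import hashlib
import json


def find_span(text, mention):
    """Hypothetical helper: locate a mention and return its (start, end) span."""
    start = text.find(mention)
    if start < 0:
        raise ValueError(f"mention not found: {mention!r}")
    return (start, start + len(mention))


def mention_texts(text, spans):
    """Return the sorted text values covered by (start, end) character spans."""
    return sorted(text[start:end] for start, end in spans)


def parser_fingerprint(tokens, ndigits=2):
    """Hash token text and rounded bboxes (assumed dict layout) into a digest."""
    canonical = [
        {"text": tok["text"], "bbox": [round(v, ndigits) for v in tok["bbox"]]}
        for tok in tokens
    ]
    payload = json.dumps(canonical, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# 1) Text value based comparison: an extra space shifts every later offset,
#    but the mention texts themselves are unchanged.
old_text = "as shown in [1] and (Smith, 2020)."
new_text = "as shown  in [1] and (Smith, 2020)."
mentions = ["[1]", "(Smith, 2020)"]

old_spans = [find_span(old_text, m) for m in mentions]
new_spans = [find_span(new_text, m) for m in mentions]

spans_match = old_spans == new_spans
texts_match = mention_texts(old_text, old_spans) == mention_texts(new_text, new_spans)

# 2) Parser stability: the digest changes when text, token order, or a bbox
#    moves by more than the rounding tolerance.
baseline = [
    {"text": "Hello", "bbox": [10.0, 20.0, 40.0, 30.0]},
    {"text": "world", "bbox": [45.0, 20.0, 80.0, 30.0]},
]
jittered = [
    {"text": "Hello", "bbox": [10.001, 20.0, 40.0, 30.0]},
    {"text": "world", "bbox": [45.0, 20.0, 80.0, 30.0]},
]
shifted = [
    {"text": "Hello", "bbox": [10.5, 20.0, 40.0, 30.0]},
    {"text": "world", "bbox": [45.0, 20.0, 80.0, 30.0]},
]

stable = parser_fingerprint(baseline) == parser_fingerprint(jittered)
changed = parser_fingerprint(baseline) != parser_fingerprint(shifted)
```

In a real suite the baseline would presumably be a stored fixture produced from a known PDF, and a digest mismatch would be the signal to reevaluate downstream predictors or revert the parser change.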