allenai / mmda

multimodal document analysis
Apache License 2.0

Fix for sg overlap error in box_groups_to_span_groups when center=true #276

Closed by geli-gel 10 months ago

geli-gel commented 10 months ago

Solution for https://github.com/allenai/scholar/issues/38452.

The problem was in running box_groups_to_span_groups with center=True. We want to keep center=True because in most cases it gives us good text when converting Grobid box groups to MMDA SpanGroups, but in some cases it caused a problem. The overlapping SpanGroups error seen from spp-grobid was always a single-character span found in 2 different SpanGroups. This happened because single-character tokens were sometimes not found to overlap with the box when running allocate_overlapping_tokens_for_box, but then got swallowed up by MergeSpans, and we weren't accounting for this.

The fix: after merge_neighbor_spans_by_symbol_distance is used to derive a SpanGroup, we now update the dictionary of remaining allocatable tokens to account for all the tokens the SpanGroup uses (instead of just the tokens whose centers overlapped with the original box).
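To make the failure mode concrete, here is a minimal toy sketch of the idea in one dimension. It is not the mmda implementation: Token, allocate_by_center, merge_spans, build_groups and has_overlap are hypothetical names invented for illustration, boxes are plain (x0, x1) intervals, and the merge rule (join spans whose character gap is at most 1) only stands in for the span merging described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    start: int      # character offset where the token begins
    end: int        # character offset where the token ends (exclusive)
    center: float   # x-coordinate of the token's center on the page


def allocate_by_center(box, tokens):
    """Return the tokens whose center falls inside the 1-D box (x0, x1)."""
    x0, x1 = box
    return [t for t in tokens if x0 <= t.center <= x1]


def merge_spans(spans, max_gap=1):
    """Merge (start, end) spans whose character gap is at most max_gap."""
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def build_groups(boxes, tokens, update_after_merge):
    """Build one list of merged spans per box from a shared token pool."""
    remaining = set(tokens)
    groups = []
    for box in boxes:
        picked = allocate_by_center(box, sorted(remaining, key=lambda t: t.start))
        spans = merge_spans([(t.start, t.end) for t in picked])
        if update_after_merge:
            # Fixed behavior: retire every token covered by the merged spans.
            used = {t for t in remaining
                    if any(s <= t.start and t.end <= e for s, e in spans)}
        else:
            # Old behavior: retire only the tokens whose centers hit the box.
            used = set(picked)
        remaining -= used
        groups.append(spans)
    return groups


def has_overlap(groups):
    """True if a span in one group intersects a span in another group."""
    flat = [(s, e, i) for i, g in enumerate(groups) for s, e in g]
    return any(a[2] != b[2] and a[0] < b[1] and b[0] < a[1]
               for a in flat for b in flat)


tokens = [
    Token(0, 5, 2.5),
    Token(5, 6, 11.0),   # single character whose center misses the first box
    Token(6, 10, 8.0),
    Token(11, 15, 12.0),
]
boxes = [(0.0, 10.0), (10.5, 20.0)]

buggy = build_groups(boxes, tokens, update_after_merge=False)
fixed = build_groups(boxes, tokens, update_after_merge=True)
print(has_overlap(buggy), has_overlap(fixed))
```

In this toy setup the single-character token at offsets 5 to 6 is swallowed when its neighbors are merged for the first box. Unless it is also removed from the remaining pool, the second box claims it again, which is the kind of overlap described above.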

geli-gel commented 10 months ago

To reproduce the overlap error seen in the ddog log, download the Grobid XML here and run the following (this runs in a notebook located at examples/grobid_augment_existing_document_parser/notebook.ipynb):

PDF_PATH = 'pdfs/3d5c8c04f42be8bc2fd6038e6f4099bbcfaa0c54.pdf'
XML_PATH = '3d5c8c04f42be8bc2fd6038e6f4099bbcfaa0c54.xml'

from mmda.parsers import PDFPlumberParser
from mmda.types import Document

pdf_plumber = PDFPlumberParser()
doc: Document = pdf_plumber.parse(input_pdf_path=PDF_PATH)
doc.fields

from mmda.parsers.grobid_augment_existing_document_parser import GrobidAugmentExistingDocumentParser
parser = GrobidAugmentExistingDocumentParser(config_path='../../src/mmda/parsers/grobid.config', check_server=False)

with open(XML_PATH) as f:  # close the file handle once the XML has been read
    xml = f.read()
doc = parser._parse_xml_onto_doc(xml, doc)  # error happens here when trying to annotate sections onto the doc