Fix for sg overlap error in box_groups_to_span_groups when center=true

allenai / mmda

multimodal document analysis

Apache License 2.0

Solution for https://github.com/allenai/scholar/issues/38452

The problem was in running box_groups_to_span_groups with center=True. We want to keep center=True because in most cases it gives us good text when converting Grobid box groups to MMDA SpanGroups. In some cases, however, it caused a problem: the overlapping-SpanGroups error seen from spp-grobid was always a single-character span found in two different SpanGroups.

This was due to single-character tokens that were sometimes not found to overlap with the box when running allocate_overlapping_tokens_for_box, but then got swallowed up by MergeSpans, and we weren't accounting for this. Such a token stayed in the set of allocatable tokens, so a later box could claim it again.

The fix: after merge_neighbor_spans_by_symbol_distance is used to derive a SpanGroup, we now update the dictionary of remaining allocatable tokens with all the tokens used (instead of just the tokens whose centers overlapped with the original box).
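The failure mode and the fix can be illustrated with a small, self-contained sketch. It does not use the MMDA API: `Token`, `merge_spans`, `allocate`, and `has_overlap` are hypothetical stand-ins, boxes are reduced to a 1-D horizontal range, and the merge rule is a simplified assumption rather than the actual behaviour of merge_neighbor_spans_by_symbol_distance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    start: int       # character offset where the token begins
    end: int         # character offset where the token ends (exclusive)
    center_x: float  # horizontal center of the token on the page


def merge_spans(spans, max_gap=1):
    """Merge (start, end) spans whose character gap is at most max_gap."""
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def allocate(tokens, boxes, update_pool_after_merge):
    """Assign tokens to boxes by center overlap, then merge their spans."""
    remaining = list(tokens)
    groups = []
    for x0, x1 in boxes:
        picked = [t for t in remaining if x0 <= t.center_x <= x1]
        spans = merge_spans([(t.start, t.end) for t in picked])
        if update_pool_after_merge:
            # Fixed behaviour: every token covered by a merged span is used up.
            used = [t for t in remaining
                    if any(s <= t.start and t.end <= e for s, e in spans)]
        else:
            # Old behaviour: only tokens whose center was in the box are used up.
            used = picked
        remaining = [t for t in remaining if t not in used]
        groups.append(spans)
    return groups


def has_overlap(groups):
    """True if any two spans across all groups share a character."""
    flat = sorted(span for group in groups for span in group)
    return any(prev[1] > cur[0] for prev, cur in zip(flat, flat[1:]))


# A single-character token (5, 6) sits between two longer tokens, but its
# center falls outside the first box and inside the second one.
TOKENS = [Token(0, 5, 10.0), Token(5, 6, 55.0), Token(6, 10, 20.0)]
BOXES = [(0.0, 30.0), (50.0, 60.0)]

buggy = allocate(TOKENS, BOXES, update_pool_after_merge=False)
fixed = allocate(TOKENS, BOXES, update_pool_after_merge=True)
```

In this toy setup, the old behaviour lets the merged span of the first box swallow the single-character token while leaving it in the pool, so the second box claims it again and the two groups overlap. Updating the pool from the merged spans removes that token before the second box is processed.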

PDF_PATH = 'pdfs/3d5c8c04f42be8bc2fd6038e6f4099bbcfaa0c54.pdf'
XML_PATH = '3d5c8c04f42be8bc2fd6038e6f4099bbcfaa0c54.xml'

from mmda.parsers import PDFPlumberParser
from mmda.types import Document

pdf_plumber = PDFPlumberParser()
doc: Document = pdf_plumber.parse(input_pdf_path=PDF_PATH)
doc.fields

from mmda.parsers.grobid_augment_existing_document_parser import GrobidAugmentExistingDocumentParser

parser = GrobidAugmentExistingDocumentParser(config_path='../../src/mmda/parsers/grobid.config', check_server=False)
xml = open(XML_PATH).read()
# error happens here when trying to annotate sections onto the doc
doc = parser._parse_xml_onto_doc(xml, doc)

Fix for sg overlap error in box_groups_to_span_groups when center=true #276