Closed by geli-gel 10 months ago
To reproduce the overlap error seen in the ddog log, download the Grobid XML here and run the following (this runs in a notebook located at examples/grobid_augment_existing_document_parser/notebook.ipynb):
PDF_PATH = 'pdfs/3d5c8c04f42be8bc2fd6038e6f4099bbcfaa0c54.pdf'
XML_PATH = '3d5c8c04f42be8bc2fd6038e6f4099bbcfaa0c54.xml'
from mmda.parsers import PDFPlumberParser
from mmda.types import Document
pdf_plumber = PDFPlumberParser()
doc: Document = pdf_plumber.parse(input_pdf_path=PDF_PATH)
doc.fields
from mmda.parsers.grobid_augment_existing_document_parser import GrobidAugmentExistingDocumentParser
parser = GrobidAugmentExistingDocumentParser(config_path='../../src/mmda/parsers/grobid.config', check_server=False)
xml = open(XML_PATH).read()
doc = parser._parse_xml_onto_doc(xml, doc) # error happens here when trying to annotate sections onto the doc
Solution for https://github.com/allenai/scholar/issues/38452

The problem was in running `box_groups_to_span_groups` with `center=True`. We want to keep `center=True` because it gives us good text in most cases when converting from Grobid box groups to MMDA SpanGroups. However, in some cases it caused a problem. The overlapping-SpanGroups error seen from spp-grobid was always a single-character span found in two different SpanGroups. This happened because single-character tokens were sometimes not found to overlap with the box when running `allocate_overlapping_tokens_for_box`, but then got swallowed up by MergeSpans, and we weren't accounting for this. Now, after `merge_neighbor_spans_by_symbol_distance` is used to derive a SpanGroup, we update the dictionary of remaining allocatable tokens with all the tokens used (instead of just the tokens whose centers overlapped with the original box).
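To make the failure mode concrete, here is a minimal, self-contained sketch of the idea. It does not use the MMDA API: `Token`, `center_in_box`, `merge_spans`, `boxes_to_spans`, `find_overlaps`, and the `update_after_merge` flag are all hypothetical stand-ins invented for illustration, and the coordinates are made up. A comma whose center falls just below the first box is skipped by the center check, swallowed by the merge, and then picked up again by the second box unless the remaining-token dictionary is updated from the merged spans.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token: a character span plus the center of its bounding box."""
    start: int
    end: int
    cx: float
    cy: float


def center_in_box(tok: Token, box: tuple) -> bool:
    """Center-based overlap test (stand-in for the center=True behaviour)."""
    x0, y0, x1, y1 = box
    return x0 <= tok.cx <= x1 and y0 <= tok.cy <= y1


def merge_spans(spans: list, max_gap: int = 2) -> list:
    """Merge spans whose character gap is at most max_gap (stand-in for MergeSpans)."""
    merged: list = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def boxes_to_spans(tokens: list, boxes: list, update_after_merge: bool) -> list:
    """Allocate tokens to boxes in order, returning one list of merged spans per box."""
    remaining = {(t.start, t.end): t for t in tokens}
    results = []
    for box in boxes:
        hits = [t for t in remaining.values() if center_in_box(t, box)]
        merged = merge_spans([(t.start, t.end) for t in hits])
        if update_after_merge:
            # Fixed behaviour: drop every token covered by a merged span,
            # including tokens the center test missed.
            used = [k for k in remaining
                    if any(s <= k[0] and k[1] <= e for s, e in merged)]
        else:
            # Old behaviour: drop only the tokens whose centers hit the box.
            used = [(t.start, t.end) for t in hits]
        for key in used:
            del remaining[key]
        results.append(merged)
    return results


def find_overlaps(results: list) -> list:
    """Return pairs of spans (across all groups) that overlap in character range."""
    flat = sorted(span for group in results for span in group)
    overlaps, prev = [], None
    for span in flat:
        if prev is not None and span[0] < prev[1]:
            overlaps.append((prev, span))
        if prev is None or span[1] > prev[1]:
            prev = span
    return overlaps


# "Hello, world" on one line and "again" on the next; the comma's center sits
# just below the first box (y grows downward), so the center test misses it.
TOKENS = [
    Token(0, 5, 2.5, 5.0),     # Hello
    Token(5, 6, 5.5, 10.5),    # ,
    Token(7, 12, 9.5, 5.0),    # world
    Token(13, 18, 2.5, 15.0),  # again
]
BOXES = [(0, 0, 12, 10), (0, 10, 12, 20)]

buggy = boxes_to_spans(TOKENS, BOXES, update_after_merge=False)
fixed = boxes_to_spans(TOKENS, BOXES, update_after_merge=True)
print(find_overlaps(buggy))  # the single-character comma span appears twice
print(find_overlaps(fixed))  # no overlaps
```

With `update_after_merge=False` the comma span `(5, 6)` ends up both inside the merged span `(0, 12)` and as its own span in the second group, which mirrors the single-character overlap described above. The real fix lives in the MMDA box-to-span conversion code, so treat this only as an illustration of the bookkeeping change.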