allenai / mmda

multimodal document analysis
Apache License 2.0

get spans with _annotate_box_group even if token boxes are on box_group instead of span #217

Closed geli-gel closed 1 year ago

geli-gel commented 1 year ago

Part of https://github.com/allenai/scholar/issues/35862. We ran into an issue where we were unable to determine the underlying text from box_groups: the SPP annotation store keeps PdfPlumber output in "SpanGroups with BoxGroups" format, so the token box that is expected on the token span's .box is no longer found there. This PR updates _annotate_box_group to check a page of tokens and determine where the token's box is stored, then passes that information on to allocate_overlapping_tokens_for_box so that it also knows where to look for the boxes. It also adds a method to MergeSpans so that a default MergeSpans instance can be created from SpanGroups instead of spans (specifically for this use case).
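To illustrate the idea, here is a minimal, self-contained sketch of the "check where the token box lives, then pass that on" approach. It does not use mmda's actual classes or signatures: the `Box`, `Span`, `BoxGroup`, and `SpanGroup` dataclasses and the helpers `token_box_in_box_group`, `token_boxes`, and `tokens_overlapping` are hypothetical stand-ins written for this example only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical stand-ins for illustration; these are NOT mmda's real types.
@dataclass
class Box:
    l: float
    t: float
    w: float
    h: float
    page: int


@dataclass
class Span:
    start: int
    end: int
    box: Optional[Box] = None


@dataclass
class BoxGroup:
    boxes: List[Box] = field(default_factory=list)


@dataclass
class SpanGroup:
    spans: List[Span]
    box_group: Optional[BoxGroup] = None


def token_box_in_box_group(tokens: List[SpanGroup]) -> bool:
    """Inspect a page of tokens and report where their boxes are stored.

    Returns False if every span carries its own .box, True if the boxes
    are on each token's box_group instead.
    """
    if all(s.box is not None for t in tokens for s in t.spans):
        return False
    if all(t.box_group is not None and t.box_group.boxes for t in tokens):
        return True
    raise ValueError("tokens carry boxes on neither span.box nor box_group")


def token_boxes(token: SpanGroup, in_box_group: bool) -> List[Box]:
    """Fetch a token's boxes from whichever location was detected."""
    if in_box_group:
        return list(token.box_group.boxes)
    return [s.box for s in token.spans]


def overlaps(a: Box, b: Box) -> bool:
    """Axis-aligned rectangle intersection on the same page."""
    return (
        a.page == b.page
        and a.l < b.l + b.w
        and b.l < a.l + a.w
        and a.t < b.t + b.h
        and b.t < a.t + a.h
    )


def tokens_overlapping(region: Box, tokens: List[SpanGroup]) -> List[SpanGroup]:
    """Select the tokens whose boxes intersect the region, in either format."""
    in_group = token_box_in_box_group(tokens)
    return [
        t for t in tokens
        if any(overlaps(region, b) for b in token_boxes(t, in_group))
    ]
```

With this shape, the same region query selects the same tokens whether the boxes sit on each span's `.box` or on the token's `box_group`; the detection runs once per page rather than once per token.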

todo: