Part of https://github.com/allenai/scholar/issues/35862. We ran into an issue where we can't determine the underlying text from box_groups: the SPP annotation store keeps PdfPlumber output in "SpanGroups with BoxGroups" format, so the token box that's expected on the token span's `.box` is no longer found there.
This PR makes two changes. First, `_annotate_box_group` now checks a page of tokens to determine where the token's box is stored and passes that information on to `allocate_overlapping_tokens_for_box`, so that tool also knows where to look for the boxes. Second, it adds a method to `MergeSpans` so that a default `MergeSpans` instance can be created from SpanGroups instead of spans (specifically for this use case).
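To illustrate the "check a page of tokens" step, here is a minimal, self-contained sketch. It is not the code in this PR: the `Box`, `Span`, `BoxGroup`, and `SpanGroup` classes are simplified stand-ins, and the helper names `token_box_in_box_group` and `boxes_for_token` are hypothetical. It assumes a token's box lives either on each span's `.box` or on the token's `.box_group`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Simplified stand-ins for the real types (assumption: only the fields
# needed for this illustration are modelled).
@dataclass
class Box:
    l: float
    t: float
    w: float
    h: float
    page: int


@dataclass
class Span:
    start: int
    end: int
    box: Optional[Box] = None


@dataclass
class BoxGroup:
    boxes: List[Box] = field(default_factory=list)


@dataclass
class SpanGroup:
    spans: List[Span] = field(default_factory=list)
    box_group: Optional[BoxGroup] = None


def token_box_in_box_group(page_tokens: List[SpanGroup]) -> bool:
    """Hypothetical helper: True if token boxes live on SpanGroup.box_group
    rather than on Span.box ("SpanGroups with BoxGroups" format)."""
    for token in page_tokens:
        # Any box on a span means the span-level convention is in use.
        if any(span.box is not None for span in token.spans):
            return False
        if token.box_group is not None and token.box_group.boxes:
            return True
    # No boxes found anywhere: fall back to the Span.box convention.
    return False


def boxes_for_token(token: SpanGroup, in_box_group: bool) -> List[Box]:
    """Hypothetical helper: read a token's boxes from the detected location."""
    if in_box_group:
        return list(token.box_group.boxes) if token.box_group else []
    return [span.box for span in token.spans if span.box is not None]
```

The detection runs once per page, and the resulting flag is passed along, so downstream overlap checks don't have to re-inspect every token.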
todo: