allenai / mmda

multimodal document analysis
Apache License 2.0

Boxgrps to spangrps tool #242

Closed geli-gel closed 1 year ago

geli-gel commented 1 year ago

Fix for https://github.com/allenai/scholar/issues/36624. Adds a tool to get spangroups from boxgroups using the new is_overlap(center=True) logic, which emulates LayoutParser's .filter_by(center=True). That filter was being used in local testing of the bib entry detector and gave better text than SPP, which was using the default MMDA _annotate_box_groups.

Basically, this takes _annotate_box_groups, moves it to tools.py, and adds a couple of changes: it defaults to matching on token centers only, and it adds a little x padding so that narrow tokens like "[" get picked up (they would be missed without the extra padding).
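For illustration, here is a minimal, self-contained sketch of the center-based matching with x padding described above. It is not the code in this PR and does not use the MMDA API: the `Box` class, `center_is_inside`, `select_token_indices`, and the `x_padding` parameter are all hypothetical names, and the coordinates are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box: (l, t) is the top-left corner."""
    l: float
    t: float
    w: float
    h: float

    @property
    def center(self) -> Tuple[float, float]:
        return (self.l + self.w / 2, self.t + self.h / 2)


def center_is_inside(container: Box, token: Box, x_padding: float = 0.0) -> bool:
    """True if the token's center falls inside the container box.

    The container is widened by x_padding on the left and right only,
    so narrow tokens hugging the edge (such as "[") are still matched.
    """
    cx, cy = token.center
    left = container.l - x_padding
    right = container.l + container.w + x_padding
    top = container.t
    bottom = container.t + container.h
    return left <= cx <= right and top <= cy <= bottom


def select_token_indices(
    container: Box, tokens: List[Box], x_padding: float = 0.0
) -> List[int]:
    """Indices of the tokens whose centers fall inside the (padded) container."""
    return [
        i for i, tok in enumerate(tokens)
        if center_is_inside(container, tok, x_padding)
    ]


# Made-up layout: one predicted bib-entry box and four token boxes.
entry_box = Box(l=10, t=10, w=100, h=12)
tokens = [
    Box(l=7, t=10, w=2, h=12),     # "[" sitting just left of the entry box
    Box(l=12, t=10, w=10, h=12),   # "1]" fully inside
    Box(l=12, t=24, w=30, h=12),   # token on the next line
    Box(l=105, t=10, w=20, h=12),  # token mostly past the right edge
]

without_padding = select_token_indices(entry_box, tokens)
with_padding = select_token_indices(entry_box, tokens, x_padding=3)
```

In this made-up layout the "[" token is only selected once padding is applied, while the token that mostly extends past the right edge stays excluded because its center is still outside the padded box.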