allenai / mmda

multimodal document analysis
Apache License 2.0

group boxes into lines with tolerance #274

Closed dmh43 closed 10 months ago

dmh43 commented 11 months ago

This PR groups boxes into lines without requiring box.t values to match exactly. We compare at the 3rd decimal place, which seems small enough to keep distinct lines apart but large enough to catch most cases.
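To illustrate the idea, here is a minimal sketch of grouping by a rounded top coordinate. It is not the code from this PR: the `Box` dataclass is a stand-in with the usual `l`, `t`, `w`, `h`, `page` fields, and `group_boxes_into_lines` and its `ndigits` parameter are hypothetical names. It also assumes "3rd decimal point" means rounding `box.t` to three decimal places.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    # Stand-in for a layout box; coordinates are assumed to be page-relative.
    l: float
    t: float
    w: float
    h: float
    page: int = 0


def group_boxes_into_lines(boxes, ndigits=3):
    """Group boxes whose top coordinate agrees after rounding to `ndigits`."""
    lines = defaultdict(list)
    for box in boxes:
        # Boxes on the same page with nearly equal `t` share a key.
        key = (box.page, round(box.t, ndigits))
        lines[key].append(box)
    # Lines in page/top order; boxes within a line in left-to-right order.
    return [sorted(lines[key], key=lambda b: b.l) for key in sorted(lines)]
```

One caveat with rounding: two values that straddle a rounding boundary (for example 0.1234 and 0.1236) still land in different groups even though they differ by less than 0.001, so this catches most near-matches rather than all of them.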

This PR also changes how adjacent boxes are merged for a mention that might have 2 spans that are far apart. Instead of skipping the merge entirely in that case, it merges only the boxes whose associated spans are close.
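A rough sketch of that merging rule, again not the PR's code: `Span`, `Box`, `union_box`, `merge_close_spans`, and the `max_gap` threshold are all hypothetical, and "close" is assumed to mean a small character gap between consecutive spans.

```python
from dataclasses import dataclass


@dataclass
class Box:
    # Stand-in for a layout box with left, top, width, height.
    l: float
    t: float
    w: float
    h: float


@dataclass
class Span:
    # Stand-in for a character span with its associated box.
    start: int
    end: int
    box: Box


def union_box(a, b):
    """Smallest box that covers both a and b."""
    left = min(a.l, b.l)
    top = min(a.t, b.t)
    right = max(a.l + a.w, b.l + b.w)
    bottom = max(a.t + a.h, b.t + b.h)
    return Box(left, top, right - left, bottom - top)


def merge_close_spans(spans, max_gap=1):
    """Merge consecutive spans (and their boxes) separated by at most `max_gap` characters."""
    merged = []
    for span in sorted(spans, key=lambda s: s.start):
        if merged and span.start - merged[-1].end <= max_gap:
            prev = merged[-1]
            merged[-1] = Span(prev.start, max(prev.end, span.end),
                              union_box(prev.box, span.box))
        else:
            # Far from the previous span: keep it as its own box.
            merged.append(Span(span.start, span.end, span.box))
    return merged
```

With this rule, a mention whose spans sit far apart keeps separate boxes for the distant parts, while neighbouring spans still collapse into one box.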

geli-gel commented 11 months ago

Looks like this will improve things, but I'm wondering if we could take advantage of PDFPlumber's line segmentation (rows) downstream to decide how to draw boxes.

geli-gel commented 11 months ago

Just realized a version bump is needed here. I'm setting it to 0.9.11 in my PR, so you can take 0.9.12.