HazyResearch / pdftotree

:evergreen_tree: A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
MIT License

Use the centroid to check if textlines are within a bbox #97

Closed: HiromuHota closed this 4 years ago

HiromuHota commented 4 years ago

Description of the problems or issues

Is your pull request related to a problem? Please describe.

See #96

Does your pull request fix any issue?

Fix #96

Description of the proposed changes

Use the centroid of each textline, rather than its whole bbox, to check whether the textline lies within a bbox. This loosens the isContained check so that it tolerates the fluctuating bboxes returned by Tabula.
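To illustrate the difference, here is a minimal sketch of the two checks. It is not the code from this pull request: the function names, the `(x0, y0, x1, y1)` bbox layout, and the sample coordinates are all assumptions made for the example.

```python
from typing import Tuple

# Assumed bbox layout for this sketch: (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
BBox = Tuple[float, float, float, float]


def centroid(bbox: BBox) -> Tuple[float, float]:
    """Return the center point of a bbox."""
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def is_fully_contained(inner: BBox, outer: BBox) -> bool:
    """Strict check (hypothetical name): every edge of inner lies within outer."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


def is_centroid_contained(inner: BBox, outer: BBox) -> bool:
    """Loose check (hypothetical name): only the center of inner must lie within outer."""
    cx, cy = centroid(inner)
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


# Made-up coordinates: the textline sticks out of the cell by 1.5 units on the left,
# as could happen when the detected cell bbox fluctuates slightly.
cell = (100.0, 200.0, 180.0, 220.0)
textline = (98.5, 203.0, 150.0, 215.0)

strict_result = is_fully_contained(textline, cell)
loose_result = is_centroid_contained(textline, cell)
```

With these sample values the strict check rejects the textline because its left edge falls just outside the cell, while the centroid check accepts it. The trade-off is that a textline straddling two cells is assigned to whichever cell contains its center.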

Test plan

Check that all cell values are extracted and none are missed.

Checklist