HazyResearch / pdftotree

:evergreen_tree: A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
MIT License

Inconsistent data models for bbox #87

Open HiromuHota opened 3 years ago

HiromuHota commented 3 years ago

Describe the bug

Data models that represent bounding boxes are inconsistent, which considerably degrades readability. For example,

bbox: List[float] in the order of (y0, x0, y1, x1) at

https://github.com/HazyResearch/pdftotree/blob/6ff4a7cb5fe6269e3c287664392e226ca45479d4/pdftotree/TreeExtract.py#L447

bbox: Tuple[float] in the order of (x0, y0, x1, y1) at

https://github.com/HazyResearch/pdftotree/blob/6ff4a7cb5fe6269e3c287664392e226ca45479d4/pdftotree/TreeExtract.py#L456

word[1:]: List[float] in the order of (y0, x0, y1, x1) at

https://github.com/HazyResearch/pdftotree/blob/6ff4a7cb5fe6269e3c287664392e226ca45479d4/pdftotree/TreeExtract.py#L458-L463

To Reproduce

N/A

Expected behavior

I expect every bounding box to use the same data model: one type and one coordinate order.
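One way a consistent model could look is sketched below. This is only an illustration, not code from pdftotree: the `BBox` class, its `from_yxyx` and `to_yxyx` helpers, and the sample `word` list (assumed here to be `[text, y0, x0, y1, x1]`) are all hypothetical names and values chosen for the example.

```python
from typing import NamedTuple, Sequence, Tuple


class BBox(NamedTuple):
    """Hypothetical unified bounding box, always ordered (x0, y0, x1, y1)."""

    x0: float
    y0: float
    x1: float
    y1: float

    @classmethod
    def from_yxyx(cls, coords: Sequence[float]) -> "BBox":
        """Build a BBox from coordinates ordered (y0, x0, y1, x1)."""
        y0, x0, y1, x1 = coords
        return cls(x0, y0, x1, y1)

    def to_yxyx(self) -> Tuple[float, float, float, float]:
        """Return the coordinates in (y0, x0, y1, x1) order."""
        return (self.y0, self.x0, self.y1, self.x1)


# Assumed layout for illustration: word = [text, y0, x0, y1, x1],
# so word[1:] is in (y0, x0, y1, x1) order.
word = ["hello", 10.0, 20.0, 30.0, 40.0]
box = BBox.from_yxyx(word[1:])
print(box)
```

With a single type like this, the coordinate order is fixed by the field names, and any conversion from the (y0, x0, y1, x1) order happens in one explicit place instead of being implied by context.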

Error Logs/Screenshots

N/A

Additional context

See discussions at https://github.com/HazyResearch/pdftotree/pull/84#discussion_r502049405