HazyResearch / pdftotree

:evergreen_tree: A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
MIT License

Switch from Tabula to Camelot? #78

Open HiromuHota opened 4 years ago

HiromuHota commented 4 years ago

Is your feature request related to a problem? Please describe.

Switching from Tabula to Camelot has two advantages:

  1. Tabula is written in Java, whereas Camelot is written in Python. Switching to Camelot frees us from the Java dependency.
  2. Camelot seems to perform better at table recognition.

Describe the solution you'd like

I'd like to switch from Tabula to Camelot if it makes more sense. Currently, pdftotree detects the table "area" (using either ml, vision, or heuristic) and then uses Tabula for table recognition within that area. I'd have to figure out whether Camelot takes an area argument like Tabula does.

Describe alternatives you've considered

It should be fine even if Camelot does not take an area argument, as long as it detects tables well on its own.

Additional context

According to https://arxiv.org/pdf/1911.10683.pdf,

Camelot is the best off-the-shelf tool in this comparison.

lukehsiao commented 4 years ago

In general, simplifying dependencies sounds like a big win to me, especially if performance is comparable or better.

HiromuHota commented 4 years ago

@lukehsiao thanks for your thoughts.

I just confirmed that Camelot allows specifying table areas (and pages): https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-table-areas
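For illustration, here is a rough sketch of how a detected table area could be handed to Camelot. Camelot's `read_pdf` accepts `pages` and `table_areas`, where each area is a string `"x1,y1,x2,y2"` giving the top-left and bottom-right corners in PDF coordinate space (origin at the bottom-left of the page). The helper names `to_camelot_area` and `extract_tables` are hypothetical, and the assumption that the detected box uses a top-left origin is only for the sake of the example; pdftotree's actual box format may differ.

```python
def to_camelot_area(x0, top, x1, bottom, page_height):
    """Convert a box given in top-left-origin coordinates (an assumption
    made for this sketch) into Camelot's "x1,y1,x2,y2" string, which uses
    PDF coordinates with the origin at the bottom-left of the page."""
    y_top = page_height - top        # flip the y axis
    y_bottom = page_height - bottom
    return f"{x0:g},{y_top:g},{x1:g},{y_bottom:g}"


def extract_tables(pdf_path, page, areas):
    """Run Camelot only inside the given areas on a single page.
    Hypothetical wrapper; `areas` is a list of "x1,y1,x2,y2" strings."""
    import camelot  # imported lazily so the helper above works without it

    return camelot.read_pdf(
        pdf_path,
        pages=str(page),       # Camelot expects pages as a string, e.g. "1"
        flavor="stream",
        table_areas=areas,
    )
```

Usage might look like `extract_tables("doc.pdf", 1, [to_camelot_area(316, 293, 566, 455, 792)])`, where the file name, the box, and the page height of 792 points are made-up values.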