HazyResearch / pdftotree

:evergreen_tree: A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
MIT License

Duplicate text and table in the extraction result #121

Open yetnikoff opened 2 years ago

yetnikoff commented 2 years ago

Describe the bug The first and second pages of the output contain the same text. Pages 4 and 5 are also identical to each other.

To Reproduce Steps to reproduce the behavior:

  1. Download the PDF: AIMCO.pdf
  2. Run: pdftotree.parse(r"\PATH\TO\AIMCO-2019.pdf", html_path=r"\PATH\TO\output.html", visualize=False)
  3. Check the hOCR output.
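
To make step 3 repeatable, the page comparison could be scripted roughly as below. This is only a sketch, not part of pdftotree: it assumes each page in the output is an element whose class includes `ocr_page` (as the hOCR format describes), and `PageTextCollector` and `find_duplicate_pages` are hypothetical helper names introduced here.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser


class PageTextCollector(HTMLParser):
    """Collect the visible text of every element whose class includes 'ocr_page'."""

    # Void elements never get a closing tag, so they must not affect nesting depth.
    VOID_TAGS = {"br", "img", "meta", "link", "hr", "input"}

    def __init__(self):
        super().__init__()
        self.pages = []   # one list of text fragments per page, in document order
        self._depth = 0   # nesting depth inside the current page (0 = outside any page)

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0:
            if "ocr_page" in classes:
                self.pages.append([])
                self._depth = 1
        else:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS:
            return
        if self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            # Normalize whitespace so layout differences do not hide duplicates.
            self.pages[-1].append(" ".join(data.split()))


def find_duplicate_pages(hocr_html):
    """Return groups of 1-based page numbers whose extracted text is identical."""
    collector = PageTextCollector()
    collector.feed(hocr_html)
    collector.close()

    groups = defaultdict(list)
    for number, fragments in enumerate(collector.pages, start=1):
        if not fragments:
            continue  # ignore empty pages rather than reporting them as duplicates
        digest = hashlib.sha256(" ".join(fragments).encode("utf-8")).hexdigest()
        groups[digest].append(number)
    return [pages for pages in groups.values() if len(pages) > 1]
```

Calling `find_duplicate_pages` on the contents of the generated output.html would be expected to report pages 1 and 2, and pages 4 and 5, in the same groups if the duplication described above is present.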

Expected behavior Each page of the output file should contain only its own text and tables.

Error Logs/Screenshots

Environment (please complete the following information):

Additional context If this behavior is expected, would it be possible to add a variable that keeps track of the text and tables already extracted? (I am not very experienced in programming.)
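
The idea of a variable that remembers what was already extracted could look something like the sketch below. It does not reflect how pdftotree works internally; `SeenTracker` and `drop_repeats` are hypothetical names, and the input is assumed to be a plain list of already-extracted text blocks.

```python
import hashlib


class SeenTracker:
    """Remember which text blocks have already been emitted."""

    def __init__(self):
        self._seen = set()

    def is_new(self, text):
        # Compare on whitespace-normalized text so reflowed copies still match.
        normalized = " ".join(text.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def drop_repeats(blocks):
    """Keep only the first occurrence of each block, preserving order."""
    tracker = SeenTracker()
    return [block for block in blocks if tracker.is_new(block)]
```

A filter like this would also drop content that legitimately repeats, such as running headers or two identical tables, so it is closer to a post-processing workaround than a fix for whatever causes the pages to be duplicated in the first place.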