HazyResearch / pdftotree

:evergreen_tree: A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
MIT License

Getting error `SEVERE: Cannot read JBIG2 image: jbig2-imageio is not installed` #117

Open pgarz opened 3 years ago

pgarz commented 3 years ago

Describe the bug

I'm getting the following error and stack trace when running pdftotree on a PDF that contains scientific chemical information:

SEVERE: Cannot read JBIG2 image: jbig2-imageio is not installed
[DEBUG] pdftotree.TreeExtract - Tabula recognized 0 table(s).
Traceback (most recent call last):
  File "/opt/anaconda3/envs/noble_app_env/bin/pdftotree", line 94, in <module>
    args.visualize,
  File "/opt/anaconda3/envs/noble_app_env/lib/python3.7/site-packages/pdftotree/core.py", line 66, in parse
    pdf_html = extractor.get_html_tree()
  File "/opt/anaconda3/envs/noble_app_env/lib/python3.7/site-packages/pdftotree/TreeExtract.py", line 319, in get_html_tree
    page.appendChild(table_element)
  File "/opt/anaconda3/envs/noble_app_env/lib/python3.7/xml/dom/minidom.py", line 114, in appendChild
    if node.nodeType == self.DOCUMENT_FRAGMENT_NODE:
AttributeError: 'NoneType' object has no attribute 'nodeType'
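
The last frame of the trace shows that `page.appendChild(table_element)` was called with `table_element` set to `None`. A minimal sketch that reproduces just that final error outside pdftotree is below. The `try_append` helper and the `div` element are hypothetical names used for illustration only; this does not show why pdftotree produced `None` in the first place, nor whether the JBIG2 warning is related.

```python
from xml.dom import minidom

# Build a tiny DOM, similar in spirit to the page element in the trace.
doc = minidom.Document()
page = doc.createElement("div")


def try_append(parent, child):
    """Return None on success, or the error message if appendChild fails.

    Hypothetical helper for illustration; not part of pdftotree.
    """
    try:
        parent.appendChild(child)
    except AttributeError as exc:
        return str(exc)
    return None


# Appending None fails inside minidom when it reads node.nodeType.
error = try_append(page, None)
print(error)

# Appending a real element succeeds.
ok = try_append(page, doc.createElement("span"))
print(ok)
```

If this matches what happens in pdftotree, a guard that skips `None` table elements before `appendChild` would avoid the crash, although the underlying table would still be missing from the output.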

I've installed the latest Java version for Mac OS X. pdftotree seems to work just fine on simple PDFs. I also haven't been able to figure out how to install jbig2-imageio manually; I'm not familiar with how to add that JAR file to the pdftotree installation.

To Reproduce

  1. Install the Java JDK for Mac OS X
  2. Install ImageMagick with brew
  3. Attempt to run hOCR extraction with pdftotree on a file with chemical molecule images

Expected behavior

The command should execute successfully and generate the proper hOCR output.

redbrain commented 1 year ago

Same here, any updates on this?