HazyResearch / pdftotree

:evergreen_tree: A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
MIT License
434 stars 92 forks

extract_tables missing function 'analyze_pages' from ./utils/pdf/pdf_utils.py #115

Closed. JBBalling closed this issue 3 years ago.

JBBalling commented 3 years ago

Hello there,

I am interested in your module for generating HTML documents from PDF documents, especially for table extraction with fonduer. Unfortunately, the table extraction and conversion from PDF to HTML did not give good results on my examples, so I tried your ML approach to table detection, aiming to train a model for my use case with extract_tables.

When I ran extract_tables from the CLI, I got the following error:

Traceback (most recent call last):
  File "/home/julian/anaconda3/envs/layoutP/bin/extract_tables", line 13, in <module>
    from pdftotree.ml.TableExtractML import TableExtractorML
  File "/home/julian/anaconda3/envs/layoutP/lib/python3.7/site-packages/pdftotree/ml/TableExtractML.py", line 21, in <module>
    from pdftotree.utils.pdf.pdf_utils import analyze_pages, normalize_pdf
ImportError: cannot import name 'analyze_pages' from 'pdftotree.utils.pdf.pdf_utils' (/home/julian/anaconda3/envs/layoutP/lib/python3.7/site-packages/pdftotree/utils/pdf/pdf_utils.py)

Test scenario:
Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic
Python: 3.7

As the error says, the function 'analyze_pages' is missing from the current repo. Is an update that fixes this coming soon?

Thank you in advance!
Julian

lukehsiao commented 3 years ago

This looks like a duplicate of #105, which was resolved with #106. Please try the version on master.
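
For anyone hitting the same ImportError, a quick way to see whether the installed copy already contains the fix is to check whether the module exposes the missing name. The snippet below is only a minimal sketch: the helper `module_exposes` is a hypothetical name introduced here, not part of pdftotree, and the module path and function names are taken from the traceback above.

```python
import importlib


def module_exposes(module_name: str, attr: str) -> bool:
    """Return True if `module_name` imports cleanly and defines `attr`.

    Hypothetical helper for illustration; not part of pdftotree.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module (or something it imports) is not available
        # in the current environment.
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    # Module path and names copied from the traceback in this issue.
    name = "pdftotree.utils.pdf.pdf_utils"
    for attr in ("analyze_pages", "normalize_pdf"):
        print(f"{name}.{attr}: {module_exposes(name, attr)}")
```

If `analyze_pages` is reported as missing, the installed package likely predates the fix, and installing from master as suggested above should help. Assuming the repository lives at its usual GitHub location, something like `pip install git+https://github.com/HazyResearch/pdftotree.git` is one way to do that.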