euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License
5.25k stars, 1.13k forks

pdfminer vs PyPDF2 parsing speed #262

Closed TobiasJu closed 4 years ago

TobiasJu commented 4 years ago

I used the pdfminer library and it is functional, but sadly there is one big problem that makes it completely unsuitable for me: it is too slow. Here is an example taken from http://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/ using this free PDF: https://web.stanford.edu/~jurafsky/slp3/edbook_oct162019.pdf

import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

def extract_text_by_page(pdf_path):
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, 
                                      caching=True,
                                      check_extractable=True):
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager, fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            page_interpreter.process_page(page)

            text = fake_file_handle.getvalue()
            yield text

            # close open handles
            converter.close()
            fake_file_handle.close()

def extract_text(pdf_path):
    for page in extract_text_by_page(pdf_path):
        print(page)
        print()

if __name__ == '__main__':
    extract_text('edbook_oct162019.pdf')

This script takes about 54.8 s to parse one document, while the same implementation with PyPDF2 takes just 11.3 s.

I am planning to parse 1,000 to 10,000 PDFs, and PyPDF2 seems to be about 5 times faster, so it's the obvious choice here.

Can you elaborate on this?
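
To put the two timings in perspective, here is a rough sketch that extrapolates them to the planned batch sizes. It assumes every PDF takes as long as this one (unlikely, since the test file is a full textbook), and the helper names `time_extractor` and `projected_hours` are made up for illustration.

```python
import time


def time_extractor(extract, pdf_path, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs.

    `extract` is any callable that takes a path, for example a wrapper
    around a pdfminer or PyPDF2 extraction routine (hypothetical here).
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        extract(pdf_path)
        best = min(best, time.perf_counter() - start)
    return best


def projected_hours(seconds_per_doc, n_docs):
    """Naive linear extrapolation: assumes every document costs the same."""
    return seconds_per_doc * n_docs / 3600


if __name__ == "__main__":
    # Per-document timings reported above for the single test PDF.
    for name, secs in [("pdfminer", 54.8), ("PyPDF2", 11.3)]:
        for n_docs in (1000, 10000):
            print(f"{name}: {n_docs} docs ~ {projected_hours(secs, n_docs):.1f} h")
    print(f"speed ratio: {54.8 / 11.3:.2f}x")
```

Under that assumption, 10,000 documents would take roughly 152 hours with pdfminer versus about 31 hours with PyPDF2, so the ratio of about 4.85 matters a lot at this scale.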

euske commented 4 years ago

That is because PDFMiner takes an extra step for each character. This is necessary to decode non-ASCII text, whose encoding is often arbitrary and is sometimes described within the PDF in the form of a CMap. If you can parse all your PDFs successfully with PyPDF2, that's fine. Overall, though, I'd say PDFMiner extracts text more accurately, although it's still not perfect.
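
As a toy illustration of why a per-character lookup can be needed (this is not PDFMiner's actual code, and the table `TOY_CMAP` is entirely made up): the bytes in a PDF text string are font-specific character codes, so decoding them with a fixed encoding can produce garbage unless each code is mapped through the font's own table.

```python
# Hypothetical CMap-style table for one embedded font:
# character code -> Unicode text. Real tables come from the PDF itself.
TOY_CMAP = {0x01: "C", 0x02: "a", 0x03: "f", 0x04: "\u00e9"}


def decode_with_cmap(codes: bytes, cmap: dict) -> str:
    """Map each character code through the table, one lookup per character.

    Codes missing from the table become U+FFFD (the replacement character).
    """
    return "".join(cmap.get(code, "\ufffd") for code in codes)


def decode_naive(codes: bytes) -> str:
    """Treat the codes as Latin-1 text, ignoring the font's table."""
    return codes.decode("latin-1")


raw = bytes([0x01, 0x02, 0x03, 0x04])
print(decode_with_cmap(raw, TOY_CMAP))  # Café
print(repr(decode_naive(raw)))          # '\x01\x02\x03\x04'
```

The lookup path costs one dictionary access per character, which is the kind of extra work described above; the naive path is cheaper but only correct when the font's codes happen to coincide with a standard encoding.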

TobiasJu commented 4 years ago

Thanks for clarifying. I will double-check the output of PyPDF2 and compare it to pdfminer's.
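
One possible way to do that comparison, sketched with only the standard library: normalize whitespace in both extracted texts and compute a similarity ratio with `difflib`. The sample strings below stand in for the text returned by each library for the same page; `normalize` and `similarity` are illustrative helper names.

```python
import difflib


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so layout differences are ignored."""
    return " ".join(text.split())


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Stand-ins for the text each library extracted from the same page.
sample_a = "Speech and  Language\nProcessing"
sample_b = "Speech and Language Processing"
print(similarity(sample_a, sample_b))  # 1.0

# Dropped non-ASCII characters lower the score.
print(similarity("na\u00efve caf\u00e9", "nave caf"))
```

Pages with a low ratio are the ones worth inspecting by hand, since that is where the two libraries disagree. `SequenceMatcher` can be slow on very long strings, so comparing page by page is likely more practical than comparing whole documents.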

MartinThoma commented 1 year ago

You might be curious about https://github.com/py-pdf/benchmarks :-)