Closed TobiasJu closed 4 years ago
Because PDFMiner takes an extra step for each character. This is necessary to decode non-ASCII text, whose encoding is often arbitrary and is sometimes described within the PDF in the form of a CMap. If you can parse all your PDFs successfully with PyPDF2, that's fine. But overall I'd say PDFMiner extracts text with higher accuracy, although it's still not perfect.
Thanks for clarifying, I will double-check the output of PyPDF2 and compare it to pdfminer.
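For that comparison, a small stdlib helper can quantify how similar the two extractions are. This is a sketch, not part of either library: the function only compares two strings, and the commented usage (file name, call pattern) is an assumption about how you would feed it text from pdfminer and PyPDF2.

```python
import difflib

def compare_extractions(text_a: str, text_b: str) -> float:
    """Return a 0..1 similarity ratio between two extracted texts.

    Whitespace is normalized first, since the two libraries often
    differ only in line breaking and spacing, not in actual content.
    """
    norm_a = " ".join(text_a.split())
    norm_b = " ".join(text_b.split())
    return difflib.SequenceMatcher(None, norm_a, norm_b).ratio()

# Hypothetical usage -- assumes both libraries are installed and
# "document.pdf" is the file under test:
#   from pdfminer.high_level import extract_text
#   from PyPDF2 import PdfFileReader
#   text_miner = extract_text("document.pdf")
#   reader = PdfFileReader(open("document.pdf", "rb"))
#   text_pypdf = "".join(reader.getPage(i).extractText()
#                        for i in range(reader.getNumPages()))
#   print(compare_extractions(text_miner, text_pypdf))
```

A ratio near 1.0 means the two outputs agree apart from whitespace; a noticeably lower score is a hint to inspect the pages where PyPDF2's decoding may have gone wrong.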
You might be curious about https://github.com/py-pdf/benchmarks :-)
So I used the pdfminer lib and it's functional, but sadly there is one big problem which makes this lib completely unusable for me: it is too slow. I'll give you an example, using the script from http://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/ on this free PDF: https://web.stanford.edu/~jurafsky/slp3/edbook_oct162019.pdf
This script takes about 54.8 s to parse one document, while the same implementation with PyPDF2 takes just 11.3 s.
I am planning to parse 1,000 to 10,000 PDFs, and PyPDF2 seems to be about 5 times faster, so it's the obvious choice here.
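For what it's worth, a timing comparison like the one above can be reproduced with a tiny harness. This is a sketch: only the timing helper is real code here; the commented calls and the file name are assumptions about which extraction functions you plug in.

```python
import time
from typing import Callable

def time_extraction(extract: Callable[[str], str], path: str,
                    repeats: int = 1) -> float:
    """Return the average wall-clock seconds `extract` needs for `path`."""
    start = time.perf_counter()
    for _ in range(repeats):
        extract(path)
    return (time.perf_counter() - start) / repeats

# Hypothetical usage (assumes pdfminer.six is installed and
# "document.pdf" is the file being benchmarked):
#   from pdfminer.high_level import extract_text
#   print(time_extraction(extract_text, "document.pdf"))
```

Averaging over a few repeats smooths out disk-cache effects, which matter when the first run has to read the PDF from disk and later runs do not.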
Can you elaborate on this?