camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
2.91k stars 462 forks

Is there any plan to remove dependency of PyPDF2? #215

Open kiyo-matsu opened 3 years ago

kiyo-matsu commented 3 years ago

Thank you for providing a very powerful and useful library. I want to extract tables from various PDFs, but often, when I read a PDF with the camelot.read_pdf function, it raises an error from PyPDF2 like the ones below.

This library seems to use two PDF libraries, PyPDF2 and pdfminer.six, and the main functionality for extracting text from PDFs seems to depend on pdfminer.six. I think this library could work without PyPDF2, considering that PyPDF2 has not been maintained since 2018.

Regards.

kiyo-matsu commented 3 years ago

I reimplemented camelot using PyMuPDF instead of PyPDF2, and almost all of my PDFs are now readable and parsable!

https://github.com/kiyo-matsu/camelot/tree/use-pymupdf

Arnie97 commented 3 years ago

Thanks for your great contribution! I also ran into the PyPDF2 problems occasionally, and your fork fixed them for me. The code needs to be adjusted slightly to work with PyMuPDF v1.18.5, though:

diff --git i/camelot/handlers.py w/camelot/handlers.py
--- i/camelot/handlers.py
+++ w/camelot/handlers.py
@@ -114,7 +114,7 @@ class PDFHandler(object):
             outfile = fitz.open()
             outpage = outfile.newPage(-1, width=p.rect.width,
                                       height=p.rect.height)
-            outpage.showPDFpage(outpage.rect, infile, page - 1)
+            outpage.showPDFpage(outpage.rect, infile, pno=page-1)
             outfile.save(fpath)

             layout, dim = get_page_layout(fpath)

According to the PyMuPDF docs,

The major and minor versions of PyMuPDF and MuPDF will always be the same. Only the third qualifier (patch level) may deviate from that of MuPDF.

A strict == pin is probably needed in setup.py, since PyMuPDF does not conform to the semantic versioning scheme and introduced breaking changes between v1.18.4 and v1.18.5...
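To make the pinning suggestion concrete, here is a minimal sketch of an exact pin plus a runtime check. It is not taken from the fork: INSTALL_REQUIRES, parse_version and matches_pin are hypothetical names used only for illustration, and the simple parser assumes purely numeric version strings such as 1.18.5.

```python
# Hypothetical setup.py excerpt: "==" accepts exactly one release,
# whereas ">=" or "~=" would also accept later patch releases.
INSTALL_REQUIRES = ["PyMuPDF==1.18.5"]

PINNED_PYMUPDF = "1.18.5"


def parse_version(text):
    """Turn '1.18.5' into (1, 18, 5) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def matches_pin(installed, pinned=PINNED_PYMUPDF):
    """Return True only when the installed version equals the pin exactly."""
    return parse_version(installed) == parse_version(pinned)
```

With this check, both the neighbouring 1.18.4 and any later patch release are rejected, which is the behaviour a strict pin is meant to give when patch releases can contain breaking changes.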

PackElend commented 3 years ago

@kiyo-matsu could you share how you replaced it? I need to write at least a script that converts all my bank statements to CSV for import into Firefly III (https://www.firefly-iii.org/). I am currently testing some PDF table extraction libraries, and Camelot is producing quite good results.

Perhaps the replacement could be applied continuously in CI, as I am afraid that changes here will take some time. If that project really works out well, I would prefer to rely on maintained libraries. Using PyMuPDF would be great, as I already use it to find text locations with https://pymupdf.readthedocs.io/en/latest/page.html#Page.searchFor

kiyo-matsu commented 3 years ago

@Arnie97 Thank you for your review. I'm glad it solved your problems.

A strict == pin is probably needed in setup.py, since PyMuPDF does not conform to the semantic versioning scheme and introduced breaking changes between v1.18.4 and v1.18.5...

I pushed a version to the branch that pins PyMuPDF to v1.18.5.

@PackElend

This is my branch that replaces PyPDF2 with PyMuPDF: https://github.com/kiyo-matsu/camelot/tree/use-pymupdf

Perhaps the replacement could be applied continuously in CI, as I am afraid that changes here will take some time.

I ran CI locally, and all tests passed except the one below. That test checks that a message from PyPDF2 can be hidden with the -q option, but PyMuPDF does not show any message.

tests/test_cli.py::test_cli_quiet FAILED

PackElend commented 3 years ago

If I'm correct, you only made changes in

The second is only minor and ensures that the right package is installed. If I understand the code correctly, it installs all necessary packages by itself if run on Linux, doesn't it?

The changes in handlers.py are quite simple and could be handled by a function, couldn't they? That would allow using both libraries, depending on the start-up parameters.

If I'm not mistaken, only https://github.com/camelot-dev/camelot/blob/master/camelot/cli.py needs an update to accept a flag that allows choosing the library. That would ensure backward compatibility with existing projects that use Camelot. It could be submitted as a PR, which would be most beneficial.
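A rough sketch of what such a switch might look like, assuming a backend name is passed in at start-up. Nothing here exists in Camelot: load_backend, build_parser and the --backend flag are hypothetical, and argparse is used only to keep the sketch within the standard library, so the real cli.py would need its own wiring. The only facts relied on are the import names of the two libraries, PyPDF2 and fitz (PyMuPDF).

```python
import argparse
import importlib

# Import names of the two libraries discussed in this thread.
BACKENDS = {"pypdf2": "PyPDF2", "pymupdf": "fitz"}


def load_backend(name="pypdf2"):
    """Import and return the module for the requested backend (hypothetical helper)."""
    key = name.lower()
    if key not in BACKENDS:
        raise ValueError(f"unknown backend {name!r}; choose from {sorted(BACKENDS)}")
    # Imported lazily, so only the selected library has to be installed.
    return importlib.import_module(BACKENDS[key])


def build_parser():
    """Build a parser with a hypothetical --backend flag; PyPDF2 stays the default."""
    parser = argparse.ArgumentParser(prog="camelot-sketch")
    parser.add_argument("--backend", choices=sorted(BACKENDS), default="pypdf2")
    return parser
```

Keeping PyPDF2 as the default means existing callers see no change, while the lazy import avoids requiring both libraries to be installed at once.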

I ran CI locally

How do you ensure that your code is not overwritten? Do you replace the lines in a hardcoded manner?

PyMuPDF does not show any message.

I reckon the reason PyMuPDF doesn't output anything to stdout is https://github.com/pymupdf/PyMuPDF/issues/209

MartinThoma commented 2 years ago

PyPDF2 is maintained again since April 2022. I'm the new maintainer. Since then, we have fixed a lot of things. I'm currently downloading 800,000 PDF files from Tika's test dataset to ensure we can parse them.

Technical nitpick: Most of those issues are actually not bugs in PyPDF2, but robustness issues. The files don't conform to the PDF standard. PyPDF2 still tries to support them, but not following the standard makes it more difficult.

NotImplementedError: only algorithm code 1 and 2 are supported. This PDF uses code 5

See https://github.com/py-pdf/PyPDF2/pull/749 - was merged :tada:

PyPDF2.utils.PdfReadError: file has not been decrypted

That might be https://github.com/py-pdf/PyPDF2/issues/416 - I was not able to reproduce the issue. Do you have a PDF / sample code to help me reproduce it?

RecursionError: maximum recursion depth exceeded while calling a Python object

That might be https://github.com/py-pdf/PyPDF2/issues/520 - again, I cannot reproduce it. If you have a PDF / code to show it, please let me know :-)