camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
2.96k stars 466 forks source link

Add pypdfium2 rendering backend #384

Closed mara004 closed 1 month ago

KOLANICH commented 1 year ago

"pypdfium2>=4,<5",

< restrictions are considered harmful. Unfortunately pip devs have introduced them and are not going to drop their support or even to introduce a mode to ignore them, so I have even had to create a tool undoing such kind of sabotage from wheels.

mara004 commented 1 year ago

@KOLANICH Uh, you seem to share some unorthodox opinions on packaging (first https://github.com/lebedov/python-pdfbox/issues/10#issuecomment-1603116992, now this).

< bounds are not generally harmful; they are merely protective when the next major release brings API-breaking changes, to ensure the project remains installable/working by default. Bumping to a new major version should be a deliberate decision after review the code is still compatible. See also https://github.com/mindee/doctr/issues/947.

I don't currently have a mind of changing the APIs used in this case, but won't promise anything. If the camelot projects wants unsafe bounds in the hope that it will continue to work without interaction, then be it. But I don't encourage this and am not responsible for any breakage that might arise.

KOLANICH commented 1 year ago

remains installable by default.

It will cause version conflicts and maybe even downgrades.

I don't currently have a mind of changing the APIs used in this case, but won't promise anything.

Let's deal with it when the changes in API will actually break the package. By making this package compatible to both APIs of course. But intentionally breaking package (by specifying < specifier) prematurely is IMHO not a rational thing. I prefer to optimistically assumme that everything is compatible to any further version and treat the incompatibity to the latest available version as a grave bug if that version was released and as a serious bug if that version just landed into master.

am not responsible for any breakage that might arise.

camelot is not maintained by you, so it is not up to you to fix its bugs

mara004 commented 1 year ago

Hmm yes, these aspects need to be considered. Conflict handling is a problem indeed. On one hand, upper bounds are good to install the most compatible version if there's no conflict. On the other, it would be bad if pip aborted on conflict. I'm not 100% sure what it does - I think it downgrades, though.

Apart from that, one can still use virtual envs to tackle hard conflicts.

I think you may be right, in this particular case it could be better to remove the upper bounds, as camelot still has fallback logic here (however, the fallbacks aren't very reliable, since they depend on additional system programs). I'll await a maintainer's opinion before changing it, anyway.


remains installable by default.

It will cause version conflicts and maybe even downgrades.

To be more precise: "so the code remains working by default".

mara004 commented 1 year ago

CI currently fails at make install, which is unrelated to this PR's changes. Looks like #387 would fix this.

foarsitter commented 1 year ago

@mara004 by any change, do you have time to rebase this PR on the master branch? Quite a lot has changed in #353.

mara004 commented 1 year ago

I may be able to check. Supposedly it won't be a complicated rebase, as the changes from this PR are relatively small.

Side-note: Checkout of this repository took unusually long for me, even with --depth 1. IIRC this was much less so when I last visited.

mara004 commented 1 year ago

I merged in master and fixed the conflicts.

mara004 commented 1 year ago

@foarsitter Who is in charge of reviewing/merging PRs here? I'm not keen on having to do another rebase, frankly...

foarsitter commented 1 year ago

@mara004 the project has no lead.

Can you write some brief documentation about why pdfium is the go-to backend?

mara004 commented 1 year ago

@mara004 the project has no lead.

I meant, who has commit access and would be able to merge this PR?

Can you write some brief documentation about why pdfium is the go-to backend?

Where should I add that? Is a code comment in pdfium_backend.py sufficient?

How about something like this: pdfium appears liberal-licensed, stable and fast. pypdfium2 bridges pdfium to python using ctypes, regularly building standalone wheels for all major platforms. It does not require additional system packages on runtime.

Notes on possible alternatives:[^1] pdfbox and pdfjs are the only other liberal-licensed PDF rendering engines known to the author, but they are not properly bridged to python yet and would need additional runtime envs. Technically, pymupdf seems better than pypdfium2, but has a restrictive license (AGPL). poppler and ghostscript are slower than pdfium, under restrictive licenses and lack pre-built binaries, or need subprocess/pipe communication.

[^1]: Does not need to be added to the docs, but FYI

foarsitter commented 1 year ago

Thanks for helping me placing this PR in a broader perspective. A comment would be sufficient.

I will open an issue to discuss the three backends camelot currently offers and maybe we should drop poppler in favor of pdfium.

bosd commented 1 month ago

@mara004 Thanks for this very interesting pr.

As @foarsitter pointed out, this project has no lead. And this repo has been unmaintained https://github.com/camelot-dev/camelot/issues/343 for some while now. We try to build a maintained fork at pypdf_table_extraction.

Do you want to open the PR against that repo so that we can merge your improvement?

mara004 commented 1 month ago

Thanks for the update @bosd.

I'd be happy with this change landing in pypdf_table_extraction, but am not interested in doing the transfer personally and going through another PR process.

So, if the maintainers of pypdf_table_extraction want this patch, they may transfer it on their own. You can probably do something like

wget https://patch-diff.githubusercontent.com/raw/camelot-dev/camelot/pull/384.diff
git apply 384.diff

and then fix merge conflicts, if any.

mara004 commented 1 month ago

Given that camelot is unmaintained, I'll close this PR here, but you may still use the patch of course.