atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Replacing Ghostscript with PDFBox #346

Closed fakabbir closed 5 years ago

fakabbir commented 5 years ago

Replacing Ghostscript with alternative opensource package #342

PDFBox could serve as an alternative to Ghostscript. It can already be used from Python through the python-pdfbox wrapper, which recently added support for converting PDFs to images.

Using python-pdfbox, a PDF can be converted to images sequentially.
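A minimal sketch of what that conversion could look like. It assumes the `PDFBox().pdf_to_images(path)` call described in the python-pdfbox README; the helper name `pdf_to_images_with_pdfbox` is hypothetical and not part of camelot. Running the conversion itself requires python-pdfbox and a Java runtime to be installed.

```python
from pathlib import Path


def pdf_to_images_with_pdfbox(pdf_path):
    """Render the pages of a PDF to image files via python-pdfbox (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        # Fail early with a clear error instead of a Java-side stack trace.
        raise FileNotFoundError(pdf_path)

    # Imported lazily so this module still loads when python-pdfbox is absent.
    import pdfbox

    # Assumption: pdf_to_images() takes the input path and writes one image
    # per page next to the input file, as described in the python-pdfbox README.
    pdfbox.PDFBox().pdf_to_images(str(pdf_path))
```

Keeping the `import pdfbox` inside the function means camelot would only need the package when this code path is actually used.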

Also, what about letting users choose between the two libraries, in case they have already purchased a Ghostscript license?
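One way the choice could be exposed is a small backend switch, sketched here. Everything in it is an assumption for illustration: the names `pick_backend` and `build_render_command`, the default jar file name, and the PDFBox 2.x `PDFToImage` command-line options (`-format`, `-dpi`, `-outputPrefix`). None of this is camelot's actual API.

```python
import shutil

# Executable each backend needs on PATH. On Windows, Ghostscript is usually
# installed as gswin64c or gswin32c rather than gs.
BACKEND_EXECUTABLES = {"ghostscript": "gs", "pdfbox": "java"}


def pick_backend(preferred=None, which=shutil.which):
    """Return a usable backend name, honouring the user's preference if given."""
    if preferred is not None:
        if preferred not in BACKEND_EXECUTABLES:
            raise ValueError(f"unknown backend: {preferred!r}")
        exe = BACKEND_EXECUTABLES[preferred]
        if which(exe) is None:
            raise RuntimeError(f"{preferred} requested but {exe!r} was not found")
        return preferred
    # No preference: take the first backend whose executable is available.
    for name, exe in BACKEND_EXECUTABLES.items():
        if which(exe) is not None:
            return name
    raise RuntimeError("neither Ghostscript nor Java was found on PATH")


def build_render_command(backend, pdf_path, out_prefix, dpi=300,
                         pdfbox_jar="pdfbox-app.jar"):
    """Build the argv list that renders every page of pdf_path to PNG files."""
    if backend == "ghostscript":
        # -o sets the output pattern and implies batch, no-pause mode.
        return ["gs", "-q", "-sDEVICE=png16m", f"-r{dpi}",
                "-o", f"{out_prefix}%d.png", pdf_path]
    if backend == "pdfbox":
        # Assumed PDFBox 2.x command-line options.
        return ["java", "-jar", pdfbox_jar, "PDFToImage", "-format", "png",
                "-dpi", str(dpi), "-outputPrefix", out_prefix, pdf_path]
    raise ValueError(f"unknown backend: {backend!r}")
```

A caller would pass the resulting list to `subprocess.run(cmd, check=True)`. The `which` parameter is injectable so the selection logic can be exercised without either tool installed.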

vinayak-mehta commented 5 years ago

What advantages does pdfbox have over ghostscript?

fakabbir commented 5 years ago

PDFBox would bring camelot closer to the MIT license. Ghostscript is available under either the AGPL or a commercial license. At present, anyone who wants to use camelot has to download and install Ghostscript separately, which may or may not be feasible in certain cases.

If we shift to PDFBox, which is an Apache-licensed package, users would not have to install dependencies separately: running pip install would fetch all of them.

Also, are you sure that using an AGPL-licensed package the way you did lets camelot stay under MIT rather than AGPL? As I understand it, using an AGPL package means you have to distribute your code under the AGPL as well.

PS: The concern with ghostscript is

vinayak-mehta commented 5 years ago

I understand the part about licensing; we want to remove ghostscript altogether. https://github.com/camelot-dev/camelot/issues/13

I just went through python-pdfbox. It automatically downloads and caches the pdfbox jar file, which should make installation easier for users, as installing ghostscript has been a pain on Windows. But then again, users would need Java to use the library. It's an interesting tradeoff, and we should definitely discuss it here: https://github.com/camelot-dev/camelot/issues/27.

Can you please raise this PR here so that we can see if the tests pass? You'll also have to edit setup.py to install python-pdfbox.