camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
2.86k stars 452 forks

Great library, but dependencies ??!! #49

Open akshowhini opened 4 years ago

akshowhini commented 4 years ago

Note: This is not an issue as such, but there is no better place to discuss it.

The stats below are pulled from PyPI downloads. Despite Camelot being a better tool than the others, what do you think explains its lower usage?

[Image: PyPI download statistics]

vinayak-mehta commented 4 years ago

Yep, this is a known issue. We need to figure out a way to replace ghostscript and opencv. https://github.com/camelot-dev/camelot/issues/13

Camelot uses only a small subset of code from ghostscript [1] (converting PDF to PNG) and opencv [2] (adaptive thresholding and morphological transformations). The only way I can think of is to re-implement these in Python and have them inside Camelot itself. [2] should be straightforward. Do you have any ideas around [1]? Or any other pointers?

Ghostscript is written in C. I tried looking around in the huge codebase but was totally lost. I'm planning to look into this again by allotting time over the next month; currently the day job takes up a lot of my time. Any pointers would really help!
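To give a sense of why [2] might be straightforward, here is a minimal pure-Python sketch of mean adaptive thresholding. It is an illustration only, not Camelot's or OpenCV's code: the function name `adaptive_threshold` and its parameters are made up for this example, and it clips neighbourhoods at the image border, whereas OpenCV replicates border pixels, so results near the edges can differ.

```python
def adaptive_threshold(gray, block_size=3, c=0):
    """Mean adaptive threshold on a 2D list of 0-255 integers.

    A pixel becomes 255 if it is strictly brighter than the mean of its
    block_size x block_size neighbourhood minus c, otherwise 0.
    Neighbourhoods are clipped at the image border (an assumption of this
    sketch; OpenCV replicates the border instead).
    """
    if block_size < 1 or block_size % 2 == 0:
        raise ValueError("block_size must be a positive odd integer")

    height, width = len(gray), len(gray[0])

    # Integral image with an extra zero row and column, so any window sum
    # can be read off with four lookups.
    integral = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    radius = block_size // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        for x in range(width):
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            count = (y1 - y0) * (x1 - x0)
            # Equivalent to gray[y][x] > total / count - c, kept in integers.
            if gray[y][x] * count > total - c * count:
                out[y][x] = 255
    return out
```

Morphological operations (erosion and dilation) could follow the same sliding-window pattern, taking the minimum or maximum of each neighbourhood instead of the mean.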

vinayak-mehta commented 4 years ago

@jnothman Do you have any pointers around this?

satheeshkatipomu commented 4 years ago

@vinayak-mehta, have you tried pdftoppm (poppler-utils) for converting PDF to PNG?
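For reference, a rough sketch of how pdftoppm could be called from Python. The helper names `pdftoppm_command` and `pdf_to_png` are hypothetical and not part of Camelot; the sketch assumes the `pdftoppm` executable from poppler-utils is on the PATH and uses its `-png`, `-r`, `-f` and `-l` options.

```python
import subprocess


def pdftoppm_command(pdf_path, out_prefix, dpi=300, page=None):
    """Build the pdftoppm argument list for rendering a PDF to PNG.

    Hypothetical helper for illustration. If `page` is given, only that
    page is rendered (-f and -l set the first and last page).
    """
    cmd = ["pdftoppm", "-png", "-r", str(dpi)]
    if page is not None:
        cmd += ["-f", str(page), "-l", str(page)]
    cmd += [pdf_path, out_prefix]
    return cmd


def pdf_to_png(pdf_path, out_prefix, dpi=300, page=None):
    """Run pdftoppm; raises CalledProcessError if the conversion fails."""
    subprocess.run(pdftoppm_command(pdf_path, out_prefix, dpi, page), check=True)
```

Calling `pdf_to_png("input.pdf", "page", dpi=300, page=1)` would write a PNG whose name starts with the given prefix, with the page number appended by pdftoppm.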

vinayak-mehta commented 4 years ago

Yep, I tried it along with imagemagick before landing on ghostscript since the last one gave the best results in terms of image quality.

satheeshkatipomu commented 4 years ago

OK. In this post, there was one more suggestion: to do this with MuPDF.

jnothman commented 4 years ago

Hey @vinayak-mehta, not sure where I can help here! The change to pdfbox in #30 has been implemented too, but you need to confirm what kinds of discrepancies are acceptable between backends.

akshowhini commented 4 years ago

@vinayak-mehta I did not dig into the dependencies and the library much. Based on those numbers, and hoping to help by reducing dependencies and offering a Pro service (to extract tables from images and scanned PDFs) for camelot devs, I worked for https://extracttable.com to develop CamelotPro (taken down because of a naming conflict).

If you think the service as an add-on helps the regular camelot users like me, I would be happy to talk with my team to merge CamelotPro with the open sourced lib

dimitern commented 4 years ago

@akshowhini Camelot is already open source and MIT licensed. It looks like your CamelotPro uses some of the sources from Camelot (I guess to make it compatible), and is GPL 3.0 licensed. It would be nice to mention the original authors somewhere as well.

The SaaS backend you're using is proprietary licensed, but I suppose it also uses Camelot.

I'd be interested to know more about the CamelotPro flavour, for example - does it autodetect whether it should use lattice or stream ?

vinayak-mehta commented 4 years ago

If you think the service as an add-on helps the regular camelot users like me, I would be happy to talk with my team to merge CamelotPro with the open sourced lib

I think it would make sense to include CamelotPro in Camelot if the former's code is open-sourced (assuming it works well on the current tests and more image-based tests).

I'd be interested to know more about the CamelotPro flavour, for example - does it autodetect whether it should use lattice or stream ?

I'd be interested in learning about the internal workings too!

I tried running the example you've provided in the README but it fails with a KeyError: https://github.com/ExtractTable/camelotpro/issues/1

akshowhini commented 4 years ago

@dimitern

Re: credits: I wonder how in the world I missed that. Thanks to all of you for the contributions. I've updated the README as well.

Re: flavor recognition: No, the AI model does not care about lattice or stream. All it was trained to do is detect the tabular structure; consider it a replacement for Nurminen's algorithm.

akshowhini commented 4 years ago

I think it would make sense to include CamelotPro in Camelot if the former's code is open-sourced (assuming it works well on the current tests and more image-based tests).

I do not think camelotpro would pass the tests of camelot-py. The main problem it is trying to tackle is extracting tabular structure and characters from images and scanned PDFs.

luke4u commented 3 years ago

converting PDF to PNG

Hi @vinayak-mehta, have you heard about pdf2image? It is much less painful than Ghostscript. I don't recommend MuPDF, as it is licensed under the AGPL, which is not good for commercial usage.

vinayak-mehta commented 3 years ago

@luke4u Yes, I've heard about it. I was initially reluctant to replace ghostscript with pdf2image, as users on all platforms would have to install it separately too. Doing this replacement would be difficult, as it would break installations for old users when they suddenly need poppler-utils instead of ghostscript after they upgrade their camelot version. I guess it could be done in a backwards-compatible manner, where ghostscript and pdf2image are different backends that camelot can use based on availability.

I'll try to move fast on my goal of not using an external pdf -> png conversion tool as a dependency altogether, and make the library self-contained.
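A rough sketch of what availability-based backend selection could look like. This is not Camelot's actual implementation: the function names, the backend labels, and the list of Ghostscript executable names checked here are assumptions made for illustration.

```python
import importlib.util
import shutil


def available_backends():
    """Return the names of the conversion backends that appear usable."""
    found = []
    # Assumption: Ghostscript is exposed as "gs" on Linux/macOS and as
    # "gswin64c" or "gswin32c" on Windows.
    if any(shutil.which(name) for name in ("gs", "gswin64c", "gswin32c")):
        found.append("ghostscript")
    # pdf2image is a Python wrapper that needs poppler's pdftoppm binary.
    if importlib.util.find_spec("pdf2image") is not None and shutil.which("pdftoppm"):
        found.append("pdf2image")
    return found


def choose_backend(preferred=("ghostscript", "pdf2image"), available=None):
    """Pick the first preferred backend that is available.

    Keeping ghostscript first preserves the behaviour existing
    installations already rely on.
    """
    if available is None:
        available = available_backends()
    for name in preferred:
        if name in available:
            return name
    raise RuntimeError(
        "No PDF to PNG backend found; install ghostscript or pdf2image with poppler-utils"
    )
```

With this ordering, users who already have ghostscript installed would see no change after upgrading, while new users could install either tool.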

luke4u commented 3 years ago

Thank you, @vinayak-mehta. Looking forward to your progress!

HamedMP commented 7 months ago

Trying to install pdftopng and getting this error. Worth using more stable dependencies:

  • Installing pdftopng (0.2.3): Failed

  RuntimeError

  Unable to find installation candidates for pdftopng (0.2.3)

  at ~/Library/Application Support/pypoetry/venv/lib/python3.11/site-packages/poetry/installation/chooser.py:73 in choose_for
       69│
       70│             links.append(link)
       71│
       72│         if not links:
    →  73│             raise RuntimeError(f"Unable to find installation candidates for {package}")
       74│
       75│         # Get the best link
       76│         chosen = max(links, key=lambda link: self._sort_key(package, link))

agamm commented 6 months ago

I have the same error with poetry add "camelot-py[base]"

sfrancis19 commented 2 months ago

Same error here. Did you ever find a resolution to this? @HamedMP @agamm

HamedMP commented 1 month ago

Unfortunately, I don't fully remember how I resolved it, but I was doing something wrong that I didn't need to do. It's been a while, so I can't remember exactly what it was; it might have been something to do with using the right env/command to install (it was a stupid mistake). After that it only gave warnings, and I could use the library.

Sorry I can't be of more help :/