camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License

class ConversionBackend(object) not working as expected #381

Open unnikrishnancs opened 1 year ago

unnikrishnancs commented 1 year ago

Hi,

Since I was having problems with Ghostscript and Poppler, I wanted to try the alternate PDF-to-image conversion backend, but it is not working as expected and I am not sure where I am going wrong. I tried the code below and it gives me an OpenCV error:

cv2.error: OpenCV(4.7.0) /io/opencv/modules/imgproc/src/color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cvtColor'

It appears to be converting the PDF to PNG and gives me a path (e.g. png_path=/tmp/tmpop73i0dk/page-1.png). Can someone please help? The code is below.

import camelot as cl

class ConversionBackend(object):
    def convert(self, pdf_path, png_path):
        pass

pdf_path = "/home/user/Downloads/foo.pdf"

obj = ConversionBackend()
tbl = cl.read_pdf(pdf_path, backend=obj)
print("tbl.n=", tbl.n)

0x006E commented 1 year ago

The ConversionBackend class should implement a convert method that converts the PDF at pdf_path to a PNG and saves it at png_path.

You must implement that logic in the convert method yourself; the sample provided in the docs does nothing, it is just a skeleton.
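
The OpenCV assertion in the original report is what you would expect from a convert that writes nothing: as far as I can tell, Camelot reads the file at png_path with OpenCV, and cv2.imread returns None instead of raising when the file is missing, so the failure only surfaces later in cvtColor. Below is a minimal sketch of a guard that fails earlier with a clearer message. CheckedBackend is a hypothetical helper name used for illustration, not part of Camelot.

```python
import os


class CheckedBackend:
    """Hypothetical wrapper: delegates to another backend, then verifies
    that the PNG was actually written before Camelot tries to read it."""

    def __init__(self, inner):
        # inner is any object with a convert(pdf_path, png_path) method
        self.inner = inner

    def convert(self, pdf_path, png_path):
        self.inner.convert(pdf_path, png_path)
        # A missing or empty file is what later triggers the cvtColor assertion
        if not os.path.isfile(png_path) or os.path.getsize(png_path) == 0:
            raise RuntimeError(
                f"convert() did not write an image to {png_path!r}"
            )
```

Wrapping the skeleton backend this way, e.g. camelot.read_pdf(pdf_path, backend=CheckedBackend(ConversionBackend())), should raise the RuntimeError above rather than the OpenCV error.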

To implement this, I used PyMuPDF to render the PDF page to a PNG, like this:

import fitz

class ConversionBackend(object):
    def convert(self, pdf_path, png_path):
        # Open the PDF file
        doc = fitz.open(pdf_path)

        # Load the page; since a single-page PDF is provided, it is at index 0
        page = doc.load_page(0)

        # Convert PDF page to image
        pix = page.get_pixmap()

        # Write image to PNG file
        pix.save(png_path)

        # Close the PDF file
        doc.close()

Now you can pass this to Camelot with:

camelot.read_pdf(pdf_path, backend=ConversionBackend())

To use this, you must first install PyMuPDF with pip (pip install pymupdf).
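
One thing to watch with the PyMuPDF approach: get_pixmap() with no arguments renders at 72 dpi, which may be too coarse for Lattice line detection on some documents. The sketch below is a variant that scales the page up through a zoom matrix. The class name PyMuPDFBackend and the default of 300 dpi are assumptions chosen for illustration, not something Camelot requires.

```python
class PyMuPDFBackend:
    """Render the first page of a PDF to PNG at a configurable resolution."""

    def __init__(self, dpi=300):
        # 300 is an assumed default; tune it for your documents
        self.dpi = dpi

    def convert(self, pdf_path, png_path):
        # Imported here so the class can be defined without PyMuPDF installed
        import fitz

        # PDF user space is 72 units per inch, so dpi / 72 is the zoom factor
        zoom = self.dpi / 72
        doc = fitz.open(pdf_path)
        try:
            page = doc.load_page(0)
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            pix.save(png_path)
        finally:
            # Close the document even if rendering fails
            doc.close()
```

It is passed the same way as before: camelot.read_pdf(pdf_path, backend=PyMuPDFBackend(dpi=300)).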

unnikrishnancs commented 1 year ago

Thanks for replying. I will follow the steps mentioned.

SWHL commented 1 year ago

Another solution:

import camelot
from pdf2image import convert_from_path

class ConversionBackend:
    def convert(self, pdf_path, png_path):
        img_list = convert_from_path(pdf_path)
        img_list[0].save(png_path)

pdf_path = "1.pdf"
tables = camelot.read_pdf(
    pdf_path,
    pages='1',
    flavor="lattice",
    line_scale=40,
    backend=ConversionBackend()
)

fig = camelot.plot(tables[0], kind='contour')
fig.savefig('res.png')

unnikrishnancs commented 1 year ago

Thank you. I will try it.

Regards
