camelot.read_pdf is merging columns of the PDF Page

atlanhq / camelot

Camelot: PDF Table Extraction for Humans

https://camelot-py.readthedocs.io


camelot.read_pdf is merging columns of the PDF Page #255

Closed Prady96 closed 5 years ago

Prady96 commented 5 years ago

I am trying to read a page from a PDF and have tried all the options described in your documentation and on your website, but two of the columns are still getting merged. The input PDF is linked here: INPUT PDF PAGE

I have also added a link to the output CSV file: OUTPUT CSV PAGE

Please let us know whether Camelot is able to parse this PDF table into CSV.

Thank you for making this awesome library. Pradyum Gup

anakin87 commented 5 years ago

How was this PDF generated? Did you use OCRMyPDF?

Does Camelot generally work with OCRMyPDF generated files?

Prady96 commented 5 years ago

This was an encrypted PDF. We ran 'ocrmypdf' to create a mask layer on it and then parsed the result with Camelot. If required, should I post the original PDF as well?

Prady96 commented 5 years ago

Here is the link to the original PDF: ORIGINAL PDF. We were trying to parse page 4 of this PDF. Hope it helps.

vinayak-mehta commented 5 years ago

@Prady96 The table in this PDF doesn't have a very clean table structure. One of the larger lines that goes through the whole table doesn't touch the top-most line, and one of the header lines has a different x-coordinate from the larger one that goes through the whole table. I was able to correct this with line_scale=40 (see line_scale in the docs). Please close the issue if this solves your problem.
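To make the effect of line_scale concrete, here is a rough sketch of the arithmetic. It assumes that Lattice derives its minimum detectable line length by integer-dividing the page image's extent by line_scale, which matches the documented behaviour that larger values detect smaller lines. The page size below (US Letter rendered at 300 dpi) and the helper name `min_line_length_px` are assumptions for illustration, not values taken from this PDF or from Camelot's API.

```python
def min_line_length_px(image_extent_px: int, line_scale: int) -> int:
    """Approximate shortest line segment (in pixels) that would be detected.

    Hypothetical helper: assumes the threshold is extent // line_scale.
    """
    return image_extent_px // line_scale


# Assumed page image height: 11 in at 300 dpi.
PAGE_HEIGHT_PX = 3300

for scale in (15, 40):  # 15 is Camelot's default line_scale
    print(f"line_scale={scale}: vertical lines shorter than "
          f"~{min_line_length_px(PAGE_HEIGHT_PX, scale)} px are ignored")
```

Under these assumptions, raising line_scale from 15 to 40 lowers the threshold from about 220 px to about 82 px, which is why shorter, disconnected line segments start to be picked up. Very large values can cause text strokes to be detected as lines.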

Here's how I debugged it visually:

Plotted the table structure being detected with Lattice by default.

$ camelot lattice -plot grid page4_yourfile_OCRMYPDF.pdf

before

Plotted the table structure being detected with Lattice with line_scale=40, which gave the expected table structure.

$ camelot lattice -scale 40 -plot grid page4_yourfile_OCRMYPDF.pdf

after
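The same two steps can be sketched with the Python API instead of the CLI. This is a minimal sketch, assuming a Camelot version where `camelot.plot` returns a matplotlib figure; the function names `lattice_tables` and `save_grid_plot` and the output file names are made up for illustration.

```python
def lattice_tables(pdf_path, line_scale=40, pages="1"):
    """Run Lattice on the given pages with a custom line_scale."""
    # Imported here so the sketch loads even where Camelot isn't installed.
    import camelot

    return camelot.read_pdf(
        pdf_path, pages=pages, flavor="lattice", line_scale=line_scale
    )


def save_grid_plot(table, out_path):
    """Save the detected table grid as an image for visual debugging."""
    import camelot

    fig = camelot.plot(table, kind="grid")
    fig.savefig(out_path)
    return out_path
```

For the file in this thread, one would call `lattice_tables("page4_yourfile_OCRMYPDF.pdf")`, pass the first table to `save_grid_plot(tables[0], "grid_scale40.png")` to inspect the grid, and then write it out with `tables[0].to_csv("page4.csv")`.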

anakin87 commented 5 years ago

@vinayak-mehta : Have you ever tried to use OCRMyPDF? Does it work well with Camelot?

vinayak-mehta commented 5 years ago

No, I haven't tried it. Looking at the output @Prady96 posted, it seems that ocrmypdf isn't very accurate here: some decimal points are missing from numbers in the last two columns.

Prady96 commented 5 years ago

Thanks for your reply @vinayak-mehta

I will try this one.

Because the data inside the PDF was encrypted, I was not able to extract values from the table, so I used "ocrmypdf" just to extract the data points.

I tried all the visual debugging options, and what worked in the end was:

import camelot
# Parse with the Stream flavor and a large edge tolerance
tables = camelot.read_pdf('page4_yourfile_OCRMYPDF.pdf', flavor='stream', edge_tol=100)
camelot.plot(tables[0], kind='contour')

Is there any way to extract the data points without using 'ocrmypdf'? We are still not getting the right data from the table: not only are some numbers missing, but some letters are also changed.

vinayak-mehta commented 5 years ago

You can try Tesseract (open-source) or docparser.com (closed-source) which will extract the text from your image using OCR. I've seen docparser.com do OCR very nicely. You could also try Google Cloud Vision API.

If you find any good open-source OCR libraries or tools, then do comment on #101. Please close this issue if your original problem of merged columns was solved.

Prady96 commented 5 years ago

Thanks for your support!

This fixes #255