Prady96 closed this issue 5 years ago
How was this PDF generated? Did you use OCRmyPDF?
Does Camelot generally work with OCRmyPDF-generated files?
This was an encrypted PDF. We ran 'ocrmypdf' to add an OCR layer to it and then parsed the result with Camelot. If you need it, should I also post the original PDF?
Here is the link to the original PDF: ORIGINAL PDF. We were trying to parse page 4 of this PDF. Hope it helps.
@Prady96 The table in this PDF doesn't have a very clean structure. One of the long lines that runs through the whole table doesn't touch the top-most line, and one of the header lines has a different x-coordinate from that long line. I was able to correct this with line_scale=40 (see line_scale in the docs). Please close the issue if this solved your problem.
Here's how I debugged it visually:
$ camelot lattice -plot grid page4_yourfile_OCRMYPDF.pdf
I then set line_scale=40, which gave the expected table structure.
$ camelot lattice -scale 40 -plot grid page4_yourfile_OCRMYPDF.pdf
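For anyone using the Python API instead of the CLI, the same line_scale=40 setting can be sketched roughly as follows. This is only a sketch: it assumes Camelot is installed, and `lattice_options` and `read_lattice_tables` are hypothetical helper names made up for illustration, not part of Camelot.

```python
def lattice_options(line_scale=40):
    """Build keyword arguments for camelot.read_pdf in lattice mode.

    A larger line_scale lets Camelot detect shorter line segments, which
    can help when table lines do not quite touch each other.
    """
    return {"flavor": "lattice", "line_scale": line_scale}


def read_lattice_tables(pdf_path, line_scale=40):
    """Parse pdf_path with the lattice flavor and the given line_scale."""
    # Imported lazily so lattice_options() can be used without Camelot installed.
    import camelot

    return camelot.read_pdf(pdf_path, **lattice_options(line_scale))
```

Calling read_lattice_tables('page4_yourfile_OCRMYPDF.pdf') and then camelot.plot(tables[0], kind='grid') on the result should roughly correspond to the second CLI command, though I have not checked it against this exact file.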
@vinayak-mehta: Have you ever tried OCRmyPDF? Does it work well with Camelot?
No, I haven't tried it. Looking at the output @Prady96 posted, it seems ocrmypdf isn't very accurate here. I can see some decimal points missing from the numbers in the last two columns.
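One rough way to spot cells that may have lost a decimal point is to flag integer-looking values in a column that otherwise holds decimals. The snippet below is a heuristic sketch only: `suspicious_cells` is a hypothetical helper, and the sample values are made up rather than taken from the PDF in this issue.

```python
import re

# Plain integers such as "1830" (optional sign, optional thousands commas).
_INTEGER = re.compile(r"^[+-]?\d[\d,]*$")
# Decimals such as "12.50".
_DECIMAL = re.compile(r"^[+-]?\d[\d,]*\.\d+$")


def suspicious_cells(column):
    """Return indices of integer-looking cells in a mostly-decimal column."""
    cells = [cell.strip() for cell in column]
    decimals = [i for i, cell in enumerate(cells) if _DECIMAL.match(cell)]
    integers = [i for i, cell in enumerate(cells) if _INTEGER.match(cell)]
    # Only flag integers when decimals dominate the column; otherwise the
    # column is probably meant to hold whole numbers.
    if len(decimals) > len(integers):
        return integers
    return []


values = ["12.50", "7.25", "1830", "0.75"]  # made-up sample column
print(suspicious_cells(values))
```

The flagged cells still need to be checked by eye against the PDF, since a whole number in a decimal column is not necessarily an OCR error.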
Thanks for your reply @vinayak-mehta
I will try this one.
Since the data inside the PDF was encrypted, I could not extract values from the table, so I used "ocrmypdf" just to extract the data points.
I tried all the visual debugging options, and what worked in the end was:
import camelot
tables = camelot.read_pdf('page4_yourfile_OCRMYPDF.pdf', flavor='stream', edge_tol=100)
camelot.plot(tables[0], kind='contour')
Is there any way to extract the data points without using 'ocrmypdf'? We are still not getting the right data from the table: not only are some numbers missed, but some letters are changed as well.
You can try Tesseract (open-source) or docparser.com (closed-source) which will extract the text from your image using OCR. I've seen docparser.com do OCR very nicely. You could also try Google Cloud Vision API.
If you find any good open-source OCR libraries or tools, then do comment on #101. Please close this issue if your original problem of merged columns was solved.
Thanks for your support!
This fixes #255
I am trying to read a page from a PDF. I have tried all the options described in your documentation and on your website, but two columns are still getting merged. Here is the link to the input PDF: INPUT PDF PAGE
and here is the link to the output CSV file: OUTPUT CSV PAGE
Please let us know whether Camelot can parse this PDF table into CSV.
Thank you for making this awesome library.
Pradyum Gupta