camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
3.01k stars 473 forks

Add OCR support #14

Open vinayak-mehta opened 5 years ago

vinayak-mehta commented 5 years ago

An experimental version exists prior to commit 9753889. It uses Tesseract (via pyocr). ocropy looked promising the last time I checked. I'm opening this issue for discussion and experiments around OCR.

belisards commented 4 years ago

Hi, is there any update on OCR support?

vinayak-mehta commented 4 years ago

I hope to do an experiment soon with https://github.com/JaidedAI/EasyOCR.

suyashb95 commented 4 years ago

You could check out OCRmyPDF. Apart from performing OCR, it can deskew/dewarp images (using leptonica). I've used it myself and the results are pretty good, but I don't know how it performs against EasyOCR. OCRmyPDF does have a dependency on Ghostscript, though.

vinayak-mehta commented 4 years ago

I was able to get nice results on some images with EasyOCR (https://vinayak.io/2020/09/20/day-29-easyocr-dabblements/). I might try working on a PR to integrate it with the code I mentioned in the first comment on this issue.

javiqm12 commented 3 years ago

If camelot can offer an entry function that receives a list of words with their bounding box coordinates, it will facilitate the integration of any OCR tool that delivers this information, such as Tesseract, EasyOCR, and others.

When pdfminer parses an OCR PDF, such as one produced with OCRmyPDF, it frequently merges columns, even when the column cells are visibly far apart in the OCR PDF.
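To make the suggestion above concrete, here is a minimal sketch of what such an entry function could look like. Nothing here is part of camelot: `Word`, `words_from_detections`, and `table_from_words` are hypothetical names, and the grouping logic (rows by vertical centre, columns by horizontal gaps) is a simple stand-in, not camelot's own algorithm. The input shape assumes EasyOCR-style detections, i.e. tuples of four corner points, text, and confidence.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumed EasyOCR-style detection: (four [x, y] corner points, text, confidence).
Detection = Tuple[Sequence[Sequence[float]], str, float]


@dataclass
class Word:
    """A word with an axis-aligned bounding box (origin at the top-left)."""
    text: str
    x0: float
    top: float
    x1: float
    bottom: float


def words_from_detections(detections: Sequence[Detection]) -> List[Word]:
    """Collapse each four-corner polygon into an axis-aligned box."""
    words = []
    for corners, text, _confidence in detections:
        xs = [p[0] for p in corners]
        ys = [p[1] for p in corners]
        words.append(Word(text, min(xs), min(ys), max(xs), max(ys)))
    return words


def table_from_words(words: Sequence[Word], row_tol: float = 5.0,
                     col_gap: float = 10.0) -> List[List[str]]:
    """Hypothetical entry point: build a grid of cell strings from words."""
    if not words:
        return []

    # Rows: words whose vertical centres lie within row_tol of the row's first word.
    rows = []
    for w in sorted(words, key=lambda w: (w.top + w.bottom) / 2):
        cy = (w.top + w.bottom) / 2
        if rows and abs(cy - rows[-1][0]) <= row_tol:
            rows[-1][1].append(w)
        else:
            rows.append((cy, [w]))

    # Columns: merge horizontal extents separated by at most col_gap.
    spans = []
    for w in sorted(words, key=lambda w: w.x0):
        if spans and w.x0 - spans[-1][1] <= col_gap:
            spans[-1][1] = max(spans[-1][1], w.x1)
        else:
            spans.append([w.x0, w.x1])

    # Fill the grid: each word goes to the column span containing its centre.
    table = []
    for _cy, row_words in rows:
        cells = [""] * len(spans)
        for w in sorted(row_words, key=lambda w: w.x0):
            cx = (w.x0 + w.x1) / 2
            col = next(i for i, (lo, hi) in enumerate(spans) if lo <= cx <= hi)
            cells[col] = (cells[col] + " " + w.text).strip()
        table.append(cells)
    return table


detections = [
    ([[10, 10], [60, 10], [60, 30], [10, 30]], "Name", 0.99),
    ([[200, 12], [250, 12], [250, 32], [200, 32]], "Qty", 0.98),
    ([[10, 50], [70, 50], [70, 70], [10, 70]], "Apple", 0.97),
    ([[205, 51], [225, 51], [225, 71], [205, 71]], "3", 0.95),
]
table = table_from_words(words_from_detections(detections))
print(table)  # [['Name', 'Qty'], ['Apple', '3']]
```

Because the function only needs text plus coordinates, any OCR backend that reports word boxes could feed it; only the small adapter (`words_from_detections` here) would change per tool.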

vinayak-mehta commented 3 years ago

"If camelot can offer an entry function that receives a list of words with their bounding box coordinates"

@javiqm12 You can specify table areas and regions with camelot right now. Are you referring to another way to provide bounding box coordinates?
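For reference, a rough sketch of what specifying table areas looks like today. camelot's `read_pdf` accepts `table_areas` as strings of the form `"x1,y1,x2,y2"`, where (x1, y1) is the left-top and (x2, y2) the right-bottom corner in PDF coordinate space (origin at the bottom-left of the page). The helper names below are hypothetical, and the sketch assumes the input boxes are already in PDF points with a top-left origin, as image-based tools typically report them.

```python
from typing import List, Sequence, Tuple

# (x0, top, x1, bottom) with the origin at the top-left, in PDF points (assumed).
BBox = Tuple[float, float, float, float]


def to_camelot_area(bbox: BBox, page_height: float) -> str:
    """Flip a top-left-origin box into camelot's "x1,y1,x2,y2" string."""
    x0, top, x1, bottom = bbox
    # PDF space has its origin at the bottom-left, so the y values are flipped.
    return f"{x0:g},{page_height - top:g},{x1:g},{page_height - bottom:g}"


def read_tables_in(pdf_path: str, boxes: Sequence[BBox], page_height: float):
    """Hypothetical wrapper: restrict stream parsing to the given boxes."""
    import camelot  # third-party; imported lazily so the helper above runs without it

    areas: List[str] = [to_camelot_area(b, page_height) for b in boxes]
    return camelot.read_pdf(pdf_path, flavor="stream", table_areas=areas)


# A box on an A4 page (842 points tall).
print(to_camelot_area((316, 343, 566, 505), 842))  # 316,499,566,337
```

Note that this only narrows where camelot looks for a table; the text inside the area still comes from the PDF's own text layer, which is different from passing in OCR words directly.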