atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

No tables found on page #238

Closed kento1109 closed 5 years ago

kento1109 commented 5 years ago

I want to extract the DXA Results Summary table from a PDF like this one.

Sample_Dexa_Report.pdf

However, I cannot get it to work. (Camelot warns that no tables were found on the page.)

I tried both lattice and stream modes, but neither worked. How can I extract the table from this PDF?
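For reference, trying both modes can be sketched roughly as follows. This is a minimal illustration, not the exact code used in this report: the helper name `count_tables_by_flavor` and the `read_pdf` injection parameter are hypothetical, added only so the loop can be exercised without a real PDF. By default it calls `camelot.read_pdf` with `flavor="lattice"` and `flavor="stream"` and reads the table count from the returned list.

```python
def count_tables_by_flavor(pdf_path, pages="1", read_pdf=None):
    """Return {flavor: number_of_tables} for Camelot's two parsing modes.

    `read_pdf` is a hypothetical hook for testing; when omitted,
    camelot.read_pdf is used.
    """
    if read_pdf is None:
        # Imported lazily so the function can be defined without Camelot installed.
        import camelot
        read_pdf = camelot.read_pdf

    counts = {}
    for flavor in ("lattice", "stream"):
        tables = read_pdf(pdf_path, pages=pages, flavor=flavor)
        counts[flavor] = tables.n  # TableList.n is the number of tables found
    return counts
```

If both counts are 0, as they were here, it is worth checking whether the page has any selectable text at all before tuning Camelot's parameters.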

vinayak-mehta commented 5 years ago

@kento1109 As mentioned in the README: "Camelot only works with text-based PDFs and not scanned documents. If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based."
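The click-and-drag test from the README can be approximated in code. The sketch below assumes pdfminer.six is installed and uses its `extract_text` function; the helper names and the `min_chars` threshold are hypothetical choices for illustration, not part of Camelot.

```python
def has_text_layer(page_texts, min_chars=20):
    """Return True if any page yielded at least `min_chars` non-whitespace characters.

    `min_chars` is an arbitrary illustrative threshold: scanned pages usually
    yield nothing but whitespace or form feeds.
    """
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def extract_page_texts(pdf_path, max_pages=3):
    """Extract text from the first `max_pages` pages, one string per page."""
    # Imported lazily so has_text_layer stays usable without pdfminer.six.
    from pdfminer.high_level import extract_text

    return [extract_text(pdf_path, page_numbers=[i]) for i in range(max_pages)]
```

Calling `has_text_layer(extract_page_texts("Sample_Dexa_Report.pdf"))` would be expected to return `False` for an image-only document like the one attached, in which case Camelot has nothing to parse.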

Aside from being an image, the document you've attached is rotated. You can fix the rotation and try using OCR to extract data from this document.
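One possible way to do this, sketched under several assumptions: pdf2image (which needs Poppler) renders the page, Pillow rotates it, and pytesseract (which needs the Tesseract binary) runs the OCR. The function names, the default angle of 90 degrees, and the `ocr` injection parameter are hypothetical; the correct angle depends on how the page is actually rotated.

```python
def first_page_image(pdf_path, dpi=300):
    """Render the first page of a PDF to a PIL image (requires pdf2image and Poppler)."""
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=dpi, first_page=1, last_page=1)[0]


def ocr_after_rotation(image, ccw_degrees=90, ocr=None):
    """Rotate `image` counter-clockwise by `ccw_degrees`, then OCR it.

    `ocr` is a hypothetical hook for testing; when omitted,
    pytesseract.image_to_string is used.
    """
    if ocr is None:
        import pytesseract
        ocr = pytesseract.image_to_string

    # expand=True grows the canvas so a 90-degree turn does not crop the page.
    upright = image.rotate(ccw_degrees, expand=True)
    return ocr(upright)
```

Note that the result is plain text, not a table. Splitting it into rows and columns is a separate step that Camelot does not handle for scanned input.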

kento1109 commented 5 years ago

Thank you for the quick reply! I noticed that this PDF is image-based when I parsed it with pdfminer.

First of all, I tried OCR to convert the image to text data.