Should be able to detect tables in image/scan based PDFs as it converts pages to images itself too

adworse / iguvium

Ruby gem for extracting tables from PDF as structured info

MIT License

Should be able to detect tables in image/scan based PDFs as it converts pages to images itself too #2

Closed: ziaulrehman40 closed this 3 years ago

ziaulrehman40 commented 4 years ago

I am just wondering why image-based PDFs (generated from scans or printed to PDF through the Windows PDF printer, let's say) are not parsed properly, even though the library says:

 It prints PDF to an image file with GhostScript, then analyses the image.

I know this might be a known limitation, but it doesn't seem so from the README, and if it is a limitation, the README should indicate it.

PS: Works great on text-based PDFs.

Thank you.

adworse commented 4 years ago

Thank you!

The thing is, it prints everything except text and images (the -dFILTERTEXT and -dFILTERIMAGE params of GhostScript, which leave lines, curves, etc.) to analyze table structure. Text fields are extracted from PDF codepoints, if there are any. Doing otherwise would require a full-blown OCR solution, something like FineReader. So scanned image-only PDFs are a perfect mismatch: nothing is actually printed and there's no text to extract.
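
To make the effect of those two flags concrete, here is a minimal sketch of a Ghostscript command that renders one page with text and raster images filtered out. It is not iguvium's actual code: the helper name `vector_only_render_command`, the `pnggray` device, the 150 dpi default, and the file names are all assumptions chosen for illustration. Only the flags themselves are standard Ghostscript options.

```ruby
# Hypothetical helper (not part of iguvium): builds a Ghostscript command
# that renders a single page while dropping text and raster images, so only
# vector content such as lines and curves ends up in the output image.
def vector_only_render_command(pdf_path, page:, output_path:, dpi: 150)
  [
    'gs',
    '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE',
    '-sDEVICE=pnggray',            # assumed output device: grayscale PNG
    "-r#{dpi}",                    # assumed resolution
    '-dFILTERTEXT',                # skip text objects
    '-dFILTERIMAGE',               # skip raster images
    "-dFirstPage=#{page}",
    "-dLastPage=#{page}",
    "-sOutputFile=#{output_path}",
    pdf_path
  ]
end

command = vector_only_render_command('scan.pdf', page: 1, output_path: 'page1.png')
puts command.join(' ')

# To actually render the page (requires Ghostscript on the PATH):
# system(*command)
```

On a scanned PDF each page is typically one large raster image, so -dFILTERIMAGE removes it and the rendered page comes out blank, leaving no table lines to detect.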

ziaulrehman40 commented 4 years ago

@adworse Thank you for clarifying. Would it make sense to put a note about this slight ambiguity in the README for people like me? 🙂

PS: Again, works great on text-based PDFs. Kudos to you.

adworse commented 4 years ago

It actually does! Thank you for the idea and for the kind words :) I will edit the README in the next couple of days.

adworse commented 4 years ago

Thank you again for bringing the issue to my attention! I've added a bit of clarification to the README file.