Closed: ziaulrehman40 closed this 3 years ago
Thank you!
The thing is, to analyze table structure it prints everything except text and images (the -dFILTERTEXT and -dFILTERIMAGE params of GhostScript, which leave lines, curves, etc.). Text fields are extracted from PDF codepoints, if there are any. Doing otherwise would require a full-blown OCR solution, something like FineReader. So scanned, image-only PDFs are the perfect mismatch: nothing is actually printed and there's no text to extract.
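For anyone curious what that filtering step looks like, here is a minimal sketch in Python. It is not the library's actual code: the function names, the `pnggray` device, and the 150 dpi default are assumptions chosen for illustration. Only the `-dFILTERTEXT` and `-dFILTERIMAGE` switches come from the explanation above.

```python
import shutil
import subprocess


def build_gs_command(pdf_path, png_path, page=1, dpi=150, gs_binary="gs"):
    """Build a Ghostscript command that renders one page without text or images.

    Hypothetical helper for illustration; only vector graphics (lines,
    curves, fills) should remain in the rendered output.
    """
    return [
        gs_binary,
        "-dQUIET", "-dBATCH", "-dNOPAUSE",  # non-interactive run
        "-sDEVICE=pnggray",                 # assumed output device
        f"-r{dpi}",                         # assumed resolution
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        "-dFILTERTEXT",                     # drop text objects
        "-dFILTERIMAGE",                    # drop raster images
        f"-sOutputFile={png_path}",
        pdf_path,
    ]


def render_line_art(pdf_path, png_path, **kwargs):
    """Run Ghostscript; raises if the binary is missing or the run fails."""
    cmd = build_gs_command(pdf_path, png_path, **kwargs)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"Ghostscript binary not found: {cmd[0]}")
    subprocess.run(cmd, check=True)
```

With a scanned PDF, each page is a single raster image, so `-dFILTERIMAGE` removes the whole page content and the rendered PNG comes out blank. That is why there are no table lines to detect in that case.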
@adworse Thank you for clarifying. Would it make sense to put a note about this slight ambiguity in the readme, for people like me? :slightly_smiling_face:
PS: Again, it works great on text-based PDFs. Kudos to you.
It actually does! Thank you for the idea and for the kind words :) I will edit the readme in the next couple of days.
Thank you again for bringing the issue to my attention! I've added a bit of clarification to the readme file.
I am just wondering why image-based PDFs (generated from scans or printed to PDF through the Windows PDF printer, let's say) are not parsed properly, even though the library says:
I know this might be a known limitation, but it doesn't seem so from the readme, and if it is a limitation, the readme should indicate it.
PS: Works great on text-based PDFs.
Thank you.