KurtCode / PDFKitten

A framework for extracting data from PDFs in iOS
MIT License
391 stars, 113 forks

PDF Text scanner missing line breaks and space #39

Open omerabbas01 opened 12 years ago

omerabbas01 commented 12 years ago

Hello,

Thank you for providing such a beautiful framework for handling PDFs. It has saved me a lot of time and helped me a lot. I noticed a few things while building a custom text-highlighting feature that highlights text as it is read aloud by text-to-speech. I use an NSRange to determine which part of the string to highlight, and so far the highlighting works well. However, there are some issues with the text scanned from the PDF: some spaces between words are missing, and line breaks are missing too.

I have never worked with PDFs before and don't know much about the format, but I am now looking into how things work. I found that you are using CGPDFScannerRef to scan text from the PDF, so there must be something I can do to get better text. Could you please point me to where I should look, and to any tutorial on CGPDFScannerRef?

Thank you!

KurtCode commented 12 years ago

Spaces and line breaks may not (and will most likely not) be represented as characters in text objects. Instead, while drawing the document, you will be instructed to move the current point of focus (the "cursor") something like 12 points to the right, i.e. a space between two words.

As I recall, the width of a space is not included in the font, so you would have to listen for those operators that change the text matrix, and decide whether the horizontal translation is large enough to be a space character. There are separate operators for newlines, so that one is easy to implement.
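A minimal sketch of that heuristic, in plain C with no Core Graphics dependency. It assumes your scanner callbacks have already recorded, for each shown string, its starting x, its baseline y, its advance width and the font size. The `TextRun` struct, the `join_runs` function and both thresholds are hypothetical names and values chosen for illustration; they are not part of PDFKitten or CGPDFScanner.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one shown string, filled in by your own callbacks. */
typedef struct {
    const char *text;   /* decoded characters of the run */
    double x;           /* x of the run's origin */
    double y;           /* baseline y of the run */
    double width;       /* advance width of the whole run */
    double font_size;   /* font size in effect for the run */
} TextRun;

/* Assumed thresholds, as fractions of the font size; tune per document. */
#define SPACE_GAP_FACTOR 0.25
#define LINE_GAP_FACTOR  0.5

/* Concatenates the runs into out (capacity cap, always NUL-terminated),
 * inserting ' ' or '\n' where the positions suggest one.
 * Returns the number of characters written, excluding the terminator. */
size_t join_runs(const TextRun *runs, size_t count, char *out, size_t cap)
{
    size_t len = 0;
    if (cap == 0) return 0;
    out[0] = '\0';

    for (size_t i = 0; i < count; i++) {
        if (i > 0) {
            const TextRun *prev = &runs[i - 1];
            double dy = runs[i].y - prev->y;
            if (dy < 0) dy = -dy;
            /* Horizontal distance between the end of the previous run
             * and the start of this one. */
            double gap = runs[i].x - (prev->x + prev->width);
            char sep = 0;

            if (dy > LINE_GAP_FACTOR * prev->font_size)
                sep = '\n';   /* baseline moved: treat as a line break */
            else if (gap > SPACE_GAP_FACTOR * prev->font_size)
                sep = ' ';    /* large horizontal jump: treat as a space */

            if (sep && len + 1 < cap) {
                out[len++] = sep;
                out[len] = '\0';
            }
        }

        /* Append the run's text, truncating if the buffer is full. */
        size_t n = strlen(runs[i].text);
        if (n > cap - 1 - len) n = cap - 1 - len;
        memcpy(out + len, runs[i].text, n);
        len += n;
        out[len] = '\0';
    }
    return len;
}
```

The comparison is made against the font size rather than a fixed number of points so that the same thresholds behave reasonably for small and large text. In a real document the positions would also have to be taken after applying the text matrix, which this sketch leaves to the caller.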

Hope this helps.

omerabbas01 commented 12 years ago

Thank you for the reply. Seems like this is going to be a tough job; I haven't looked into fonts yet. I'm going to look into it and will let you know if I succeed.

Thank you

KurtCode commented 12 years ago

Sure, working with PDFs gets complicated sometimes.

hugo53 commented 10 years ago

@omerabbas01 Have you resolved your problem? I am stuck on this issue and am using a temporary workaround: split a multi-word keyword into separate words, search for each word, then run some fairly complex code to locate the right position for all the words in the keyword. After that, draw all the result frames!
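
For anyone trying the same workaround, here is a rough sketch of the word-matching step only, in plain C. It assumes the scanned text has already been broken into an ordered array of words; the `find_phrase` function is a hypothetical helper written for this example, not something PDFKitten provides, and mapping the matched words back to frames is left out.

```c
#include <stddef.h>
#include <string.h>

/* Returns the index of the first entry in words[] at which the
 * space-separated words of keyword match consecutive entries,
 * or -1 if there is no such position (or the keyword has no words). */
long find_phrase(const char *const *words, size_t word_count,
                 const char *keyword)
{
    for (size_t start = 0; start < word_count; start++) {
        const char *p = keyword;
        size_t j = start;
        int matched_any = 0;
        int ok = 1;

        while (ok) {
            while (*p == ' ') p++;            /* skip separators */
            if (*p == '\0') break;            /* keyword exhausted */
            size_t tl = strcspn(p, " ");      /* length of this keyword word */

            if (j >= word_count || strlen(words[j]) != tl ||
                strncmp(words[j], p, tl) != 0) {
                ok = 0;                       /* mismatch at this start */
            } else {
                matched_any = 1;
                j++;
                p += tl;
            }
        }
        if (ok && matched_any) return (long)start;
    }
    return -1;
}
```

Matching is exact and case-sensitive here. Because the comparison is word by word, it does not depend on the scanner having recovered the spaces between the words.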