KurtCode / PDFKitten

A framework for extracting data from PDFs in iOS
MIT License
391 stars 113 forks

Search doesn't find words written in italic #2

Open juulie opened 13 years ago

juulie commented 13 years ago

For some reason, search doesn't find words that are written in italic. I have been looking into it, but I can't find the part where it goes wrong.

KurtCode commented 13 years ago

I think this is part of the issue where some fonts are not interpreted properly, in this case the italic font.

scinfu commented 13 years ago

Any news?

KurtCode commented 13 years ago

Not yet, unfortunately.

On 19 Oct 2011, at 10:34, scinfu reply@reply.github.com wrote:

some news ?

Reply to this email directly or view it on GitHub: https://github.com/KurtCode/PDFKitten/issues/2#issuecomment-2453276

scinfu commented 13 years ago

Where should I look to try to find the bug?

KurtCode commented 13 years ago

All code is on the Git repo. I can't point you in any specific direction since I don't know why this error occurs myself, but feel free to check out the code.

If you're familiar with the PDF document structure: italic text might use a separate font, and maybe that particular font fails to parse in our implementation. Other than that, all I can say is that italic text is just like any other text as far as extracting plain text goes.


scinfu commented 13 years ago

I discovered that the problem happens when the font is of a composite type. Do you have any ideas?

KurtCode commented 13 years ago

Okay, that's good to know. Thanks. :)

Composite fonts are quite complicated, keeping translation tables from character IDs to actual Unicode values, and even entire subfonts embedded within them.
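To make the "translation tables" concrete, here is a minimal sketch of reading the `bfchar` entries of a ToUnicode CMap, which is one place a composite font can store its mapping from character codes to Unicode. The CMap fragment and the names `SAMPLE_CMAP` and `parse_bfchar` are made up for illustration and are not taken from PDFKitten; `bfrange` entries and multi-character targets are ignored here.

```python
import re

# Hypothetical ToUnicode CMap fragment (illustrative data, not from a real PDF).
SAMPLE_CMAP = """
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0024> <0074>
<0025> <0065>
<0026> <0073>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return a dict mapping character codes to Unicode strings."""
    mapping = {}
    # Only look inside beginbfchar ... endbfchar sections.
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        # Each entry is a pair of hex strings: <source code> <UTF-16BE target>.
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

cid_to_unicode = parse_bfchar(SAMPLE_CMAP)
# Codes with no entry are replaced by U+FFFD so that gaps stay visible.
decoded = "".join(cid_to_unicode.get(code, "\ufffd") for code in [0x24, 0x25, 0x26, 0x24])
print(decoded)  # test
```

If a font's table is missing or not parsed, every code falls through to the replacement character, which would be consistent with a keyword never being recognized.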

The problem is either (1) that the keyword is not recognized, or (2) something to do with geometry.

I realize that it would be so much easier to hammer these bugs out with better insight into what's going on inside the scanner. I shall attempt to have the scanner print some debug text. Hopefully it will be kept to a reasonable amount of text, so that the interesting bits don't get lost in a flood of gibberish.

I'll get back to you later, I think tonight.


KurtCode commented 13 years ago

Hi, I just pushed a new version to the master Git branch!

This should make things easier. Now whenever you view a page in the demo app, you can press the info button in the lower right corner, and it will show you the text that is parsed by the scanner. This string is aggregated by the scanner as it goes, after converting it to Unicode. It's all converted to lowercase since the scanner is case-insensitive.

Note that a PDF document does not necessarily contain space characters; it may instead rely on changing the text transformation matrix to advance the point where glyphs are drawn (like, say, "print foo, then jump one step to the right, and print baz"). That's why words sometimes run together when extracting the raw text content like this.
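One common case of this is the `TJ` operator, whose array mixes strings with numeric position adjustments given in thousandths of a text space unit; a negative number moves the next glyph further to the right. The sketch below shows one possible heuristic for recovering word breaks from such an array. The function name `text_from_tj` and the threshold of 200 are assumptions for illustration, not what the PDFKitten scanner does.

```python
def text_from_tj(items, gap_threshold=200):
    """Join the strings of a TJ-style array, guessing where spaces belong.

    items: strings (already decoded) and numbers (position adjustments).
    gap_threshold: assumed cutoff; larger rightward jumps count as a space.
    """
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif -item > gap_threshold:
            # A large negative adjustment widens the gap: treat it as a word break.
            parts.append(" ")
        # Small adjustments are kerning and produce no character.
    return "".join(parts)

tj = ["foo", -280, "ba", 15, "z"]
print(text_from_tj(tj))  # foo baz
```

Without the gap check, the same array would yield "foobaz", which is the run-together effect described above.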

So, this doesn't solve the problem completely, but it will show you what text the scanner sees when it looks at the words in italic in your PDF document. If the raw text content looks good, then the problem is with the coordinates, not with matching letters.
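That check can be written down as a small helper. This is only a sketch: `raw_text` stands for whatever string the info button shows, and `diagnose` is a hypothetical name. Whitespace is ignored on both sides because the extracted text may lack spaces.

```python
def diagnose(raw_text, keyword):
    """Guess whether a failed search is a decoding or a geometry problem."""
    def squash(s):
        # Lowercase and drop all whitespace, since extracted words may run together.
        return "".join(s.lower().split())

    if squash(keyword) in squash(raw_text):
        # The scanner saw the letters, so the fault is more likely in the coordinates.
        return "geometry"
    # The letters never appeared in the decoded text.
    return "decoding"

print(diagnose("thisis italictext", "Italic text"))        # geometry
print(diagnose("thisis \ufffd\ufffd\ufffd text", "italic"))  # decoding
```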


tanis2000 commented 13 years ago

Hi Kurt,

it looks like the scanner does not take into consideration CID fonts, whose multi-byte character codes should be mapped through their CMaps. I have a simple PDF that has some text objects (BT/ET) that use different fonts. One is a simple font with one character per byte, but the next paragraph is encoded with a different font that uses multi-byte character codes, so the first part is decoded correctly but the second isn't.
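A small sketch of why the code width matters: the same bytes give different character codes depending on whether they are read one byte or two bytes at a time. Fixed two-byte codes are assumed here, as with an Identity-H encoding; a real CMap can declare other codespace ranges. The function name `split_codes` and the sample bytes are made up for illustration.

```python
def split_codes(data, bytes_per_code):
    """Split a PDF string's bytes into big-endian character codes of fixed width."""
    return [
        int.from_bytes(data[i:i + bytes_per_code], "big")
        for i in range(0, len(data), bytes_per_code)
    ]

simple = b"Hi"                   # simple font: one byte per character
cid = bytes.fromhex("00240025")  # CID font: assumed two bytes per character

print(split_codes(simple, 1))  # [72, 105]
print(split_codes(cid, 2))     # [36, 37]
print(split_codes(cid, 1))     # [0, 36, 0, 37]
```

The last line shows what a scanner that always assumes one byte per character would see: twice as many codes, none of which match the font's CMap entries.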

KurtCode commented 13 years ago

Okay, I think we need some more debug info, such as which fonts are used.


tanis2000 commented 13 years ago

I can send you a PDF with the corresponding font analysis if it can be of any help. I thought there was a way to attach files to issues but I can't see it, so here it is from my Dropbox: http://dl.dropbox.com/u/3098924/TestPDF.zip

You will see that the text blocks are mixed simple and CID fonts. The simple part gets decoded correctly but the CID part doesn't. I really hope this can point you in the right direction.

KurtCode commented 13 years ago

Thanks! I hope so too. I'll take a look at it.

Because there are many different types of fonts, the more examples the better, and we learn something from each one.

Pstoppani commented 13 years ago

Here is a super simple test file which repros several "can't find text" and "incorrect selection highlights": http://dl.dropbox.com/u/39382628/test.pdf

Try any of these tests that fail:

Search for "te" : It finds two "te"s, but the highlight on the first "te" is off (really highlights only the "T")
Search for "tes": It finds only first "tes" (bug) and the highlight is slightly off
Search for "テスト": It can't find any

Hitesh136 commented 7 years ago

Any solution?