gettalong / hexapdf

Versatile PDF creation and manipulation for Ruby
https://hexapdf.gettalong.org

Error encoding font when reading the text of a specific PDF #280

Closed by bidsketchris 7 months ago

bidsketchris commented 7 months ago

For about a year we have been using a script to read the text of PDFs. It worked without trouble until we started having issues with one particular PDF. I have attached a working example that reproduces the error with that PDF, along with some instructions: hexa_test.zip

Thanks.

gettalong commented 7 months ago

Thanks for reporting - I can reproduce the problem and will investigate.

gettalong commented 7 months ago

The error occurs because the font has, in principle, a mapping from character representation to Unicode but is missing one such mapping. You can see this by opening the PDF in a reader, selecting the word "activity" in the first line under the heading, and copying it somewhere: at least the "ti" ligature character will not have a Unicode representation.

Usually you can handle such cases using the configuration option 'font.on_missing_unicode_mapping', but that fails here because HexaPDF raises an error before the option is consulted. I will fix that so that the configuration option is triggered.

gettalong commented 7 months ago

I have fixed the problem. As of the next release, text extraction will work for this PDF once you properly configure the font.on_missing_unicode_mapping option.
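
For readers landing here, a minimal sketch of what configuring that option might look like. It assumes the hook is any object responding to `call(code, font_dict)` that returns a String, as HexaPDF's configuration documentation describes. The constant `REPLACE_MISSING_GLYPH`, the helper `open_with_fallback`, and the choice of U+FFFD as the substitute are illustrative and not part of HexaPDF or of the fix itself.

```ruby
# Fallback for character codes that have no Unicode mapping in the font.
# Assumption: HexaPDF calls the hook with (code, font_dict) and uses the
# returned String in place of the unmapped character.
REPLACE_MISSING_GLYPH = lambda do |_code, _font_dict|
  "\uFFFD" # Unicode replacement character; an illustrative choice
end

# Hypothetical helper: opens a PDF and installs the fallback so that text
# extraction substitutes a placeholder instead of raising an error.
def open_with_fallback(path)
  require 'hexapdf'
  doc = HexaPDF::Document.open(path)
  doc.config['font.on_missing_unicode_mapping'] = REPLACE_MISSING_GLYPH
  doc
end
```

With this in place, a word such as "activity" with an unmapped "ti" ligature would presumably come out with a single U+FFFD where the ligature was. If the PDF and font are known in advance, the callback could instead return the intended characters (for example "ti") for a specific code.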