gettalong / hexapdf

Versatile PDF creation and manipulation for Ruby
https://hexapdf.gettalong.org

Odd behavior with soft-hyphens #332

Open mockdeep opened 1 week ago

mockdeep commented 1 week ago

We have a user who copy/pasted some text from somewhere, and it has a mixture of soft hyphens and hard hyphens:

"06\xAD-220-\xAD3010-\xAD0-\xAD1110-\xAD1000-\xAD2111"

In the browser, this displays with just the normal hyphens:

"06-220-3010-0-1110-1000-2111"

But when we write it to a PDF with HexaPDF, it strips out all of the hyphens:

"0622030100111010002111"

My understanding of soft-hyphens is very limited, but based on my reading it doesn't seem like it should also cause adjacent hard hyphens to be removed. Am I missing something?
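
For reference, a minimal Ruby sketch of what the string above contains. It assumes the input is valid UTF-8, so the soft hyphen is written as "\u00AD" rather than the raw byte "\xAD". The names SOFT_HYPHEN and strip_soft_hyphens are made up for this illustration and are not part of HexaPDF. Stripping the soft hyphens changes where a line may break, so this is only a possible workaround, not a fix.

```ruby
# U+00AD SOFT HYPHEN: an invisible hint marking where a line may be broken.
SOFT_HYPHEN = "\u00AD"

# Hypothetical helper: removes every soft hyphen and leaves
# hard hyphens (U+002D) untouched.
def strip_soft_hyphens(text)
  text.delete(SOFT_HYPHEN)
end

input = "06\u00AD-220-\u00AD3010-\u00AD0-\u00AD1110-\u00AD1000-\u00AD2111"

puts input.count(SOFT_HYPHEN)   # soft hyphens in the input
puts input.count("-")           # hard hyphens in the input
puts strip_soft_hyphens(input)  # the text as the browser displays it
```

With the soft hyphens removed, the result is the same string the browser displays, so the hard hyphens are present in the data itself.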

gettalong commented 1 week ago

I tried to create a reproducible example:

require 'hexapdf'

HexaPDF::Composer.create('gh332.pdf') do |c|
  c.text("06\u00AD-220-\u00AD3010-\u00AD0-\u00AD1110-\u00AD1000-\u00AD2111")
end

This results in a PDF that looks like this in Okular:

[Image: screenshot of the PDF rendered in Okular]

Copying and pasting that string from Okular works as expected, and hexapdf inspect shows the same result (see the text> line):

$ hexapdf ins gh332.pdf psd 1
save_graphics_state
  set_font_and_size /F1 10
  begin_text
    set_text_matrix 1 0 0 1 36 799.059764
    text> 06-220-3010-0-1110-1000-2111
  end_text
restore_graphics_state

Could you provide an example that shows the wrong behaviour?

mockdeep commented 1 week ago

@gettalong hmm, it looks like it may be a font issue. When we render it with the default font it does show the dashes, but when we render it with Arimo they disappear. I'm guessing that means this has nothing to do with HexaPDF. I'll leave it to you to close if you agree.

gettalong commented 1 week ago

@mockdeep I will have a look at the font and will report back.

gettalong commented 1 week ago

@mockdeep I have looked at the font and what HexaPDF does with it.

I will have to see how to handle this because it affects various parts of the font handling and font embedding code.

mockdeep commented 1 week ago

Holy cow, this rabbit hole just goes deeper and deeper. When I started to look into it I figured it was a missing glyph, then I circled around to a string encoding issue, and now it's a sort-of-missing glyph.

gettalong commented 6 days ago

@mockdeep I have wrapped my head around this and have a potential solution. It works correctly for the case you brought up here, but I still need to test the case of two codepoints mapped to the same glyph where both codepoints are normal characters (i.e. not a soft hyphen, line break, etc.).
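
To illustrate the condition mentioned above (two codepoints mapped to the same glyph), here is a toy sketch in plain Ruby. It does not use HexaPDF's font classes and does not show how HexaPDF handles the situation. SAMPLE_CMAP, its glyph IDs, and shared_glyphs are all hypothetical. The glyph IDs are invented and not read from Arimo.

```ruby
# Hypothetical excerpt of a font's character map: codepoint => glyph ID.
# The glyph IDs are invented for illustration only.
SAMPLE_CMAP = {
  0x002D => 16, # HYPHEN-MINUS
  0x00AD => 16, # SOFT HYPHEN, sharing the hyphen glyph
  0x0030 => 19, # DIGIT ZERO
  0x0031 => 20, # DIGIT ONE
}.freeze

# Returns glyph ID => sorted codepoints for every glyph that is the
# target of more than one codepoint.
def shared_glyphs(cmap)
  cmap.group_by { |_codepoint, glyph_id| glyph_id }
      .transform_values { |pairs| pairs.map(&:first).sort }
      .select { |_glyph_id, codepoints| codepoints.size > 1 }
end

shared_glyphs(SAMPLE_CMAP).each do |glyph_id, codepoints|
  names = codepoints.map { |cp| format("U+%04X", cp) }.join(", ")
  puts "glyph #{glyph_id} is shared by #{names}"
end
```

In such a map, going from a glyph ID back to a single codepoint is ambiguous, which is why the case of two normal characters sharing a glyph needs separate testing.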

mockdeep commented 4 days ago

@gettalong thanks so much! I really appreciate how responsive you've been on this stuff.

gettalong commented 2 days ago

@mockdeep Could you please try out the devel branch, which should fix your problem?