Odd behavior with soft-hyphens

gettalong / hexapdf

Versatile PDF creation and manipulation for Ruby

https://hexapdf.gettalong.org

Odd behavior with soft-hyphens #332

Closed mockdeep closed 1 month ago

mockdeep commented 1 month ago

We have a user who copy/pasted some text from somewhere, and it has a mixture of soft hyphens and hard hyphens:

"06\xAD-220-\xAD3010-\xAD0-\xAD1110-\xAD1000-\xAD2111"

In the browser, this displays with just the normal hyphens:

"06-220-3010-0-1110-1000-2111"

But when we write it to a PDF with HexaPDF, it strips out all of the hyphens:

"0622030100111010002111"

My understanding of soft-hyphens is very limited, but based on my reading it doesn't seem like it should also cause adjacent hard hyphens to be removed. Am I missing something?
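For reference, a minimal sketch of what the browser effectively shows: the soft hyphen (U+00AD) is only an optional break point and is normally invisible, so removing it should leave the hard hyphens untouched. The helper name `strip_soft_hyphens` is hypothetical, and the string is written with `\u00AD` escapes on the assumption that the text is UTF-8.

```ruby
# Soft hyphen (U+00AD): an optional break point, normally not displayed.
SOFT_HYPHEN = "\u00AD"

# The reported input, written with \u escapes so it is valid UTF-8.
input = "06\u00AD-220-\u00AD3010-\u00AD0-\u00AD1110-\u00AD1000-\u00AD2111"

# Hypothetical helper: drop soft hyphens and keep every other character.
def strip_soft_hyphens(text)
  text.delete(SOFT_HYPHEN)
end

visible = strip_soft_hyphens(input)

puts visible                    # 06-220-3010-0-1110-1000-2111
puts input.count(SOFT_HYPHEN)   # 6
puts visible.count("-")         # 6
```

Stripping soft hyphens before rendering is also a possible stopgap when a font misbehaves, at the cost of losing the optional break points.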

gettalong commented 1 month ago

I tried to create a reproducible example:

require 'hexapdf'

HexaPDF::Composer.create('gh332.pdf') do |c|
  c.text("06\u00AD-220-\u00AD3010-\u00AD0-\u00AD1110-\u00AD1000-\u00AD2111")
end

This results in a PDF that looks like this in Okular:

Copying and pasting that string from Okular works as expected, and hexapdf inspect also shows the hyphens (see the text> line):

$ hexapdf ins gh332.pdf psd 1
save_graphics_state
  set_font_and_size /F1 10
  begin_text
    set_text_matrix 1 0 0 1 36 799.059764
    text> 06-220-3010-0-1110-1000-2111
  end_text
restore_graphics_state

Could you provide an example that shows the wrong behaviour?

mockdeep commented 1 month ago

@gettalong hmm, it looks like it may be a font issue. When we render it with the default font it does show the dashes, but when we render it with Arimo they disappear. I'm guessing that means this has nothing to do with HexaPDF. I'll leave it to you to close if you agree.

gettalong commented 1 month ago

@mockdeep I will have a look at the font and will report back.

gettalong commented 1 month ago

@mockdeep I have looked at the font and what HexaPDF does with it.

The font uses the same glyph (id=16) for hyphens and soft-hyphens. Usually, soft-hyphens are not represented in a font at all.
When HexaPDF encounters the first soft-hyphen, it retrieves the glyph for the soft-hyphen (id=16) and memoizes it for faster retrieval next time. It also stores a mapping from the soft-hyphen codepoint to the memoized glyph.
When HexaPDF then encounters the first hyphen, it finds that the hyphen is also mapped to the glyph with id=16, which is already memoized, and stores a mapping from the hyphen codepoint to that same glyph.
A glyph object in HexaPDF, however, is not designed to represent two different codepoints. So when the layout code comes across the glyph, it sees the glyph for a soft-hyphen, regardless of whether it was looked up for a hyphen or a soft-hyphen.
This means that whichever occurs first, hyphen or soft-hyphen, determines how all of them are represented, which is clearly a bug in HexaPDF.
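The steps above can be modelled with a small, self-contained sketch. It is not HexaPDF's code: `Glyph`, `CMAP`, `BuggyGlyphCache` and `PerCodepointGlyphCache` are hypothetical stand-ins that only illustrate how memoizing one glyph object per glyph id lets the first codepoint win. The second cache shows one possible way to avoid that (keying on glyph id plus codepoint), which is an assumption and not necessarily the fix that was chosen.

```ruby
# Hypothetical stand-in for a glyph object that remembers a single codepoint.
Glyph = Struct.new(:id, :codepoint)

# Toy character map: soft hyphen and hyphen both map to glyph id 16,
# as in the font discussed above.
CMAP = { 0x00AD => 16, 0x002D => 16 }.freeze

# Models the described behaviour: glyph objects are memoized per glyph id.
class BuggyGlyphCache
  def initialize
    @by_id = {}
    @by_codepoint = {}
  end

  def glyph_for(codepoint)
    @by_codepoint[codepoint] ||= begin
      id = CMAP.fetch(codepoint)
      # The first codepoint seen for this id is stored in the glyph for good.
      @by_id[id] ||= Glyph.new(id, codepoint)
    end
  end
end

# One possible alternative (an assumption): memoize per (id, codepoint) pair.
class PerCodepointGlyphCache
  def initialize
    @by_key = {}
  end

  def glyph_for(codepoint)
    id = CMAP.fetch(codepoint)
    @by_key[[id, codepoint]] ||= Glyph.new(id, codepoint)
  end
end

buggy = BuggyGlyphCache.new
buggy.glyph_for(0x00AD)                    # soft hyphen is seen first
hyphen = buggy.glyph_for(0x002D)           # hard hyphen reuses that glyph
puts format("U+%04X", hyphen.codepoint)    # U+00AD

fixed = PerCodepointGlyphCache.new
fixed.glyph_for(0x00AD)
puts format("U+%04X", fixed.glyph_for(0x002D).codepoint)  # U+002D
```

In the buggy model the hard hyphen is reported as a soft hyphen, which matches the symptom of all hyphens disappearing once a soft hyphen has been seen first.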

I will have to see how to handle this because it affects various parts of the font handling and font embedding code.

mockdeep commented 1 month ago

Holy cow, this rabbit hole just goes deeper and deeper. When I started to look into it I figured it was a missing glyph, then I circled around to a string encoding issue, and now it's a sort-of-missing glyph.

gettalong commented 1 month ago

@mockdeep I have wrapped my head around this and have a potential solution. It works correctly for the case you brought up here, but I still need to test the case of two codepoints mapped to the same glyph where both codepoints are normal characters (i.e. neither is a soft-hyphen, a line break, or similar).

mockdeep commented 1 month ago

@gettalong thanks so much! I really appreciate how responsive you've been on this stuff.

gettalong commented 1 month ago

@mockdeep Could you please try out the devel branch which should fix your problem?

mockdeep commented 1 month ago

@gettalong sorry for the delay. I just got a chance to try out your branch and it works! The hyphens are appearing as expected.

gettalong commented 1 month ago

@mockdeep Perfect! You can expect a release with the fix this weekend.