gettalong / hexapdf

Versatile PDF creation and manipulation for Ruby
https://hexapdf.gettalong.org

Handling Missing Glyph Errors and Unexpected Differences Between Drawing Methods #254

Closed dlfischer-cmm closed 1 year ago

dlfischer-cmm commented 1 year ago

Hello, my team has run into a problem with HexaPDF that we have been unable to solve. We're creating a PDF that displays user input. In some cases, users have entered tab characters (\t) as content, which triggers the HexaPDF error "Glyph for "\t" missing". This needs to be handled gracefully in a production environment; a missing glyph should not be a reason for PDF creation to fail. We tried the doc.config['font.on_missing_glyph'] option without success, and in the process I think we uncovered a bug in how text formatting (or at least line breaks) is handled when this config option is used.

Using the following example:

      text_val = "\nTab Character from \\t: \t pasted tab character:    carriage return:
after carriage return \r\nTHIS SHOULD BE ON ITS OWN LINE\r uiop[]asdfghjkl;'\zxcvbnm,./1234567890-\n~!@# $%^&*()_+}{|\":?><|\n¶öóáßðfïghäåéëöóþü朩µbñ¹²³¤‘\nEmojis (invalid glyphs): 🐭 🐮 🐱 🐵 😀"

      doc = HexaPDF::Document.new
      doc.config['font.map'] = {
        'OpenSans' => {
            none: Rails.root.join('lib/fonts/OpenSans-Regular.ttf'),
            bold: Rails.root.join('lib/fonts/OpenSans-Bold.ttf'),
            italic: Rails.root.join('lib/fonts/OpenSans-Italic.ttf')
          }
      }

      # Glyph solution #1 from HexaPDF example at https://gist.github.com/gettalong/5f13d27a2170e507cd890aa3a4273a43
      doc.config['font.on_missing_glyph'] = ->(n,f) { f.wrapped_font.missing_glyph_id }

      # Glyph solution #2
      # doc.config['font.on_missing_glyph'] = ->(n,f) {  f.glyph(0) }

      # Set the font
      font_name = "Helvetica"
      # font_name = "OpenSans"

      # Draw using canvas.text
      canvas = doc.pages.add.canvas
      canvas.font(font_name, size: 12, variant: :none)
      canvas.text("Writing with canvas.text:#{text_val}", at: [50, 750])

      # Draw using frame.draw with formatted_text_box
      style = HexaPDF::Layout::Style.new
      style.font = font_name
      style.font_size = 12  
      frame = HexaPDF::Layout::Frame.new(50, 250, 400, 400)
      box = doc.layout.formatted_text_box(["Writing with frame.draw (formatted_text_box): #{text_val}"], style: style)
      fit_result = frame.fit(box)
      frame.draw(canvas, fit_result) if fit_result.success?

      io = Tempfile.new('temp_file.pdf')
      doc.write(io)
      io.rewind
      send_file(io, filename: 'tempfile.pdf', type: 'application/pdf', disposition: :inline)

Results: When neither glyph solution is used, this code raises the error Glyph for "\t" missing regardless of the font used.

Comparing the results of using Glyph solutions 1 & 2 along with each font (Helvetica and Open Sans) yields unexpected results.

The expected result is what you see in sample screenshot 2, but with the correct formatting for frame.draw and working for both fonts. In other words, replace missing glyphs with a default glyph and preserve the rest of the text formatting / line breaks.

Any solution you can offer is much appreciated!

Sample 1: HexaPDF Sample 1

Sample 2: HexaPDF Sample 2

gettalong commented 1 year ago

Thanks for opening this issue!

Regarding your results

Your first solution using ->(n,f) { f.wrapped_font.missing_glyph_id } won't work correctly since the result of the lambda needs to be a Glyph object, not a Symbol (in case of Helvetica) or Integer (in case of OpenSans). This was changed 6 years ago.

The second solution using ->(n,f) { f.glyph(0) } will work for TrueType fonts since they use integer glyph IDs but not for Type1 fonts which use glyph names (symbols). You would need to base the result on the f argument.

Generally, the built-in fonts like Helvetica don't have a special glyph representing '.notdef', so you have to choose one of the available glyphs for representing a missing character. TrueType fonts, on the other hand, must have a glyph with ID=0 representing a missing glyph.

And yes, there is a difference in line break handling between Canvas#text and the TextLayouter class. The former uses the provided string and splits it on valid Unicode newline separators. Then those lines are converted to arrays of Glyph objects and those are directly rendered.
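
The splitting step that Canvas#text performs can be approximated in plain Ruby. This is only a sketch: the separator set in the regular expression is an assumption about what counts as a Unicode newline, not a copy of HexaPDF's implementation, and split_lines is a hypothetical helper name.

```ruby
# Hypothetical helper, not part of HexaPDF: splits text on common Unicode
# newline separators. The exact separator set is an assumption.
NEWLINE_SEPARATORS = /\r\n|[\n\u000B\u000C\r\u0085\u2028\u2029]/

def split_lines(text)
  # A limit of -1 keeps trailing empty lines instead of dropping them.
  text.split(NEWLINE_SEPARATORS, -1)
end

lines = split_lines("first\r\nsecond\rthird\nfourth\tstill fourth")
lines.each { |line| puts line.inspect }
```

Note that a tab is not a line separator, so it stays inside its line and is still subject to the missing-glyph handling when that line is converted to glyphs.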

The TextLayouter transforms the whole given text string into an array of Glyph objects. During this process, characters such as \t and \n are mapped by default (via font.on_missing_glyph) to InvalidGlyph instances. When laying out the text (i.e. the array of glyph objects), those InvalidGlyph objects are transformed into usable objects if they represent certain special characters like \t or \n.

When font.on_missing_glyph is changed to always return the same glyph, the information about which character each glyph originally represented is lost. Line breaks, tabs, and so on can then no longer be identified and therefore won't work.
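
The effect can be illustrated with a toy model in plain Ruby. This is not HexaPDF code: every class and method name below is hypothetical, and the layouter is reduced to a single rule (start a new line when a glyph carries the string "\n"), which is a simplifying assumption.

```ruby
# Toy model, not HexaPDF code: all names here are hypothetical.
ToyGlyph = Struct.new(:id, :str)        # a drawable glyph plus the text it stands for
ToyInvalidGlyph = Struct.new(:str)      # placeholder for a character the font lacks

FONT_CHARS = ("a".."z").to_a + [" "]    # pretend the font covers only these characters

# Default behavior: unknown characters become placeholders that keep their string.
def to_glyphs_default(text)
  text.each_char.map do |c|
    FONT_CHARS.include?(c) ? ToyGlyph.new(c.ord, c) : ToyInvalidGlyph.new(c)
  end
end

# "Always return the same glyph": the original character is dropped.
def to_glyphs_fixed(text)
  fallback = ToyGlyph.new(:fallback, "?")
  text.each_char.map do |c|
    FONT_CHARS.include?(c) ? ToyGlyph.new(c.ord, c) : fallback
  end
end

# The toy layouter can only break a line if it can still see the "\n".
def layout(glyphs)
  glyphs.each_with_object([[]]) do |glyph, lines|
    if glyph.str == "\n"
      lines << []
    else
      lines.last << glyph
    end
  end
end

puts layout(to_glyphs_default("ab\ncd")).size  # => 2
puts layout(to_glyphs_fixed("ab\ncd")).size    # => 1
```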

Solution

Use the following for font.on_missing_glyph:

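require 'hexapdf/font/type1_wrapper'
require 'hexapdf/font/true_type_wrapper'

# Make the wrappers' Glyph classes accessible from outside the wrapper classes.
HexaPDF::Font::Type1Wrapper.public_constant(:Glyph)
HexaPDF::Font::TrueTypeWrapper.public_constant(:Glyph)

# c is the character without a glyph, f is the font wrapper in use.
doc.config['font.on_missing_glyph'] = lambda do |c, f|
  if f.font_type == :Type1
    # Type1 fonts address glyphs by name: draw the question mark, keep c as the string.
    HexaPDF::Font::Type1Wrapper::Glyph.new(f.wrapped_font, :question, c)
  else
    # TrueType fonts address glyphs by ID: ID 0 is the missing glyph, keep c as the string.
    HexaPDF::Font::TrueTypeWrapper::Glyph.new(f.wrapped_font, 0, c)
  end
end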

This way glyph objects are returned referencing a known, existing glyph in the font (the question mark for Helvetica and the missing glyph for OpenSans) but with different string representations (allowing the TextLayouter to do its work correctly with respect to newlines, tabs, etc.).
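
The idea of separating what is drawn from what the glyph stands for can be sketched in plain Ruby. This is a toy model under stated assumptions, not HexaPDF code: CustomGlyph, to_custom_glyphs, and break_into_lines are hypothetical names, and the line-breaking rule is deliberately simplified.

```ruby
# Toy model, not HexaPDF code: all names here are hypothetical.
# id  = which glyph gets drawn
# str = which character the glyph stands for
CustomGlyph = Struct.new(:id, :str)

KNOWN_CHARS = ("a".."z").to_a + [" "]   # pretend the font covers only these characters
FALLBACK_ID = :question                 # glyph drawn for everything else

# Unknown characters share one drawable glyph but keep their own string.
def to_custom_glyphs(text)
  text.each_char.map do |c|
    id = KNOWN_CHARS.include?(c) ? c.to_sym : FALLBACK_ID
    CustomGlyph.new(id, c)
  end
end

# Simplified layouter: a glyph whose string is "\n" starts a new line.
def break_into_lines(glyphs)
  glyphs.each_with_object([[]]) do |glyph, lines|
    if glyph.str == "\n"
      lines << []
    else
      lines.last << glyph
    end
  end
end

result = break_into_lines(to_custom_glyphs("ab\ncd!"))
puts result.size                   # => 2
puts result.last.map(&:id).inspect # => [:c, :d, :question]
```

Because the string survives, the newline is still recognized, while the unsupported "!" is drawn with the fallback glyph.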

I will think about how to make this easier since this is probably something many people would want to do.

gettalong commented 1 year ago

@dlfischer-cmm The next version of HexaPDF comes with a helper method that allows you to achieve the solution in an easier way. This is now also documented in the font.on_missing_glyph configuration option:

doc.config['font.on_missing_glyph'] = lambda do |character, font_wrapper|
  font_wrapper.custom_glyph(font_wrapper.font_type == :Type1 ? :question : 0, character)
end

dlfischer-cmm commented 1 year ago

Excellent! Thank you for the detailed explanation. My team has implemented the solution you provided and it's working perfectly in our tests. We appreciate your help! Have a great day. :)