MatthiasValvekens / pyHanko

pyHanko: sign and stamp PDF files
MIT License

latin1 causes error #139

Closed B0Gec closed 2 years ago

B0Gec commented 2 years ago

Describe the bug: signing text that contains non-latin1 characters causes an error. I recommend 'utf-8' instead of 'latin1' in fonts/basic.py.

MatthiasValvekens commented 2 years ago

Hi @bostjangec, thanks for your comment. I'm assuming this is about how SimpleFontEngine encodes text?

PyHanko supports writing Unicode text, but unfortunately it's going to be a bit more complicated than just writing UTF-8 to the content stream.

PDF's text display features are older than Unicode, and displaying non-Latin text "properly" requires some effort. There are a number of very simple "standard" fonts that (virtually) all PDF readers will offer, and (oversimplifying a little bit) those all work with the Latin character set. That works fine for very simple things, but (as you have discovered) it doesn't really generalise well. This is also why pyHanko uses latin1 in SimpleFontEngine. That was a deliberate choice, since arbitrary UTF-8 probably wouldn't work in a lot of viewers anyhow.
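To make the failure mode concrete, here is a minimal sketch using only the Python standard library. It does not call pyHanko; the sample strings and the helper `encodes_as_latin1` are made up for illustration. It shows why a latin1 encode step rejects some text, and why swapping in UTF-8 would only change the bytes without giving a single-byte font any way to display them.

```python
# Hypothetical sample strings, not taken from pyHanko.
text_latin = "Signed by: José"   # every character exists in latin1
text_other = "Podpis: čšž"       # č, š and ž are outside latin1


def encodes_as_latin1(s: str) -> bool:
    """Return True if every character of s has a latin1 code point."""
    try:
        s.encode("latin1")
        return True
    except UnicodeEncodeError:
        return False


print(encodes_as_latin1(text_latin))
print(encodes_as_latin1(text_other))

# UTF-8 can encode the character, but as two bytes. A consumer that maps
# each byte to one glyph of a Latin font would see two unrelated characters.
utf8_bytes = "š".encode("utf-8")
print(len(utf8_bytes), repr(utf8_bytes.decode("latin1")))
```

The second call is the situation from the bug report: the encode step raises `UnicodeEncodeError`, which is the error surfaced during signing.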

Now, in your case, what you want to do is choose a font of your liking (that supports the characters you need), and embed a subset of it. PyHanko implements that using a font engine called GlyphAccumulator. There's a fairly straightforward example in the docs.
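For reference, a rough sketch of what that setup can look like, loosely following the example in the pyHanko documentation. Treat the details as assumptions to check against the docs for your installed version: the helper name `build_unicode_stamp_style` and the stamp text are made up here, the font path is a placeholder you have to supply, and the OpenType support needs pyHanko's optional dependencies (installed via the `opentype` extra).

```python
def build_unicode_stamp_style(font_path):
    """Build a text stamp style backed by an embedded, subsetted font.

    font_path is a placeholder: point it at any TrueType/OpenType font
    file that covers the characters you need.
    """
    # Imported lazily so the sketch can be loaded without pyHanko installed.
    from pyhanko import stamp
    from pyhanko.pdf_utils import text
    from pyhanko.pdf_utils.font import opentype

    return stamp.TextStampStyle(
        # %(signer)s and %(ts)s are filled in by pyHanko at signing time.
        stamp_text="Podpisal: %(signer)s\nČas: %(ts)s",
        text_box_style=text.TextBoxStyle(
            # Use a GlyphAccumulator-based engine instead of SimpleFontEngine.
            font=opentype.GlyphAccumulatorFactory(font_path)
        ),
    )
```

The returned style would then be passed as the `stamp_style` argument when constructing a `signers.PdfSigner`, in place of the default style that relies on `SimpleFontEngine`.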

Under the hood, pyHanko will invoke HarfBuzz to handle shaping, and use that to translate your Unicode strings to PDF display operators ("regular" character encodings don't really enter into the equation). The font is then subsetted using fontTools and embedded into the file.

TL;DR: Text handling in PDF is complicated, and the output of SimpleFontEngine effectively can't handle non-Latin text. Use GlyphAccumulator instead.