Error when using HarfBuzz renderer

poetaman commented 3 years ago

Hi Josef,

Your package causes a Lua error when I use the HarfBuzz renderer with a non-Latin script. The error is related to the way the HarfBuzz renderer returns UTF-8 string data: if you change Renderer=HarfBuzz to Renderer=Node in the test case below, the error does not occur. That narrows the source of the problem down to nodetree's handling of UTF-8 strings from the HarfBuzz renderer. I have used Noto Sans Devanagari in the code snippet below; it is available for free from the Google Fonts GitHub repository. The bug is font-agnostic, though, so you should be able to reproduce it with any other OpenType Devanagari font. The issue is also script-agnostic: it is just a matter of time before it shows up in other scripts that do more complex glyph composition than Latin diacritics. As you will read below, there is a link that might lead you to a fix.

\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage[callback={}]{nodetree}

%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]

\begin{document}

\NodetreeRegisterCallback{hpack_filter}

\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devanagarifam एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}

\box0

\NodetreeUnregisterCallback{hpack_filter}

\end{document}

When compiled, this code fails with the following error:

warning  (hpack filter): error: ...texlive/2020/texmf-dist/tex/luatex/nodetree/
nodetree.lua:343: bad argument #1 to 'char' (invalid value)
.
l.13 ...ामूली मजदूर थे।}

?

You might be able to root-cause or fix the issue by reviewing the same bug, which I encountered in our code today and posted on Stack Exchange: HarfBuzz UTF-8 issue

I was debugging my own Lua code, which fails with an error similar to yours, so I am kinda stuck... nodetree is the only tool out there that arranges a TeX node list in a humane format :) Since I started using nodetree, I cannot imagine using anything else for TeX debugging; this is a great contribution to the community.

Thanks!

Josef-Friedrich commented 3 years ago

Thank you for your detailed bug report. Can you please try the latest version on the master branch?

The problem is caused by a character with a really high Unicode code point number: 1180097. In the properties section (properties: {['glyph_info'] = ें}) the character can be printed in the terminal, but not in the glyph section (GLYPH subtype: 256, char: 1180097). I don’t know why.


poetaman commented 3 years ago

Thanks Josef, I did test it with your new change! I wonder if that is because the glyph_info property is a "string" instead of a "char" (I mean it is never treated as a single "char"). Here's my analysis: either, in HarfBuzz mode, the "char" of a glyph node really is a "string" of consecutive "chars" (and this fact is known only to a few LuaHBTeX developers), or it's an unintended mistake in the way LuaHBTeX maintains the node list (which we have run into). Here's why I speculate so:

When you say the problem is caused by a character with a "really high code point number", it's actually a sequence of two individual Unicode characters, ' े' and ' ं'; there is no single character ' ें' in Unicode. The complex Devanagari letter in the TeX input (and rightly printed in the PDF) that these Unicode characters are part of is 'में'. In reality, 'में' is made of 3 Unicode characters/glyphs: 'म' + ' े' + ' ं', and that's what I expected to see in the node list. Indeed, if you run this code with Renderer=Node instead of Renderer=HarfBuzz and scroll to the part of the node list where 'में' is represented, you will see 3 separate glyphs, each with its respective Unicode character:

Renderer=Node output:

├─GLUE subtype: spaceskip, width: 2.6pt, stretch: 1.3pt, shrink: 0.87pt
├─GLYPH subtype: 256, char: 'म', width: 5.98pt, height: 6.22pt
│ ╚═  properties: {['2'] = 5, ['state'] = 1}
├─GLYPH subtype: 256, char: 'े', height: 8.96pt
│ ╚═  properties: {['2'] = 5}
├─GLYPH subtype: 256, char: 'ं', height: 8.3pt
│ ╚═  properties: {['2'] = 5}
├─GLUE subtype: spaceskip, width: 2.6pt, stretch: 1.3pt, shrink: 0.87pt

Renderer=HarfBuzz output:

├─GLUE subtype: spaceskip, width: 2.6pt, stretch: 1.3pt, shrink: 0.87pt
├─GLYPH subtype: 256, char: 'म', width: 5.98pt, height: 6.22pt
│ ╚═  properties: {['glyph_info'] = म}
├─GLYPH subtype: 256, char: 1180086, height: 8.96pt, depth: -6.15pt
│ ╚═  properties: {['glyph_info'] = ें}
├─GLUE subtype: spaceskip, width: 2.6pt, stretch: 1.3pt, shrink: 0.87pt

As you can see in the HarfBuzz output above, there is no way to tell that there are 3 Unicode glyphs/characters to operate upon, unless one assumes that the char field is actually a sequence of chars (a string) and operates on them individually (character by character). Processed that way, the Unicode code point of each individual character will be reasonable and accurate. For your reference and as an aid for testing, here are the actual Unicode code points of ' े' and ' ं': U+0947 and U+0902. Just for completeness, 'म' has the Unicode code point U+092E, though it won't be needed for debugging as it is a separate glyph in the node list (and rightly so).
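
As a sanity check on the code points quoted above, here is a minimal sketch in plain Lua 5.3+ (it runs outside LuaTeX and is not nodetree's code). The helper name `codepoints` is my own choice for illustration; it only relies on the standard `utf8.codes` iterator.

```lua
-- Split a UTF-8 string into its Unicode code points.
-- `codepoints` is a hypothetical helper name, not part of nodetree or LuaTeX.
local function codepoints(text)
  local result = {}
  -- utf8.codes yields (byte position, code point) for each character
  for _, cp in utf8.codes(text) do
    result[#result + 1] = string.format("U+%04X", cp)
  end
  return result
end

-- The two combining marks from the glyph_info string, written as escapes
local cluster = "\u{0947}\u{0902}"
print(table.concat(codepoints(cluster), " "))  --> U+0947 U+0902
```

So the glyph_info string decodes to two ordinary code points, while the number in the char field does not correspond to either of them.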

khaledhosny commented 3 years ago

LuaTeX’s node.char values are not really characters; they are font glyph indices that sometimes happen to match valid Unicode characters. The HarfBuzz shaper differentiates between glyph IDs and characters by adding 0x120000 to the glyph ID. I suggest never printing node.char as a character but always as a number.
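
A rough sketch of what that could look like in plain Lua follows. It assumes only the 0x120000 offset stated above; the function name `describe_char` and the output wording are hypothetical, and this is not how nodetree or luaotfload actually implement it.

```lua
-- Offset that, per the explanation above, the HarfBuzz shaper adds to glyph IDs.
local GLYPH_ID_OFFSET = 0x120000
-- Largest valid Unicode code point.
local MAX_UNICODE = 0x10FFFF

-- Describe a glyph node's `char` value as a number, never converting
-- an out-of-range value to a UTF-8 character.
-- `describe_char` is a hypothetical helper name for illustration.
local function describe_char(char)
  if char >= GLYPH_ID_OFFSET then
    return string.format("glyph id %d (char %d)", char - GLYPH_ID_OFFSET, char)
  elseif char >= 0 and char <= MAX_UNICODE then
    return string.format("U+%04X (char %d)", char, char)
  end
  return string.format("invalid (char %d)", char)
end

print(describe_char(1180097))  --> glyph id 449 (char 1180097)
print(describe_char(0x092E))   --> U+092E (char 2350)
```

Under that assumption, the value 1180097 from the error report is glyph ID 449 in the font rather than a Unicode character, which would explain why converting it to a character fails.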

To print the text for a given node, you need something like the glyph_info callback implementation in luaotfload (it works with any of the three modes). Empty glyph info means that the glyph is part of a larger cluster and nothing should be printed for it (the text for the full cluster is associated with its first glyph).

Josef-Friedrich commented 3 years ago

Thank you very much for this detailed clarification! I will do my very best.

poetaman commented 3 years ago

@Josef-Friedrich I just uncovered a bug in the LuaTeX Node renderer for a certain class of scripts. This was made possible by your tool and the temporary fix to nodetree for HarfBuzz mode! Thanks so much. @khaledhosny Your opinion/review might be beneficial for these two questions on TeX.SE: LuaTeX glyph reordering issue, and LuaTeX text extraction issue. Thanks!

khaledhosny commented 3 years ago

> @khaledhosny Your opinion/review might be beneficial for these two questions on TeX.SE: LuaTeX glyph reordering issue, and LuaTeX text extraction issue. Thanks!

Use the get_glyph_info() code I mentioned earlier and print its output for all nodes (if it prints nothing for a node, that is expected, see below); it should give you the textual representation of the nodes. The harfbuzz mode in luaotfload uses this to embed text in the PDF, and for messages that show a textual representation of the nodes, like over/underfull messages. It will not result in an error in other modes (that is what I meant by saying it works with all modes), but it might not give a proper text representation; that is a limitation of these modes.

It should be understood what the glyph nodes represent. Before processing by luaotfload they represent a one-to-one mapping of the input characters. After processing, they represent font glyphs with a potentially complicated relationship with the input characters.

The relation between input characters and output glyphs is many-to-many. An input character may be represented by one or more glyphs, an output glyph might represent one or more input characters, and in some cases (e.g. when there is reordering) a group of input characters is represented by a group of output glyphs. In the 2nd and 3rd cases, the first glyph node will have a glyph_info property with all the characters of the group, and subsequent glyph nodes in the group will have empty glyph_info properties.

It should also be noted that this mapping is not unique: the same glyph can represent different characters in different contexts, so getting such a mapping from the font data is unreliable. The only reliable way in harfbuzz mode is the node's glyph_info property; the other modes don’t retain this mapping.
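
To make the cluster rule concrete, here is a small sketch in plain Lua that uses mock tables instead of real LuaTeX glyph nodes. The function name `cluster_text`, the table layout, and the third entry (with its char value) are assumptions made up for illustration; real code would read the glyph_info property from the node's properties instead.

```lua
-- Rebuild the text of a run of glyphs from their glyph_info strings.
-- Glyphs with an empty (or missing) glyph_info belong to a cluster whose
-- text is already carried by the cluster's first glyph, so they are skipped.
-- `cluster_text` and the mock table layout are hypothetical.
local function cluster_text(glyphs)
  local parts = {}
  for _, glyph in ipairs(glyphs) do
    local info = glyph.glyph_info
    if info and info ~= "" then
      parts[#parts + 1] = info
    end
  end
  return table.concat(parts)
end

-- Mock data: the first two entries mirror the HarfBuzz output shown earlier
-- in this thread; the third is an invented follow-up glyph of a cluster.
local glyphs = {
  { char = 0x092E,  glyph_info = "\u{092E}" },          -- one glyph, one character
  { char = 1180086, glyph_info = "\u{0947}\u{0902}" },  -- one glyph, two characters
  { char = 1180000, glyph_info = "" },                  -- hypothetical: no own text
}

print(cluster_text(glyphs) == "\u{092E}\u{0947}\u{0902}")  --> true
```

The char values play no role in recovering the text here; only glyph_info is consulted.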

Josef-Friedrich commented 3 years ago

Could you please review the latest master? I decided to print the textual representation. At verbosity level two, the package additionally shows the char number in square brackets.

verbosity=1

├─GLYPH subtype: 256, char: थ, width: 6.42pt, height: 6.32pt
│ ╚═  properties: {['glyph_info'] = थ}

verbosity=2 (or greater)

├─GLYPH[29] no: 1456, subtype: 256, char: थ [2341], font: 34, width: 6.42pt, height: 6.32pt
│ ╚═  properties: {['glyph_info'] = थ}

poetaman commented 3 years ago

@Josef-Friedrich Yes, the issue reported in this bug is gone now! Thanks for the fix, and the thoughtful verbosity feature.

Josef-Friedrich / nodetree

Error when using HarfBuzz renderer #6