Smallcaps doesn't work with non-English characters

jp-larose commented 7 months ago

Asciidoctor PDF seems to ignore non-English characters when transforming a string to small caps.

At first I thought it was just accents, but I decided to check with Greek characters as well just to see, and they don't get the expected small caps treatment either.

Here's the relevant text in my .adoc file:

== Compétences Δενδρων Έϊϋόύώ

And the theme file:

heading-h2:
  text-transform: smallcaps

Compare this to MS Word's treatment:

Suggestion: Use built-in Unicode functions to find and transform lowercase characters to their uppercase counterparts.

I've not looked at the code, I haven't learned any Ruby, but I know that most programming languages have Unicode support.

Environment:

asciidoctor-pdf -v
Asciidoctor PDF 2.3.10 using Asciidoctor 2.0.20 [https://asciidoctor.org]
Runtime Environment (ruby 3.1.3p185 (2022-11-24 revision 1a6b16756e) [x64-mingw-ucrt]) (lc:IBM437 fs:UTF-8 in:UTF-8 ex:UTF-8)

mojavelinux commented 7 months ago

I've done what I'm willing to do, and what is reasonably supported by the built-in fonts. See https://github.com/asciidoctor/asciidoctor-pdf/issues/1192#issuecomment-1116710989 If you would like to see the smallcaps mapping extended to include support for other character sets, it's something you will need to contribute (with tests).

mojavelinux commented 7 months ago

Use built-in Unicode functions to find and transform lowercase characters to their uppercase counterparts.

It's not as simple as this. They're all uppercase characters, but from different parts of the Unicode plane. What you're looking for are smallcap uppercase counterparts, which is not something that Ruby's stdlib provides (and not well-defined in Unicode either).
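To illustrate the distinction, here is a minimal Ruby sketch (not taken from Asciidoctor PDF). `String#upcase` returns the regular full-height capital, whereas the small-caps form of an ASCII letter lives at a separate code point. The U+1D07 value for the small capital E is the one used in the converter example in this thread; as far as I know, Unicode has no precomposed small capital E with an acute accent.

```ruby
# Sketch only: Unicode case mapping yields a full-size capital,
# not a small capital.
full_cap  = 'e'.upcase   # "E" (U+0045)
small_cap = "\u1D07"     # LATIN LETTER SMALL CAPITAL E, a different code point

puts full_cap.ord.to_s(16)
puts small_cap.ord.to_s(16)

# Accented letters upcase as well, but the result is again a
# full-size capital (U+00C9), not a small capital.
accented = "\u00E9".upcase
puts accented.ord.to_s(16)
```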

Currently, Asciidoctor PDF maintains a mapping for a-z, and some of those characters were carefully chosen to increase the likelihood that they are found in the font. It's just not a simple matter.

If you'd like to enhance Asciidoctor PDF's behavior to add additional mappings, you can do so by overriding the smallcaps in an extended converter. For example:

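# Extend the registered PDF converter and override its smallcaps method.
class MyPDFConverter < (Asciidoctor::Converter.for 'pdf')
  register_for 'pdf'

  def smallcaps string
    # super maps a-z; then map é to small capital E (U+1D07) + combining acute accent (U+0301)
    super.gsub 'é', %(\u1d07\u0301)
  end
end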

However, you need to use a font that supports combining characters such as the combining acute accent. The built-in fonts don't include those, though I could perhaps add them at some point (see #2482).

If you need more assistance or ideas, please ask in the project chat at https://chat.asciidoctor.org.

jp-larose commented 6 months ago

I'm not invested enough to learn Ruby and all the required libraries just to fix this. I was curious, however, and had a peek at the code. It looks like you're using small caps in the font. I'm not sure how widely implemented that is across different fonts, but regardless, it seems that the available Unicode code points for those small caps are also anglocentric.

I suspect a more universal implementation of small caps would be font-independent, but still use Unicode features to determine and transform casing.

Roughly, I picture a solution to look like:

Iterate through the text that should be in smallcaps:
- If the character is a lowercase letter (c.is_lowercase() or something similar), then find its uppercase counterpart (c.to_uppercase() or something similar), and render it with the current font style but at a reduced size (e.g. 1ex of the current font size).
- If the character is a "combining diacritic mark" (a.k.a. accents), look ahead one character:
  - If that look ahead character is lowercase, render the current character (the accent) at a reduced size,
  - otherwise render the current character (the accent) at the normal size.
- Otherwise, render it without further transformation.

In theory, this should transform any script that distinguishes between upper- and lower-case, account for diacritic marks (a.k.a. accents) that are separate from their letters, and leave characters that have no lowercase counterparts unchanged. There may be some special cases I've not considered, but at least it's a start.
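The steps above could be sketched in Ruby roughly as follows. This is an illustration of the proposal only, not how Asciidoctor PDF works: `faux_smallcaps` and `SMALL_SCALE` are hypothetical names, and the 0.7 ratio is an arbitrary stand-in for "about 1ex". One deliberate deviation: Unicode places a combining mark after its base letter, so the sketch checks the preceding character rather than looking ahead.

```ruby
# Hypothetical scale factor standing in for "roughly the x-height".
SMALL_SCALE = 0.7

# Returns an array of [text, scale] pairs that a renderer could draw.
def faux_smallcaps(text)
  prev_was_lower = false
  text.each_char.map do |ch|
    if ch.match?(/\p{Ll}/)
      # Lowercase letter in any script: uppercase it and shrink it.
      prev_was_lower = true
      [ch.upcase, SMALL_SCALE]
    elsif ch.match?(/\p{Mn}/)
      # Combining mark: reuse the size of the base letter it follows.
      [ch, prev_was_lower ? SMALL_SCALE : 1.0]
    else
      # Anything else (capitals, digits, spaces) is left unchanged.
      prev_was_lower = false
      [ch, 1.0]
    end
  end
end

faux_smallcaps("Cafe\u0301").each { |ch, scale| puts "#{ch.inspect} #{scale}" }
```

Note that this only computes sizes; actually drawing scaled glyphs would still be up to the converter, and the result is scaled capitals, not true small-cap glyphs.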

Again, I'm not really invested in this. I just ended up removing smallcaps from the template I was using it in.

However, the current implementation of smallcaps is buggy, and in the interest of improving this project, I suggest leaving this issue open until someone comes in to fix it. That may be you, maybe a future version of me that really wants this, maybe someone else entirely.

mojavelinux commented 6 months ago

No, the current implementation is not buggy. It works exactly how it's designed to work and is thoroughly tested. (I'm happy to add a note to the documentation to make it clear which character ranges this transformation supports.) Making an accusation that it's buggy is not the way to get what you want. It's true that the implementation of smallcaps in Asciidoctor PDF is ASCII-centric. And I've made the argument (consistent with your observation) that it's because the smallcaps transformation is, itself, ASCII-centric. I know of no official mapping in Unicode outside of A-Z that actually describes how uppercase characters are mapped to x-height equivalents (small capitals). If there is one, you can point me to it. (Other software may provide bespoke behavior, but that doesn't make what Asciidoctor PDF does incorrect.)

Regardless, that still doesn't settle the fact that most fonts, including the bundled ones, don't provide glyphs of non-ASCII x-height uppercase characters (which typically require a second code point, as described above). So all the user would end up getting beyond the ASCII characters are missing glyph boxes anyway.

find it's uppercase counterpart (c.to_uppercase() or something similar), and render it with a with the current font style but at a reduced size (e.g. 1ex of the current font size).

This is bespoke behavior, not the smallcaps transformation. I've already explained how it is that you can add this bespoke behavior using an extended converter. (Alternately, you could apply a custom role in the AsciiDoc source itself to reduce the font size of select characters, perhaps combined with the uppercase transformation). If you're unwilling or unable to implement that bespoke behavior, that's not my responsibility as a maintainer.

Since I've explained that there's no other definition of smallcaps than the one used, I don't have any plans to pursue a change. And you, by your own admission, are not invested in this. Therefore, I will not keep this issue open. (I have, however, elaborated the documentation as promised, see https://docs.asciidoctor.org/pdf-converter/latest/theme/text/#transform).

mojavelinux commented 6 months ago

I realized that it's possible to call .unicode_normalize :nfd to rewrite a string so all characters with diacritical marks are written using a combining character. If we then add those combining characters to the bundled fonts, then the smallcaps transformation will automatically start working for words that use characters with diacritical marks (since the ASCII letter to transform has been split out in the string). Given this is a breaking change, it will need to be made in Asciidoctor PDF 3. However, I'm still going to add the combining characters to the bundled fonts in 2.3.x.
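A minimal sketch of that idea, assuming a deliberately partial mapping: `SMALLCAPS_MAP` and `smallcaps_nfd` are hypothetical names, and only the U+1D07 entry for "e" comes from this thread. The real converter's mapping covers a-z and is not reproduced here.

```ruby
# Hypothetical, partial mapping for illustration only.
SMALLCAPS_MAP = { 'e' => "\u1D07" }.freeze

def smallcaps_nfd(string)
  # NFD splits "é" (U+00E9) into "e" followed by U+0301 (combining acute accent).
  decomposed = string.unicode_normalize(:nfd)
  # Replace each ASCII lowercase letter that has a mapping; leave the rest as-is.
  decomposed.gsub(/[a-z]/) { |ch| SMALLCAPS_MAP.fetch(ch, ch) }
end

puts smallcaps_nfd("\u00E9") == "\u1D07\u0301"
```

After decomposition the accented letter yields the same two code points as the manual `gsub` override shown earlier in the thread, which is why the font must still contain the combining accent.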

asciidoctor / asciidoctor-pdf

Smallcaps doesn't work with non-English characters #2473