printf padding and wide East Asian characters

zopsicle commented 4 years ago

printf padding, such as %10s, is often used for printing text to a terminal in the form of tables. To determine how many spaces to pad the argument with, printf counts the number of graphemes in the argument. But terminals do not display all graphemes the same size. Wide East Asian characters are displayed exactly twice as wide as Latin characters are:

Screenshot of a terminal displaying East Asian text.

When aligning text in columns, this is quite important. A naive alignment implementation based on the number of graphemes will cause East Asian text to be offset by a factor of around two.
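A minimal Python sketch of the misalignment. Python's `%`-formatting pads by code point count rather than grapheme count, so it only stands in for the Raku behaviour described here; the sample strings are illustrative.

```python
# Pad two names to a 6-character cell and close each cell with a pipe.
rows = ["abc", "日本語"]
for name in rows:
    print("%-6s|" % name)

# Both padded cells contain exactly 6 characters. On a terminal that draws
# CJK ideographs two columns wide, the second pipe still lands 3 columns
# further right, because the 3 ideographs occupy 6 columns on their own.
padded = ["%-6s" % name for name in rows]
```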

Thus, printf may need to be updated to correctly align East Asian text, either by default or as an option. Beware that printf alignment may also be used in contexts other than terminal printing.

As it stands now, printf alignment is not useful in the typical setting (but may be in others) when East Asian text is a possible input.

zopsicle commented 4 years ago

Unicode does not say much about text rendering, but here is what I know:

There is a Unicode property East_Asian_Width that says whether a character is considered “wide” or “narrow”.
At least some terminals display wide characters at twice the width of narrow characters. I do not know if there are any standards mandating this.
There is a Perl module Unicode::GCString that has a routine called columns that uses the character properties of a grapheme to compute how many columns it would take up, so that alignment can be computed correctly.
The cmus terminal music player (see the screenshot above) correctly cuts off East Asian text, so it must also have some algorithm for determining how wide it is. This is possibly part of ncurses.
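A rough Python sketch of a column count derived from East_Asian_Width. The function name `columns` is borrowed from the Perl routine mentioned above, but this is not its algorithm: it assumes that wide (W) and fullwidth (F) characters take two cells and everything else takes one, and it ignores combining marks.

```python
import unicodedata

def columns(text: str) -> int:
    """Estimate terminal columns: 2 for East_Asian_Width W or F, else 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )
```

With this estimate, `columns("abc")` is 3 and `columns("日本語")` is 6, even though both strings contain three characters.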

AlexDaniel commented 4 years ago

Some possibly helpful notes:

samcv commented 4 years ago

As long as this doesn't affect standard %s usage, I am OK with this feature being added, provided the programmer knows that the output might not be reproducible between different Unicode versions.

vrurg commented 4 years ago

A couple of things we must not forget about:

*nix terminals are not the only output "devices" to be considered. We must keep in mind Windows consoles, pure text-only consoles, HTML or other markups, GUIs of different kinds too.
sprintf, where the number of graphemes is likely to be more important than how the resulting string looks in the output.

Can one predict how and where the formatted string is gonna be used? I'm afraid – not.

I'd say that what is needed here is some kind of a way to tell printf to consider wide characters as consuming two positions. Perhaps a named parameter would do the best job here. This way a programmer can be explicit about his intention.

Whether or not to use the parameter is up to the programmer himself. He might base the choice either on his own expectations, or use a 3rd party module which would help him determine if the output destination needs this named parameter. But this is most certainly not something for Raku to do. These kinds of heuristics are a moving target; it's not up to a language to shoot at it.
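A Python sketch of what such an opt-in could look like. The function `pad` and its keyword `wide` are hypothetical names chosen for illustration, not an existing or proposed Raku API; the width rule (W and F count as two) is likewise an assumption.

```python
import unicodedata

def pad(text: str, width: int, *, wide: bool = False) -> str:
    """Left-justify text in a cell of `width` positions.

    wide=False counts one position per character (the default behaviour);
    wide=True counts East_Asian_Width W/F characters as two positions.
    """
    if wide:
        used = sum(
            2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
            for ch in text
        )
    else:
        used = len(text)
    return text + " " * max(0, width - used)
```

Because the default stays at one position per character, callers that rely on character counts (for example when building fixed-length records) are unaffected unless they pass `wide=True`.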

niner commented 4 years ago

sprintf already uses formatting codes like %s including width and justification flags. So I'd either add a new flag for string output, or just use a different directive like %w for wide characters.

lizmat commented 4 years ago

@vrurg But what if the programmer is a woman? :-)

vrurg commented 4 years ago

BTW, is there a way to know a string length considering the wide characters as 2-chars? I think this kind of support should also be provided as it has many uses.

lizmat commented 4 years ago

@niner I would be against adding features to printf that are not POSIX compliant

vrurg commented 4 years ago

@lizmat my pardon, influences of Soviet-Russian-speaking background. ;)

lizmat commented 4 years ago

@vrurg that's why I've taken to always use "they" / "them" or "the programmer" :-)

jjatria commented 4 years ago

I think there's also a Raku module that does this, unless I'm misunderstanding the problem: Terminal::WCWidth

Maybe that's helpful to look at?

Kaiepi commented 4 years ago

Skimming over its implementation on OpenBSD, wcwidth determines the width of a character based on locale. IIRC locales are something we try to avoid depending on?

Terminal::WCWidth determines character width based on Unicode properties. If a character is in the Nonspacing_Mark category, then its width is 0. If its East_Asian_Width property is W, then its width is 2. I don't think this implementation is quite right, since fullwidth (F) characters should also be wide, and then there are characters of ambiguous (A) width that can sometimes be wide:

Ambiguous characters occur in East Asian legacy character sets as wide characters, but as narrow (i.e., normal-width) characters in non-East Asian usage. (Examples are the basic Greek and Cyrillic alphabet found in East Asian character sets, but also some of the mathematical symbols.) Private-use characters are considered ambiguous by default, because additional information is required to know whether they should be treated as wide or narrow.
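The rules discussed in this comment can be sketched in Python as follows. This is not the Terminal::WCWidth source; `char_width` and its `ambiguous_wide` switch are hypothetical names, and the sketch simply adds F and an optional A case to the two rules described above.

```python
import unicodedata

def char_width(ch: str, ambiguous_wide: bool = False) -> int:
    """Width of a single character in terminal cells (an estimate)."""
    if unicodedata.category(ch) == "Mn":      # Nonspacing_Mark: no cell
        return 0
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):                     # wide and fullwidth
        return 2
    if eaw == "A":                            # ambiguous: caller decides
        return 2 if ambiguous_wide else 1
    return 1
```

For example, Greek `α` is in the ambiguous class, so `char_width("α")` is 1 while `char_width("α", ambiguous_wide=True)` is 2.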

alabamenhu commented 4 years ago

I know I looked into this problem a while ago and I didn't come up with any great answer. The real problem is the ambiguous ones, which have no standard definition, and may result in different displays based on surrounding text (à la ambiguous LTR/RTL characters), terminal application, font (if a character isn't found, will the replacement character be considered wide too?), operating system, etc. I swear I remember using a Terminal once that printed wide characters as width 1.5.

I really wish that this were something that could be put into core, but I don't think that it can be reasonably done.

My recommendation would be to create a module that provides a drop-in replacement for printf that employs Unicode data, context information (wide on left/right/both/neither side, for ambiguous), OS, locale, and terminal systems to do the print. The latter three are why it needs to be out of core. You can see in Intl::LanguageTag some of the things I've done to detect the locales: support isn't universal and already has to approach some operating systems from multiple angles. Add in a few other factors and it gets messy fast.

It won't be super easy, but it will be possible to do with good-enough accuracy.¹ It would potentially be fairly easy to add new configurations when someone finds an issue by creating a decent test suite that prints out a bunch of characters to a terminal, terminating each line with a pipe, and counting offsets.
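A small Python sketch of the probe output such a test suite might print. The sample characters and the helper name `probe_lines` are illustrative assumptions; reading back where each pipe actually lands is terminal-specific and left out.

```python
# One representative character per width class (illustrative choices).
samples = {"narrow": "a", "wide": "語", "ambiguous": "α"}

def probe_lines(samples: dict, repeat: int = 10) -> list:
    """Build one line per sample: the character repeated, then a pipe."""
    return [f"{char * repeat}| {label}" for label, char in samples.items()]

for line in probe_lines(samples):
    print(line)
```

On a terminal that renders `語` two cells wide, the pipe on the second line appears at column 20 instead of column 10; where the pipe on the third line lands shows how that terminal treats ambiguous characters.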

Really, such a module should probably end up being a one-stop shop for a bunch of other terminal-related issues, because wide characters can create misaligned columns (col 5 on one row might be aligned with col 6 on another), and updating a wide cell to a narrow one or vice versa can shift the line.

samcv commented 4 years ago

I feel there are really two totally different problems here:

1. Trying to output text onto a grid (columns are an example)
2. Limiting the number of Unicode characters we pass (our renderer can handle too few characters fine. Too many characters break things)

Point 1 is clearly violating the purposes of Unicode, and is best handled in the text renderer. Raku should have the features needed for a module to attempt this.

Point 2 is much more generic. You can imagine many other cases where this could be useful. If we have a long text file and want to split it into pages, our printer can handle too few words ending up on the page, but too many and we will lose data. If we have a terminal and lines get cut off, or wrap to the next line, that is also an issue.
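A Python sketch of this second problem, under the assumption that overestimating width is acceptable and underestimating is not. The helper `fit` is a hypothetical name; it budgets every wide, fullwidth, and ambiguous character as two columns and cuts by code point, so it may separate a base character from a following combining mark.

```python
import unicodedata

def fit(text: str, max_columns: int) -> str:
    """Return the longest prefix whose estimated width is <= max_columns.

    W, F, and A characters are all budgeted as two columns: too few
    characters on a line is tolerable, too many is not.
    """
    used = 0
    for i, ch in enumerate(text):
        w = 2 if unicodedata.east_asian_width(ch) in ("W", "F", "A") else 1
        if used + w > max_columns:
            return text[:i]
        used += w
    return text
```

With a budget of 5 columns, `fit("日本語", 5)` keeps only `日本`, leaving one column unused rather than risking an overflow.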

So if we are going to solve this in Raku then we have to make sure we are only solving problem 2 and not problem 1.

Some other comments:

Unicode is clear that it is the text renderer that should deal with issues of character width. More recently, though (well, not that recently, but recently compared to the length of the Unicode project), and unless anything has changed, emoji have East Asian Width set, while originally they did not, to accommodate use of East Asian Width in contexts outside of the original purpose of the property.

Raku / problem-solving

printf padding and wide East Asian characters #171