combining emoji aren't correctly sized

dankamongmen / notcurses

blingful character graphics/TUI library. definitely not curses.

https://nick-black.com/dankwiki/index.php/Notcurses

Other

3.64k stars 115 forks source link

combining emoji aren't correctly sized #1329

Open dankamongmen opened 3 years ago

dankamongmen commented 3 years ago

We don't seem to get correct sizing for combining emoji. Take, for instance, "dark-toned woman mechanic" aka 👩🏾‍🔧 aka U+1F469 WOMAN + U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5 + U+200D ZERO WIDTH JOINER + U+1F527 WRENCH aka 👩+🏾‍+🔧. This shows up as three wide glyphs in xfce4-terminal and st, and a single wide glyph in kitty, so that's one thing to deal with (kitty is the correct one here). ncwidth follows dumbly along with classic wcwidth():

[schwarzgerat](0) $ ./ncwidth 👩🏾‍🔧
0x1f469: 2 👩    0x1f3fe: 2 🏾    0x0200d: 0 ‍    0x1f527: 2 🔧    

 total width: 6  total bytes: 15

👩🏾‍🔧
012345
[schwarzgerat](0) $

so we're going to count this as six columns wide in all cases. This is probably related to why we get the clear area of spillover in mojibake when running under kitty:

2021-02-04-212711_802x1417_scrot

so (assuming wcswidth() gets this wrong, also -- check that out), we need to either (a) start parsing up a bunch of extra Unicode data files ourselves (as the jquast/wcwidth project does, see https://github.com/jquast/wcwidth/issues/39), or (b) send patches to the relevant libcs (if this behavior is even compatible with the ANSI C wcswidth() definition), or (c) find a suitably fast wcswidth() alternative (ideally with a native C interface).

dankamongmen commented 3 years ago

I reached out to @jquast to see if he would find such a thing to be of use. I also reached out to Daniel Lemire, author of simdjson, to see whether he's done anything in this space -- I would likely use simdjson-pioneered techniques to implement this, if it comes down to me.

kovidgoyal commented 3 years ago

FYI kitty already has code to automatically generate wcswidth from the unicode standard. It's been on my todo list to factor that out into its own C library, but never been motivated enough. I cant say I think wcswidth is enough of a bottleneck to bother with SIMD though. But hey if you want to do that, I wont stop you :)

dankamongmen commented 3 years ago

i have been rudely surprised by the discovery that wcwidth() is a POSIX function, not an ANSI C one, which makes sense when you think about it, but is rather inconvenient. I'm not much impacted by its absence on GNU Hurd, but the lack of wcwidth() on Windows is going to be an official Problem.

on the other hand, that opens the gates for a portable, high-quality implementation. and this, unlike notcurses proper, ought feel free to reach into font tables and other grime if/as necessary.

additionally, this could carry information about glyphs which have different behavior in different terminals, and thus encourage use of a cursor location report. just throwing out ideas. either way, we're gonna need to do better on Windows than our current

#define wcwidth(x) 1

heh

jquast commented 3 years ago

I think we should do this. I saw a C-bindings alternative release this year and it makes my heart sink a little, the posix C bindings are so awful, but not so slow! The readme there shows timeit tests of 20x improvement. https://github.com/sebastinas/cwcwidth

I’d really like to try to make wcwidth generate C code by end of year and be a drop-in replacement for existing uses of the library. also, I’d like to introduce a new easier API function that doesn’t return -1, yikes.

dankamongmen commented 3 years ago

I think we should do this. I saw a C-bindings alternative release this year and it makes my heart sink a little, the posix C bindings are so awful, but not so slow! The readme there shows timeit tests of 20x improvement. https://github.com/sebastinas/cwcwidth

so i did some basic calculations assuming a straight up flat uint32_t-indexed O(1) data structure, the fastest thing possible (ignoring cache effects for now).

17 * 2^16 == 1114112 (1Mi + 64Ki), so less than 2MiB at a byte per codepoint
it would be nice to have some properties encoded along with the width
there are six bidi properties, but bidi's rare enough that maybe it's better to just have 1 bit devoted to "bidi-affected", and go look that up in another structure. these are otherwise independent, so you need at least 6 bits.
2 bits for numeric type (4 classifications)
4 types of boundary, not sure about the relation between them but probably somewhat independent
whitespace is a single bit
4 bits for punctuation classes

assuming an arbitrarily-aligned lookup table and 64-byte cachelines, you're gonna be able to load up all of ascii in 2 lines iff you do a byte per codepoint. the BMP alone would be 1Ki cachelines, probably blowing out a single core's L1. i don't think you can reasonably go below 3 bits for width, and i really think 4 is a better idea. so can you even get to a byte? not unless you really want to cut out other properties. so figure at least a byte, maybe 2 per codepoint.

but, we only actually use 5 of 17 planes, so with a O(1) int->int map there we get an offset for our page within a structure leaving out the 12 unused planes. now we're talking 640KiB at a byte per. now we're talking 5120 cache lines for all of unicode rather than 17408.

it would be best if we were not computing any of this table at startup, so that everything can be demand-paged, and ideally we never use much of the table at all.

kovidgoyal commented 3 years ago

Dont use a table. At least for width there are vast ranges of the space that all have the same width value. And the overwhelmingly common case is using simple ascii chars with width 1. So a switch with with a if for the ascii case is the most efficient implementation, given that branch prediction will rarely miss. This is actually true of most unicode properties so in kitty I just use switch with if for common cases for pretty much all unicode properties I care about.

dankamongmen commented 3 years ago

whatever is done, i want an API that i can feed a sequence of utf-8 or utf-32 and have segmentation occur along with column approximation, and furthermore it needs be reentrant in the sense that

i get a bit back telling me whether this is absolutely the end of the EGC (implying if not set that it matches the prefix of some longer EGC)
i can call with such a prefix plus new data (or ideally just the new data) and get the new combined EGC, or a new EGC based off the new data only, so another bit here (or api design around it, etc)