dankamongmen / notcurses

blingful character graphics/TUI library. definitely not curses.
https://nick-black.com/dankwiki/index.php/Notcurses
Other
3.64k stars 115 forks source link

combining emoji aren't correctly sized #1329

Open dankamongmen opened 3 years ago

dankamongmen commented 3 years ago

We don't seem to get correct sizing for combining emoji. Take, for instance, "dark-toned woman mechanic" aka πŸ‘©πŸΎβ€πŸ”§ aka U+1F469 WOMAN + U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5 + U+200D ZERO WIDTH JOINER + U+1F527 WRENCH aka πŸ‘©+πŸΎβ€+πŸ”§. This shows up as three wide glyphs in xfce4-terminal and st, and a single wide glyph in kitty, so that's one thing to deal with (kitty is the correct one here). ncwidth follows dumbly along with classic wcwidth():

[schwarzgerat](0) $ ./ncwidth πŸ‘©πŸΎβ€πŸ”§
0x1f469: 2 πŸ‘©    0x1f3fe: 2 🏾    0x0200d: 0 ‍    0x1f527: 2 πŸ”§    

 total width: 6  total bytes: 15

πŸ‘©πŸΎβ€πŸ”§
012345
[schwarzgerat](0) $ 

so we're going to count this as six columns wide in all cases. This is probably related to why we get the clear area of spillover in mojibake when running under kitty:

2021-02-04-212711_802x1417_scrot

so (assuming wcswidth() gets this wrong, also -- check that out), we need to either (a) start parsing up a bunch of extra Unicode data files ourselves (as the jquast/wcwidth project does, see https://github.com/jquast/wcwidth/issues/39), or (b) send patches to the relevant libcs (if this behavior is even compatible with the ANSI C wcswidth() definition), or (c) find a suitably fast wcswidth() alternative (ideally with a native C interface).

dankamongmen commented 3 years ago

I reached out to @jquast to see if he would find such a thing to be of use. I also reached out to Daniel Lemire, author of simdjson, to see whether he's done anything in this space -- I would likely use simdjson-pioneered techniques to implement this, if it comes down to me.

kovidgoyal commented 3 years ago

FYI kitty already has code to automatically generate wcswidth from the unicode standard. It's been on my todo list to factor that out into its own C library, but never been motivated enough. I cant say I think wcswidth is enough of a bottleneck to bother with SIMD though. But hey if you want to do that, I wont stop you :)

dankamongmen commented 3 years ago

i have been rudely surprised by the discovery that wcwidth() is a POSIX function, not an ANSI C one, which makes sense when you think about it, but is rather inconvenient. I'm not much impacted by its absence on GNU Hurd, but the lack of wcwidth() on Windows is going to be an official Problem.

on the other hand, that opens the gates for a portable, high-quality implementation. and this, unlike notcurses proper, ought feel free to reach into font tables and other grime if/as necessary.

additionally, this could carry information about glyphs which have different behavior in different terminals, and thus encourage use of a cursor location report. just throwing out ideas. either way, we're gonna need to do better on Windows than our current

#define wcwidth(x) 1

heh

jquast commented 3 years ago

I think we should do this. I saw a C-bindings alternative release this year and it makes my heart sink a little, the posix C bindings are so awful, but not so slow! The readme there shows timeit tests of 20x improvement. https://github.com/sebastinas/cwcwidth

I’d really like to try to make wcwidth generate C code by end of year and be a drop-in replacement for existing uses of the library. also, I’d like to introduce a new easier API function that doesn’t return -1, yikes.

dankamongmen commented 3 years ago

I think we should do this. I saw a C-bindings alternative release this year and it makes my heart sink a little, the posix C bindings are so awful, but not so slow! The readme there shows timeit tests of 20x improvement. https://github.com/sebastinas/cwcwidth

so i did some basic calculations assuming a straight up flat uint32_t-indexed O(1) data structure, the fastest thing possible (ignoring cache effects for now).

assuming an arbitrarily-aligned lookup table and 64-byte cachelines, you're gonna be able to load up all of ascii in 2 lines iff you do a byte per codepoint. the BMP alone would be 1Ki cachelines, probably blowing out a single core's L1. i don't think you can reasonably go below 3 bits for width, and i really think 4 is a better idea. so can you even get to a byte? not unless you really want to cut out other properties. so figure at least a byte, maybe 2 per codepoint.

but, we only actually use 5 of 17 planes, so with a O(1) int->int map there we get an offset for our page within a structure leaving out the 12 unused planes. now we're talking 640KiB at a byte per. now we're talking 5120 cache lines for all of unicode rather than 17408.

it would be best if we were not computing any of this table at startup, so that everything can be demand-paged, and ideally we never use much of the table at all.

kovidgoyal commented 3 years ago

Dont use a table. At least for width there are vast ranges of the space that all have the same width value. And the overwhelmingly common case is using simple ascii chars with width 1. So a switch with with a if for the ascii case is the most efficient implementation, given that branch prediction will rarely miss. This is actually true of most unicode properties so in kitty I just use switch with if for common cases for pretty much all unicode properties I care about.

dankamongmen commented 3 years ago

whatever is done, i want an API that i can feed a sequence of utf-8 or utf-32 and have segmentation occur along with column approximation, and furthermore it needs be reentrant in the sense that