Open dankamongmen opened 3 years ago
I reached out to @jquast to see if he would find such a thing to be of use. I also reached out to Daniel Lemire, author of simdjson, to see whether he's done anything in this space -- I would likely use simdjson-pioneered techniques to implement this, if it comes down to me.
FYI kitty already has code to automatically generate wcswidth from the unicode standard. It's been on my todo list to factor that out into its own C library, but I've never been motivated enough. I can't say I think wcswidth is enough of a bottleneck to bother with SIMD though. But hey if you want to do that, I won't stop you :)
i have been rudely surprised by the discovery that wcwidth() is a POSIX function, not an ANSI C one, which makes sense when you think about it, but is rather inconvenient. I'm not much impacted by its absence on GNU Hurd, but the lack of wcwidth() on Windows is going to be an official Problem.
on the other hand, that opens the gates for a portable, high-quality implementation. and this, unlike notcurses proper, ought feel free to reach into font tables and other grime if/as necessary.
additionally, this could carry information about glyphs which have different behavior in different terminals, and thus encourage use of a cursor location report. just throwing out ideas. either way, we're gonna need to do better on Windows than our current
#define wcwidth(x) 1
heh
I think we should do this. I saw a C-bindings alternative release this year and it makes my heart sink a little, the posix C bindings are so awful, but not so slow! The readme there shows timeit tests of 20x improvement. https://github.com/sebastinas/cwcwidth
I'd really like to try to make wcwidth generate C code by end of year and be a drop-in replacement for existing uses of the library. also, I'd like to introduce a new easier API function that doesn't return -1, yikes.
so i did some basic calculations assuming a straight up flat uint32_t-indexed O(1) data structure, the fastest thing possible (ignoring cache effects for now).
assuming an arbitrarily-aligned lookup table and 64-byte cachelines, you're gonna be able to load up all of ascii in 2 lines iff you do a byte per codepoint. the BMP alone would be 1Ki cachelines, probably blowing out a single core's L1. i don't think you can reasonably go below 3 bits for width, and i really think 4 is a better idea. so can you even get to a byte? not unless you really want to cut out other properties. so figure at least a byte, maybe 2 per codepoint.
but, we only actually use 5 of 17 planes, so with an O(1) int->int map there we get an offset for our page within a structure leaving out the 12 unused planes. now we're talking 320KiB at a byte per (640KiB at two). at a byte per, that's 5120 cache lines for all of unicode rather than 17408.
it would be best if we were not computing any of this table at startup, so that everything can be demand-paged, and ideally we never use much of the table at all.
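A rough sketch of the plane-indexed layout estimated above, mostly to check the arithmetic. Everything named here (`table_cachelines`, `plane_slot`, `width_table`, `table_width`) is hypothetical, the choice of which five planes get storage is a guess, and the table is left zero-filled where a real one would be generated from the Unicode data files.

```c
#include <stddef.h>
#include <stdint.h>

enum {
  CP_PER_PLANE = 0x10000, // codepoints in one Unicode plane
  CACHELINE = 64,         // bytes per cache line, as assumed above
  TOTAL_PLANES = 17,
  BACKED_PLANES = 5       // "5 of 17" from the estimate above
};

// cache lines needed to cover `planes` planes at `bytes_per_cp` bytes each
static inline size_t table_cachelines(size_t planes, size_t bytes_per_cp){
  return planes * CP_PER_PLANE * bytes_per_cp / CACHELINE;
}

// plane -> slot in the packed table, -1 for planes with no storage.
// ASSUMPTION: which five planes get storage (0, 1, 2, 3, 14) is a guess.
static const int8_t plane_slot[TOTAL_PLANES] = {
   0,  1,  2,  3, -1, -1, -1, -1, -1,
  -1, -1, -1, -1, -1,  4, -1, -1
};

// one byte per codepoint. static zero-filled storage, so nothing is computed
// at startup; on typical systems pages that are never read are never faulted
// in. a real table would be generated at build time, not left empty.
static uint8_t width_table[BACKED_PLANES][CP_PER_PLANE];

static inline int table_width(uint32_t cp){
  if(cp > 0x10FFFF){
    return -1; // not a Unicode codepoint
  }
  int slot = plane_slot[cp >> 16];
  if(slot < 0){
    return 1; // ASSUMPTION: default width for planes without storage
  }
  return width_table[slot][cp & 0xFFFF];
}
```

`table_cachelines(1, 1)` is the BMP at a byte per codepoint, `table_cachelines(5, 1)` the packed table, and `table_cachelines(17, 1)` the flat one. The lookup is two dependent loads (slot, then width), which is the price of leaving out the unused planes.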
Don't use a table. At least for width there are vast ranges of the space that all have the same width value. And the overwhelmingly common case is using simple ascii chars with width 1. So a switch with an if for the ascii case is the most efficient implementation, given that branch prediction will rarely miss. This is actually true of most unicode properties so in kitty I just use switch with if for common cases for pretty much all unicode properties I care about.
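For illustration, a minimal sketch of that shape: an `if` for printable ASCII, then a `switch` over ranges. This is not kitty's generated code. `switch_width` is a hypothetical name, the ranges are a small hand-picked sample rather than the full Unicode data, and the `case lo ... hi` syntax is a GCC/Clang extension, so it would need rewriting as range comparisons for other compilers.

```c
#include <stdint.h>

// returns columns for a codepoint, -1 for control characters
// (the wcwidth() convention). sample ranges only, NOT complete.
static inline int switch_width(uint32_t cp){
  if(cp >= 0x20 && cp < 0x7F){
    return 1; // printable ASCII: the overwhelmingly common case
  }
  switch(cp){
    case 0x0000:
      return 0;                 // NUL
    case 0x0001 ... 0x001F:     // C0 controls
    case 0x007F ... 0x009F:     // DEL and C1 controls
      return -1;
    case 0x0300 ... 0x036F:     // combining diacritical marks
    case 0x200B ... 0x200D:     // ZWSP, ZWNJ, ZWJ
      return 0;
    case 0x1100 ... 0x115F:     // Hangul Jamo initial consonants
    case 0x4E00 ... 0x9FFF:     // CJK unified ideographs
    case 0xAC00 ... 0xD7A3:     // Hangul syllables
    case 0x1F600 ... 0x1F64F:   // emoticons
      return 2;
    default:
      return 1;                 // everything this sample does not list
  }
}
```

With ranges this coarse the compiler typically turns the switch into a short chain of comparisons, and there is no table to pull into cache at all.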
whatever is done, i want an API that i can feed a sequence of utf-8 or utf-32 and have segmentation occur along with column approximation, and furthermore it needs be reentrant in the sense that
We don't seem to get correct sizing for combining emoji. Take, for instance, "dark-toned woman mechanic" aka 👩🏾‍🔧 aka U+1F469 WOMAN + U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5 + U+200D ZERO WIDTH JOINER + U+1F527 WRENCH aka 👩 + 🏾 + ZWJ + 🔧. This shows up as three wide glyphs in xfce4-terminal and st, and a single wide glyph in kitty, so that's one thing to deal with (kitty is the correct one here). ncwidth follows dumbly along with classic wcwidth(), so we're going to count this as six columns wide in all cases. This is probably related to why we get the clear area of spillover in mojibake when running under kitty.
So (assuming wcswidth() gets this wrong, also -- check that out), we need to either (a) start parsing up a bunch of extra Unicode data files ourselves (as the jquast/wcwidth project does, see https://github.com/jquast/wcwidth/issues/39), or (b) send patches to the relevant libcs (if this behavior is even compatible with the ANSI C wcswidth() definition), or (c) find a suitably fast wcswidth() alternative (ideally with a native C interface).
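To make the six-versus-two discrepancy concrete, here is a small sketch comparing a per-codepoint sum with a joiner-aware count. All names (`cp_width`, `naive_width`, `joined_width`) are hypothetical, `cp_width` only knows the codepoints in this example, and the joining rule is a crude stand-in for real grapheme segmentation, not what any libc or terminal actually implements.

```c
#include <stddef.h>
#include <stdint.h>

// ASSUMPTION: toy per-codepoint widths covering only this example
static int cp_width(uint32_t cp){
  if(cp == 0x200D){
    return 0; // ZERO WIDTH JOINER
  }
  if(cp == 0x1F469 || cp == 0x1F527){
    return 2; // WOMAN, WRENCH
  }
  if(cp >= 0x1F3FB && cp <= 0x1F3FF){
    return 2; // Fitzpatrick modifiers, wide when standing alone
  }
  return 1;
}

// what summing wcwidth() over each codepoint amounts to
static int naive_width(const uint32_t* s, size_t n){
  int cols = 0;
  for(size_t i = 0 ; i < n ; ++i){
    cols += cp_width(s[i]);
  }
  return cols;
}

// crude joiner-aware count: a modifier after a base, or anything right
// after a ZWJ, is folded into the preceding glyph and adds no columns
static int joined_width(const uint32_t* s, size_t n){
  int cols = 0;
  int after_zwj = 0;
  for(size_t i = 0 ; i < n ; ++i){
    const uint32_t cp = s[i];
    if(cp == 0x200D){
      after_zwj = 1;
      continue;
    }
    const int is_modifier = (cp >= 0x1F3FB && cp <= 0x1F3FF);
    if(!after_zwj && !(is_modifier && i > 0)){
      cols += cp_width(cp);
    }
    after_zwj = 0;
  }
  return cols;
}
```

Fed U+1F469 U+1F3FE U+200D U+1F527, `naive_width` counts the two emoji and the modifier at two columns each, while `joined_width` counts only the base glyph, which matches what kitty draws.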