Z4JC / ugrapheme

Unicode Extended grapheme clusters in nanoseconds
BSD Zero Clause License

PyICU example was wrong: PyICU iterators don't work on Python strings #1

Open behdad opened 1 month ago

behdad commented 1 month ago

On the front page, the code claiming ICU is buggy is doing it wrong. ICU returns indices as UTF-16 or UTF-8 indices, not "character" indices like Python expects. Here is a fix:

diff --git a/test.py b/test.py
index d040782..1318933 100644
--- a/test.py
+++ b/test.py
@@ -1,12 +1,13 @@
 import icu
 def iterate_breaks(text, break_iterator):
+    text = icu.UnicodeString(text)
     break_iterator.setText(text)
     lastpos = 0
     while True:
         next_boundary = break_iterator.nextBoundary()
         print(next_boundary)
         if next_boundary == -1: return
-        yield text[lastpos:next_boundary]
+        yield str(text[lastpos:next_boundary])
         lastpos = next_boundary
 bi = icu.BreakIterator.createCharacterInstance(icu.Locale.getRoot())
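
For context, the mismatch is easy to see without PyICU at all: Python indexes strings by code point, while ICU boundaries count UTF-16 code units, so any character outside the BMP shifts every later index. A minimal illustration (plain Python, no ICU involved):

```python
s = "a\U0001F469b"  # "aπŸ‘©b"; U+1F469 is outside the BMP

# Python slices by code point: the emoji is one "character".
assert len(s) == 3

# In UTF-16, U+1F469 needs a surrogate pair (two code units), so ICU
# would report the boundary after the emoji at offset 3, not 2.
utf16_units = len(s.encode("utf-16-le")) // 2
assert utf16_units == 4

# Slicing a Python str with the UTF-16 offset grabs too much:
assert s[:3] == "a\U0001F469b"  # ICU's offset, applied naively
assert s[:2] == "a\U0001F469"   # the intended prefix
```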
behdad commented 1 month ago

One can argue it's a PyICU misfortune.

Z4JC commented 1 month ago

@behdad Feeling pretty flattered that the author of harfbuzz checked my library out 🀩 TLDR: PyICU runs correctly after the fix, but now more than 8x slower.

Even more unfortunately, the PyICU Cheat Sheet linked to at the very end of the official PyICU README also makes this same mistake:

def iterate_breaks(text, break_iterator):
    break_iterator.setText(text)
    lastpos = 0
    while True:
        next_boundary = break_iterator.nextBoundary()
        if next_boundary == -1: return
        yield text[lastpos:next_boundary]
        lastpos = next_boundary

which I just noticed has also tripped up other users. Notice that their example of using iterate_breaks passes in a Python string, not an icu.UnicodeString.

After your fix, I get the correct output, thank you!

But now, because you have to instantiate an icu.UnicodeString instance for each string you pass in, and then also instantiate a native string instance for each grapheme, PyICU is 8.4 times slower instead of ~5 times:

In [3]: print(','.join(iterate_breaks("Hello πŸ‘©πŸ½β€πŸ”¬! πŸ‘©πŸΌβ€β€οΈβ€πŸ’‹β€πŸ‘¨πŸΎ ΰ€…ΰ€¨ΰ₯ΰ€šΰ₯ΰ€›ΰ₯‡ΰ€¦", bi)))
H,e,l,l,o, ,πŸ‘©πŸ½β€πŸ”¬,!, ,πŸ‘©πŸΌβ€β€οΈβ€πŸ’‹β€πŸ‘¨πŸΎ, ,ΰ€…,ΰ€¨ΰ₯,ΰ€šΰ₯ΰ€›ΰ₯‡,ΰ€¦

In [4]: %%timeit
   ...: list(iterate_breaks("Hello πŸ‘©πŸ½β€πŸ”¬! πŸ‘©πŸΌβ€β€οΈβ€πŸ’‹β€πŸ‘¨πŸΎ ΰ€…ΰ€¨ΰ₯ΰ€šΰ₯ΰ€›ΰ₯‡ΰ€¦", bi))
2.84 ΞΌs Β± 23.5 ns per loop (mean Β± std. dev. of 7 runs, 100,000 loops each)

In [5]: %%timeit
   ...: grapheme_split("Hello πŸ‘©πŸ½β€πŸ”¬! πŸ‘©πŸΌβ€β€οΈβ€πŸ’‹β€πŸ‘¨πŸΎ ΰ€…ΰ€¨ΰ₯ΰ€šΰ₯ΰ€›ΰ₯‡ΰ€¦")
337 ns Β± 4.1 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

I am guessing the PyICU folks forgot about this when they added auto-conversion of strings to/from the C++ side. Ideally, PyICU's iterators would just return indices into Python strings, but that would break existing users. It's probably easiest for them to add a new, Python-friendlier collection of iterators, which would also improve performance significantly.
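
If PyICU keeps returning UTF-16 offsets, one workaround (a sketch of the general idea, not a PyICU API) is to build the UTF-16-to-code-point offset table once per string and translate the boundaries, instead of paying for a str() conversion per grapheme:

```python
def utf16_to_str_offsets(text):
    """Map each UTF-16 code-unit offset to the corresponding Python str index."""
    table = {0: 0}
    u16 = 0
    for i, ch in enumerate(text):
        u16 += 2 if ord(ch) > 0xFFFF else 1  # surrogate pair beyond the BMP
        table[u16] = i + 1
    return table

# Example: translate an ICU-style boundary into a usable slice index.
text = "a\U0001F469b"
table = utf16_to_str_offsets(text)
assert table[3] == 2               # UTF-16 offset 3 (after the emoji) -> str index 2
assert text[:table[3]] == "a\U0001F469"
```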

Hoping you are finding the grapheme-splitting code amusing. It's around 300 lines of code (including whitespace and comments). It's in Cython, but it's basically C. In case you missed it, it ends at line 343 in ugrapheme.pyx. I think the approach is novel, simple, fast, and hopefully accurate.

behdad commented 1 month ago

@behdad Feeling pretty flattered that the author of harfbuzz checked my library out 🀩

Thanks for the kind words. :) Someone brought it up in a group I'm part of, and the claim that ICU is broken caught my attention. :)

I will be recommending your library if someone needs to build a text layout engine in Python.

TLDR: PyICU runs correctly after the fix, but now more than 8x slower.

Even more unfortunately, the PyICU Cheat Sheet linked to at the very end of the official PyICU README also makes this same mistake:

def iterate_breaks(text, break_iterator):
    break_iterator.setText(text)
    lastpos = 0
    while True:
        next_boundary = break_iterator.nextBoundary()
        if next_boundary == -1: return
        yield text[lastpos:next_boundary]
        lastpos = next_boundary

which I just noticed has also tripped up other users. Notice that their example of using iterate_breaks passes in a Python string, not an icu.UnicodeString.

We should report it to them at least.

I am guessing the PyICU folks forgot about this when they added auto-conversion of strings to/from the C++ side. Ideally, PyICU's iterators would just return indices into Python strings, but that would break existing users. It's probably easiest for them to add a new, Python-friendlier collection of iterators, which would also improve performance significantly.

I think they should change the current API to just do the right thing. The current behavior, working for the BMP but breaking beyond it, is the worst kind of Unicode bug, one that much code has been plagued with, all thanks to UTF-16... I'm fairly confident no one is using PyICU in the broken way and fixing up the indices manually afterward. So if anything, PyICU fixing the behavior now is very unlikely to break existing users; more likely, it will fix existing users.

At any rate, if you don't mind, please report it to them and link from here for others to find.

Hoping you are finding the grapheme-splitting code amusing. It's around 300 lines of code (including whitespace and comments). It's in Cython, but it's basically C. In case you missed it, it ends at line 343 in ugrapheme.pyx. I think the approach is novel, simple, fast, and hopefully accurate.

So, it's a hand-coded state machine, right? The closest I've seen to that is Pango's hand-coded grapheme code, but using big switch statements. That was horrible.

Your approach is nice, but a bit hard to verify correctness. In HarfBuzz, and elsewhere, I'm a huge fan of the Ragel state-machine generator tool. It's been indispensable. See, for example: https://github.com/google/emoji-segmenter
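
For readers unfamiliar with the table-driven style such generators emit: the machine is essentially an array indexed by (previous class, next class), which is what makes the output easy to audit against the grammar. A toy version covering only GB9 (no break before Extend or ZWJ) with a deliberately tiny, made-up Extend set; the real rules (GB11, GB12/13) need extra state beyond the previous character:

```python
# Character classes (toy subset of the UAX #29 property classes).
OTHER, EXTEND, ZWJ = 0, 1, 2

def classify(cp):
    if cp == 0x200D:                                   # ZERO WIDTH JOINER
        return ZWJ
    if 0x0300 <= cp <= 0x036F or 0x1F3FB <= cp <= 0x1F3FF:  # toy Extend set
        return EXTEND
    return OTHER

# BREAK[prev_class][next_class]: True means a grapheme boundary.
BREAK = [
    # next: OTHER  EXTEND  ZWJ
    [True,  False, False],  # prev OTHER   (GB9: no break before Extend/ZWJ)
    [True,  False, False],  # prev EXTEND
    [True,  False, False],  # prev ZWJ     (real GB11 would sometimes say False)
]

def split(text):
    out, start = [], 0
    for i in range(1, len(text)):
        if BREAK[classify(ord(text[i - 1]))][classify(ord(text[i]))]:
            out.append(text[start:i])
            start = i
    if text:
        out.append(text[start:])
    return out

# "e" + combining acute stays one cluster; plain letters split.
assert split("e\u0301a") == ["e\u0301", "a"]
```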

I also have a speedup idea for you: change your main switch, _grapheme_split_uint32, into a dictionary mapping state to callbacks:

actions = {
  RULE_BRK: rule_brk,
  RULE_CRLF: lambda cur, nxt: rule_crlf(),
  RULE_PRECORE: lambda cur, nxt: rule_precore(nxt),
  ...
}

cdef inline uint16_t _grapheme_split_uint32(
     uint16_t tran, uint32_t cur, uint32_t nxt) noexcept:
    cdef SplitRule rule = <SplitRule> (tran & 0xff)
    return actions.get(rule, lambda cur, nxt: 0)(cur, nxt)

Dunno if you can make that work efficiently in Cython. Also, you could make all your rule functions take the same arguments, to avoid the overhead of the `lambda`s. Hope that speeds up your code even further!
Z4JC commented 1 month ago

claim that ICU is broken caught my attention. :)

I was careful to only talk about _Py_ICU as opposed to ICU - I should have made that more explicit, although at least it baited you to take a look ;)

Will definitely report to PyICU, thanks! πŸ‘ πŸ‘

As far as the speedup ideas:

Now the question I always ask myself is: to what level can you push the abstraction before the optimizations I mentioned above simply do not happen under gcc, clang, or both? I tried to make the rules a simple transcription of the regex grammar in UAX #29, implemented using the most basic CPU operations, one codepoint at a time with a single 16-bit state... and then of course see what performance I can get out of all this. I understand that this can get in the way of provability, readability, etc., especially depending on what style the programmer audience is used to reading. But then again my brain still lives in pre-history and this project was just a side quest while coding on my trusty Amiga 500 running at 7.09 MHz with 1 MB of RAM.

That said, while it might make some sense to think about my code as a DFA, I like to look at it as a non-backtracking, non-ambiguous, 1-character-lookahead parser with productions (rules) and terminals. Upon encountering a terminal, instead of advancing by 1 character and continuing to consume, I return an identifier of the next production to be invoked (which can be the same production I was parsing). Given that there is no other state it maintains, you can think of it as a DFA, but I think of the uint16_t tran as a continuation to be invoked when I want to resume parsing, that is, when I want/can parse the next Unicode codepoint. The rule_ functions more or less faithfully follow Table 1b and Table 1c from UAX #29.
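
The continuation view described above can be sketched in a few lines (an illustrative toy implementing only GB3 and GB999, not the actual ugrapheme rules): each rule looks at the current and next code point and returns both the rule to resume with and whether to break:

```python
RULE_BRK, RULE_CRLF = 0, 1

def rule_brk(cur, nxt):
    # Default rule: break after cur (GB999), unless cur is CR and nxt is LF
    # (GB3), in which case continue with the CRLF rule without breaking.
    if cur == "\r" and nxt == "\n":
        return RULE_CRLF, False
    return RULE_BRK, True

def rule_crlf(cur, nxt):
    # We are sitting on the LF of a CR LF pair: always break after it.
    return RULE_BRK, True

RULES = (rule_brk, rule_crlf)

def split_graphemes(text):
    out, start, state = [], 0, RULE_BRK
    for i in range(len(text)):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        state, brk = RULES[state](text[i], nxt)
        if brk or nxt == "":
            out.append(text[start:i + 1])
            start = i + 1
    return out
```

The returned rule id plays the role of the uint16_t continuation: the driver loop holds no other state between code points.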

behdad commented 1 month ago

Thanks for the very enjoyable read! I overlooked that the if/else block was not running in the Python interpreter, even though I mentioned Cython. And thanks for all the pointers and information.

As for the DFA or recursive-descent parser or continuation, etc., I really love how the same code can be looked at from so many different angles as soon as you accept that code is data...