Z4JC / ugrapheme

Unicode Extended grapheme clusters in nanoseconds
BSD Zero Clause License

PyICU example was wrong: PyICU iterators don't work on Python strings #1

Open behdad opened 1 month ago

behdad commented 1 month ago

On the front page, the code claiming ICU is buggy is doing it wrong. ICU returns indices as UTF-16 or UTF-8 indices, not "character" indices like Python expects. Here is a fix:

diff --git a/test.py b/test.py
index d040782..1318933 100644
--- a/test.py
+++ b/test.py
@@ -1,12 +1,13 @@
 import icu
 def iterate_breaks(text, break_iterator):
+    text = icu.UnicodeString(text)
     break_iterator.setText(text)
     lastpos = 0
     while True:
         next_boundary = break_iterator.nextBoundary()
         print(next_boundary)
         if next_boundary == -1: return
-        yield text[lastpos:next_boundary]
+        yield str(text[lastpos:next_boundary])
         lastpos = next_boundary
 bi = icu.BreakIterator.createCharacterInstance(icu.Locale.getRoot())
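
For context, the mismatch is easy to see without PyICU at all: Python indexes strings by code point, while ICU boundaries count UTF-16 code units, so any character outside the BMP shifts every later index. A minimal illustration (plain Python, no ICU involved):

```python
s = "a\U0001F469b"  # "aπŸ‘©b"; U+1F469 is outside the BMP

# Python slices by code point: the emoji is one "character".
assert len(s) == 3

# In UTF-16, U+1F469 needs a surrogate pair (two code units), so ICU
# would report the boundary after the emoji at offset 3, not 2.
utf16_units = len(s.encode("utf-16-le")) // 2
assert utf16_units == 4

# Slicing a Python str with the UTF-16 offset grabs too much:
assert s[:3] == "a\U0001F469b"  # ICU's offset, applied naively
assert s[:2] == "a\U0001F469"   # the intended prefix
```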
behdad commented 1 month ago

One can argue it's a PyICU misfortune.

Z4JC commented 1 month ago

@behdad Feeling pretty flattered that the author of harfbuzz checked my library out 🀩 TLDR: PyICU runs correctly after the fix, but now more than 8x slower.

Even more unfortunately, the PyICU Cheat Sheet linked to at the very end of the official PyICU README also makes this same mistake:

def iterate_breaks(text, break_iterator):
    break_iterator.setText(text)
    lastpos = 0
    while True:
        next_boundary = break_iterator.nextBoundary()
        if next_boundary == -1: return
        yield text[lastpos:next_boundary]
        lastpos = next_boundary

which I just noticed has also tripped up other users. Notice that their example of using iterate_breaks passes in a Python string, not an icu.UnicodeString.

After your fix, I get the correct output, thank you!

But now, because you have to instantiate an icu.UnicodeString instance for each string you pass in, and then also instantiate a native string instance for each grapheme, PyICU is 8.4 times slower instead of ~5 times:

In [3]: print(','.join(iterate_breaks("Hello πŸ‘©πŸ½β€πŸ”¬! πŸ‘©πŸΌβ€β€οΈβ€πŸ’‹β€πŸ‘¨πŸΎ ΰ€…ΰ€¨ΰ₯ΰ€šΰ₯ΰ€›ΰ₯‡ΰ€¦", bi)))
H,e,l,l,o, ,πŸ‘©πŸ½β€πŸ”¬,!, ,πŸ‘©πŸΌβ€β€οΈβ€πŸ’‹β€πŸ‘¨πŸΎ, ,ΰ€…,ΰ€¨ΰ₯,ΰ€šΰ₯ΰ€›ΰ₯‡,ΰ€¦

In [4]: %%timeit
   ...: list(iterate_breaks("Hello πŸ‘©πŸ½β€πŸ”¬! πŸ‘©πŸΌβ€β€οΈβ€πŸ’‹β€πŸ‘¨πŸΎ ΰ€…ΰ€¨ΰ₯ΰ€šΰ₯ΰ€›ΰ₯‡ΰ€¦", bi))
2.84 ΞΌs Β± 23.5 ns per loop (mean Β± std. dev. of 7 runs, 100,000 loops each)

In [5]: %%timeit
   ...: grapheme_split("Hello πŸ‘©πŸ½β€πŸ”¬! πŸ‘©πŸΌβ€β€οΈβ€πŸ’‹β€πŸ‘¨πŸΎ ΰ€…ΰ€¨ΰ₯ΰ€šΰ₯ΰ€›ΰ₯‡ΰ€¦")
337 ns Β± 4.1 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

I am guessing the PyICU folks forgot about this when they added auto-conversion of strings to/from the C++ side. Ideally, PyICU's iterators would just return indices into Python strings, but that would break existing users. It's probably easiest for them to add a new, Python-friendlier collection of iterators, which would also improve performance significantly.
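
If PyICU keeps returning UTF-16 offsets, one workaround (a sketch of the general idea, not a PyICU API) is to build the UTF-16-to-code-point offset table once per string and translate the boundaries, instead of paying for a str() conversion per grapheme:

```python
def utf16_to_str_offsets(text):
    """Map each UTF-16 code-unit offset to the corresponding Python str index."""
    table = {0: 0}
    u16 = 0
    for i, ch in enumerate(text):
        u16 += 2 if ord(ch) > 0xFFFF else 1  # surrogate pair beyond the BMP
        table[u16] = i + 1
    return table

# Example: translate an ICU-style boundary into a usable slice index.
text = "a\U0001F469b"
table = utf16_to_str_offsets(text)
assert table[3] == 2               # UTF-16 offset 3 (after the emoji) -> str index 2
assert text[:table[3]] == "a\U0001F469"
```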

Hoping you are finding the grapheme-splitting code amusing. It's around 300 lines of code (including whitespace and comments). It's in Cython, but it's basically C. In case you missed it, it ends at line 343 in ugrapheme.pyx. I think the approach is novel, simple, fast, and hopefully accurate.

behdad commented 1 month ago

@behdad Feeling pretty flattered that the author of harfbuzz checked my library out 🀩

Thanks for the kind words. :) Someone brought it up in a group I'm part of, and the claim that ICU is broken caught my attention. :)

I will be recommending your library if someone needs to build a text layout engine in Python.

TLDR: PyICU runs correctly after the fix, but now more than 8x slower.

Even more unfortunately, the PyICU Cheat Sheet linked to at the very end of the official PyICU README also makes this same mistake:

def iterate_breaks(text, break_iterator):
    break_iterator.setText(text)
    lastpos = 0
    while True:
        next_boundary = break_iterator.nextBoundary()
        if next_boundary == -1: return
        yield text[lastpos:next_boundary]
        lastpos = next_boundary

which I just noticed has also tripped up other users. Notice that their example of using iterate_breaks passes in a Python string, not an icu.UnicodeString.

We should report it to them at least.

I am guessing the PyICU folks forgot about this when they added auto-conversion of strings to/from the C++ side. Ideally, PyICU's iterators would just return indices into Python strings, but that would break existing users. It's probably easiest for them to add a new, Python-friendlier collection of iterators, which would also improve performance significantly.

I think they should change the current API to just do the right thing. The current behavior, working for the BMP but breaking beyond it, is the worst kind of Unicode bug, one that much code has been plagued with, all thanks to UTF-16... I'm fairly confident no one is using PyICU in the broken way and fixing up the indices manually afterward. So if anything, PyICU fixing the behavior now is very unlikely to break existing users; more likely, it will fix existing users.

At any rate, if you don't mind, please report it to them and link from here for others to find.

Hoping you are finding the grapheme-splitting code amusing. It's around 300 lines of code (including whitespace and comments). It's in Cython, but it's basically C. In case you missed it, it ends at line 343 in ugrapheme.pyx. I think the approach is novel, simple, fast, and hopefully accurate.

So, it's a hand-coded state machine, right? The closest I've seen to that is Pango's hand-coded grapheme code, but using big switch statements. That was horrible.

Your approach is nice, but a bit hard to verify correctness. In HarfBuzz, and elsewhere, I'm a huge fan of the Ragel state-machine generator tool. It's been indispensable. See, for example: https://github.com/google/emoji-segmenter
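
For readers unfamiliar with the table-driven style such generators emit: the machine is essentially an array indexed by (previous class, next class), which is what makes the output easy to audit against the grammar. A toy version covering only GB9 (no break before Extend or ZWJ) with a deliberately tiny, made-up Extend set; the real rules (GB11, GB12/13) need extra state beyond the previous character:

```python
# Character classes (toy subset of the UAX #29 property classes).
OTHER, EXTEND, ZWJ = 0, 1, 2

def classify(cp):
    if cp == 0x200D:                                   # ZERO WIDTH JOINER
        return ZWJ
    if 0x0300 <= cp <= 0x036F or 0x1F3FB <= cp <= 0x1F3FF:  # toy Extend set
        return EXTEND
    return OTHER

# BREAK[prev_class][next_class]: True means a grapheme boundary.
BREAK = [
    # next: OTHER  EXTEND  ZWJ
    [True,  False, False],  # prev OTHER   (GB9: no break before Extend/ZWJ)
    [True,  False, False],  # prev EXTEND
    [True,  False, False],  # prev ZWJ     (real GB11 would sometimes say False)
]

def split(text):
    out, start = [], 0
    for i in range(1, len(text)):
        if BREAK[classify(ord(text[i - 1]))][classify(ord(text[i]))]:
            out.append(text[start:i])
            start = i
    if text:
        out.append(text[start:])
    return out

# "e" + combining acute stays one cluster; plain letters split.
assert split("e\u0301a") == ["e\u0301", "a"]
```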

I also have a speedup idea for you: change your main switch, _grapheme_split_uint32, into a dictionary mapping state to callbacks:

actions = {
  RULE_BRK: rule_brk,
  RULE_CRLF: lambda cur, nxt: rule_crlf(),
  RULE_PRECORE: lambda cur, nxt: rule_precore(nxt),
  ...
}

cdef inline uint16_t _grapheme_split_uint32(
     uint16_t tran, uint32_t cur, uint32_t nxt) noexcept:
    cdef SplitRule rule = <SplitRule> (tran & 0xff)
    return actions.get(rule, lambda cur, nxt: 0)(cur, nxt)

Dunno if you can make that work efficiently in Cython. Also, you could make all your rule functions take the same arguments, to avoid the overhead of the `lambda`s. Hope that speeds up your code even further!
Z4JC commented 1 month ago

claim that ICU is broken caught my attention. :)

I was careful to only talk about _Py_ICU as opposed to ICU - I should have made that more explicit, although at least it baited you to take a look ;)

Will definitely report to PyICU, thanks! πŸ‘ πŸ‘

As far as the speedup ideas:

Now the question I always ask myself is: to what level can you push the abstraction before the optimizations I mentioned above simply do not happen under gcc, clang, or both? I tried to make the rules a simple transcription of the regex grammar in UAX #29, implemented using the most basic CPU operations, one codepoint at a time with a single 16-bit state... and then of course see what performance I can get out of all this. I understand that this can get in the way of provability, readability, etc., especially depending on what style the programmer audience is used to reading. But then again my brain still lives in pre-history and this project was just a side quest while coding on my trusty Amiga 500 running at 7.09 MHz with 1 MB of RAM.

That said, while it might make some sense to think about my code as a DFA, I like to look at it as a non-backtracking, non-ambiguous, 1-character-lookahead parser with productions (rules) and terminals. Upon encountering a terminal, instead of advancing by 1 character and continuing to consume, I return an identifier of the next production to be invoked (which can be the same production I was parsing). Given that there is no other state it maintains, you can think of it as a DFA, but I think of the uint16_t tran as a continuation to be invoked when I want to resume parsing, that is, when I want/can parse the next Unicode codepoint. The rule_ functions more or less faithfully follow Table 1b and Table 1c from UAX #29.
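
The continuation view described above can be sketched in a few lines (an illustrative toy implementing only GB3 and GB999, not the actual ugrapheme rules): each rule looks at the current and next code point and returns both the rule to resume with and whether to break:

```python
RULE_BRK, RULE_CRLF = 0, 1

def rule_brk(cur, nxt):
    # Default rule: break after cur (GB999), unless cur is CR and nxt is LF
    # (GB3), in which case continue with the CRLF rule without breaking.
    if cur == "\r" and nxt == "\n":
        return RULE_CRLF, False
    return RULE_BRK, True

def rule_crlf(cur, nxt):
    # We are sitting on the LF of a CR LF pair: always break after it.
    return RULE_BRK, True

RULES = (rule_brk, rule_crlf)

def split_graphemes(text):
    out, start, state = [], 0, RULE_BRK
    for i in range(len(text)):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        state, brk = RULES[state](text[i], nxt)
        if brk or nxt == "":
            out.append(text[start:i + 1])
            start = i + 1
    return out
```

The returned rule id plays the role of the uint16_t continuation: the driver loop holds no other state between code points.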

behdad commented 1 month ago

Thanks for the very enjoyable read! I overlooked that the if/else block was not running in the Python interpreter, even though I mentioned Cython. And thanks for all the pointers and information.

As for the DFA or recursive-descent parser or continuation, etc., I really love how the same code can be looked at from so many different angles as soon as you accept that code is data...