Open behdad opened 1 month ago
One can argue it's a PyICU misfortune.
@behdad Feeling pretty flattered that the author of HarfBuzz checked my library out 🤩

TLDR: PyICU runs correctly after the fix, but now more than 8x slower.
Even more unfortunately, the PyICU Cheat Sheet linked to at the very end of the official PyICU README also makes this same mistake:
def iterate_breaks(text, break_iterator):
    break_iterator.setText(text)
    lastpos = 0
    while True:
        next_boundary = break_iterator.nextBoundary()
        if next_boundary == -1: return
        yield text[lastpos:next_boundary]
        lastpos = next_boundary
which I just noticed also tripped up other users. Notice that their example of using iterate_breaks passes in a Python string, not an icu.UnicodeString.
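As a concrete illustration of the failure mode, feeding iterate_breaks above a Python string with a single non-BMP character is enough to see the drift, because the reported boundaries are UTF-16 code-unit offsets (a hypothetical snippet, not from the cheat sheet):

import icu

# '🤖' (U+1F916) is one Python character but two UTF-16 code units, so the
# character-break boundaries come back as 1, 3, 4 and no longer line up with
# Python str indices.
bi = icu.BreakIterator.createCharacterInstance(icu.Locale())
print(list(iterate_breaks("a🤖b", bi)))   # likely ['a', '🤖b', ''] rather than ['a', '🤖', 'b']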
After your fix, I get the correct output, thank you! But now, because you have to instantiate an icu.UnicodeString instance for each string you pass in and then also instantiate a native string instance for each grapheme, PyICU is 8.4 times slower instead of ~5 times:
In [3]: print(','.join(iterate_breaks("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद", bi)))
H,e,l,l,o, ,👩🏽‍🔬,!, ,👩🏼‍❤️‍💋‍👨🏾, ,अ,नु,च्छे,द
In [4]: %%timeit
   ...: list(iterate_breaks("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद", bi))
2.84 µs ± 23.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [5]: %%timeit
   ...: grapheme_split("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद")
337 ns ± 4.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
I am guessing the PyICU folks forgot about this when they added auto-conversion of strings to/from the C++ side. Ideally, PyICU's iterators would just return indices into Python strings, but this would break existing users. Probably the easiest thing for them is to add a new, Python-friendlier collection of iterators, which would also improve performance significantly.
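As a rough illustration of what returning indices into Python strings could look like, here is a hypothetical helper (not part of PyICU) that remaps the UTF-16 code-unit offsets the current iterators report onto Python str indices:

def utf16_offsets_to_str_indices(text, offsets):
    """Map UTF-16 code-unit offsets (as ICU reports them) to Python str indices."""
    mapping = {}
    u16 = 0
    for i, ch in enumerate(text):
        mapping[u16] = i
        u16 += 2 if ord(ch) > 0xFFFF else 1   # astral characters take two code units
    mapping[u16] = len(text)                  # accept the end-of-text offset as well
    return [mapping[off] for off in offsets]

With the boundaries remapped this way, slicing the original Python string as in the cheat-sheet code would line up again.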
Hoping you are finding the grapheme-splitting code amusing. It's around 300 lines of code (including white space and comments). It's in Cython, but basically it's C. In case you missed it, it ends at line 343 in ugrapheme.pyx. I think the approach is novel, simple, fast and hopefully accurate.
> @behdad Feeling pretty flattered that the author of HarfBuzz checked my library out 🤩
Thanks for the kind words. :) Someone brought it up in a group I'm part of, and the claim that ICU is broken caught my attention. :)
I will be recommending your library if someone needs to build a text layout engine in Python.
> TLDR: PyICU runs correctly after the fix, but now more than 8x slower.
> Even more unfortunately, the PyICU Cheat Sheet linked to at the very end of the official PyICU README also makes this same mistake:
>
>     def iterate_breaks(text, break_iterator):
>         break_iterator.setText(text)
>         lastpos = 0
>         while True:
>             next_boundary = break_iterator.nextBoundary()
>             if next_boundary == -1: return
>             yield text[lastpos:next_boundary]
>             lastpos = next_boundary
>
> which I just noticed also tripped up other users. Notice that their example of using iterate_breaks passes in a Python string, not an icu.UnicodeString.
We should report it to them at least.
> I am guessing the PyICU folks forgot about this when they added auto-conversion of strings to/from the C++ side. Ideally, PyICU's iterators would just return indices into Python strings, but this would break existing users. Probably the easiest thing for them is to add a new, Python-friendlier collection of iterators, which would also improve performance significantly.
I think they should change the current API to just do the right thing. The current behavior, where it works for the BMP but is broken beyond the BMP, is the worst kind of Unicode bug, one that much code has been plagued with, all thanks to UTF-16... I'm fairly confident no one is using PyICU in the broken way and fixing up the indices manually afterwards. So if anything, PyICU fixing the behavior now is very unlikely to break existing users; more likely, it will fix existing users.
At any rate, if you don't mind, please report it to them and link from here for others to find.
> Hoping you are finding the grapheme-splitting code amusing. It's around 300 lines of code (including white space and comments). It's in Cython, but basically it's C. In case you missed it, it ends at line 343 in ugrapheme.pyx. I think the approach is novel, simple, fast and hopefully accurate.
So, it's a hand-coded state machine, right? The closest I've seen to that is Pango's hand-coded grapheme code, but using big switch statements. That was horrible.
Your approach is nice, but a bit hard to verify for correctness. In HarfBuzz, and elsewhere, I'm a huge fan of the Ragel state-machine generator tool. It's been indispensable. See, for example: https://github.com/google/emoji-segmenter
I also have a speedup idea for you: change your main switch, _grapheme_split_uint32, into a dictionary mapping state to callbacks:
actions = {
    RULE_BRK: rule_brk,
    RULE_CRLF: lambda cur, nxt: rule_crlf(),
    RULE_PRECORE: lambda cur, nxt: rule_precore(nxt),
    ...
}

cdef inline uint16_t _grapheme_split_uint32(
        uint16_t tran, uint32_t cur, uint32_t nxt) noexcept:
    cdef SplitRule rule = <SplitRule> (tran & 0xff)
    return actions.get(rule, lambda cur, nxt: 0)(cur, nxt)
Donno if you can make that work efficiently with Cython. Also, you can make all your rule functions take the same arguments, to avoid the overhead of the `lambda`s. Hope that speeds up your code even further!
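If it helps, here is a plain-Python sketch of that last point, with every rule taking the same (cur, nxt) pair so the table can hold the functions directly and no lambda wrappers are needed (rule names, ids and bodies are made up):

RULE_BRK, RULE_CRLF, RULE_PRECORE = 0, 1, 2   # hypothetical rule identifiers

# Uniform signatures: each rule accepts (cur, nxt) even if it ignores them.
def rule_brk(cur, nxt): return 0
def rule_crlf(cur, nxt): return 0
def rule_precore(cur, nxt): return 0

actions = {
    RULE_BRK: rule_brk,
    RULE_CRLF: rule_crlf,
    RULE_PRECORE: rule_precore,
}

def grapheme_split_dispatch(tran, cur, nxt):
    return actions.get(tran & 0xff, rule_brk)(cur, nxt)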
> the claim that ICU is broken caught my attention. :)
I was careful to only talk about _Py_ICU as opposed to ICU - I should have made that more explicit, although at least it baited you to take a look ;)
Will definitely report to PyICU, thanks!
As far as the speedup ideas:

- Cython will take the if/elif form and convert it into a switch, as optimize.use_switch = True by default (see here). To convince yourself, you can git clone the tree, run python setup.py build_ext --inplace and check out the generated ugrapheme.c.
- All the rule_ functions carry the inline keyword, meaning they will be compiled as static inside the C module and thus invisible outside the translation unit. This makes them ripe for inlining, and indeed this happens. You can verify with objdump -S on the .so file that all these rule_ functions actually get inlined into the grapheme_split_uint32 function (or just read the generated ugrapheme.c). I elided cur and/or nxt on some calls to help readability (the rules indeed do not always need context), and for the smaller rules the generated, optimized inline code does away with any notion of a function call, arguments or return value in the actual generated assembly.
- The brk, crlf, precore, prepend, postcore, ri, and emoji_* rules/productions are going to be completely inlined, while hangul will be taken care of in a separate function, as will all of the Devanagari consonant rules, which get merged into another separate combined function. All of the is_ predicates will be inlined as well. The "hinting" I would do with some hand-rolled jump tables would just get in the way and slow GCC/Clang down in doing their thing (as discussed in the articles I linked above).

Now the question I always ask myself is: to what level can you push the abstraction before the optimizations I mentioned above simply do not happen under gcc, clang or both? I tried to make the rules a simple transcription of the regex grammar in UAX #29, implemented using the most basic CPU operations, one codepoint at a time with a single 16-bit state... and then of course see what performance I can get out of all this. I understand that this can get in the way of provability, readability, etc., especially depending on what the programmer audience is used to reading. But then again my brain still lives in pre-history and this project was just a side quest while coding on my trusty Amiga 500 running at 7.09 MHz with 1 MB of RAM.

That said, while it might make some sense to think about my code as a DFA, I like to look at it as a non-backtracking, non-ambiguous, one-character-lookahead parser with productions (rules) and terminals. Upon encountering a terminal, instead of advancing by one character and continuing to consume, I return with an identifier of the next production to be invoked (which can be the same production I was parsing). Given that there is no other state it maintains, you can think of it as a DFA, but I think of the uint16_t tran as a continuation to be invoked when I want to resume parsing, that is, when I want/can parse the next Unicode codepoint. The rule_ functions more or less faithfully follow Table 1b and Table 1c from UAX #29.
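To make the continuation idea concrete, here is a deliberately simplified, plain-Python sketch of that driver shape; the rule names, the break flag and the regional-indicator-only logic are invented for illustration and do not match ugrapheme.pyx:

# Toy continuation-style driver: one codepoint of lookahead, a single 16-bit
# state, and the low byte of that state picks the production to resume with.

RULE_ANY, RULE_RI_PAIR = 0, 1     # hypothetical production identifiers
FLAG_NO_BREAK = 0x100             # hypothetical "no break between cur and nxt" bit

def is_ri(cp):
    return 0x1F1E6 <= cp <= 0x1F1FF    # regional indicator symbols

def rule_any(cur, nxt):
    # GB12/GB13-like: glue the first regional indicator of a pair to the second.
    if is_ri(cur) and is_ri(nxt):
        return RULE_RI_PAIR | FLAG_NO_BREAK
    return RULE_ANY

def rule_ri_pair(cur, nxt):
    # cur closed an RI pair: always break here; a following RI starts a new pair.
    return RULE_ANY

RULES = (rule_any, rule_ri_pair)

def break_before(text):
    """Yield True where a break falls between text[i] and text[i+1] (toy logic)."""
    tran = RULE_ANY
    for cur, nxt in zip(text, text[1:]):
        tran = RULES[tran & 0xFF](ord(cur), ord(nxt))
        yield not (tran & FLAG_NO_BREAK)

For "🇺🇸🇫🇷" (four regional indicators) this yields [False, True, False]: each flag stays in one piece and the only break lands between the two flags.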
Thanks for the very enjoyable read! I overlooked that the if/else block was not running on the Python interpreter, even though I mentioned Cython. And thanks for all the pointers and information.
As for the DFA or recursive-descent parser or continuation, etc., I really love how the same code can be looked at from so many different angles as soon as you accept that code is data...
On the front page, the code claiming ICU is buggy is doing it wrong. ICU returns indices as UTF-16 or UTF-8 indices, not "character" indices like Python expects. Here is a fix:
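A minimal sketch of such a fix, assuming icu.UnicodeString supports slicing and str() conversion:

import icu

def iterate_breaks(text, break_iterator):
    # Hand the break iterator an icu.UnicodeString, so the UTF-16 code-unit
    # boundaries it reports index into the very object we slice.
    utext = icu.UnicodeString(text)
    break_iterator.setText(utext)
    lastpos = 0
    while True:
        next_boundary = break_iterator.nextBoundary()
        if next_boundary == -1:
            return
        # Convert each UnicodeString slice back to a native Python str.
        yield str(utext[lastpos:next_boundary])
        lastpos = next_boundary

This is also where the cost described earlier in the thread comes from: one UnicodeString per input string and one native str per grapheme.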