adah1972 / libunibreak

The libunibreak library
zlib License

Optimize LB30 implementation #43

Status: Closed (bbshelper closed this 3 months ago)

bbshelper commented 3 months ago

Looking up the East Asian width of every character is overkill and more than doubles the processing time. For LB30, the width is only needed when the line breaking class (LBP) is OP or CP, and the lookup can be made cheaper still by precomputing the ranges.
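The idea can be sketched roughly as follows. This is a minimal illustration, not the code from this change: the enum, the table `ea_op_cp`, and the function names are hypothetical, and the table holds only a small hand-picked subset of OP/CP code points with East Asian width F, W, or H. A real table would be generated from the Unicode data files.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down set of line breaking classes
 * (not libunibreak's actual enum). */
enum lb_class { LB_AL, LB_NU, LB_OP, LB_CP };

struct cp_range { uint32_t start, end; };

/* Illustrative subset of OP/CP code points whose East Asian width is
 * F, W, or H, sorted by code point. */
static const struct cp_range ea_op_cp[] = {
    { 0x3008, 0x3011 },  /* CJK angle, corner, and lenticular brackets */
    { 0xFF08, 0xFF09 },  /* fullwidth parentheses */
};

/* Width check that is only meaningful for OP/CP characters. */
bool is_east_asian_op_cp(uint32_t ch)
{
    size_t n = sizeof ea_op_cp / sizeof ea_op_cp[0];
    for (size_t i = 0; i < n; i++) {
        if (ch < ea_op_cp[i].start)
            return false;            /* table is sorted: stop early */
        if (ch <= ea_op_cp[i].end)
            return true;
    }
    return false;
}

/* True when the character is an OP or CP that LB30 treats as
 * non-East-Asian. Any other class returns false without a lookup. */
bool lb30_bracket_applies(enum lb_class cls, uint32_t ch)
{
    if (cls != LB_OP && cls != LB_CP)
        return false;                /* width is irrelevant here */
    return !is_east_asian_op_cp(ch);
}
```

With this shape, ordinary letters and digits never reach the width table at all. Only brackets do, and the table they are checked against is far smaller than a full East Asian width table.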

bbshelper commented 3 months ago

Well, this doesn't take into account language specific overrides in linebreakdef.c...

bbshelper commented 3 months ago

Luckily, the overrides that use LBP_OP all have an East Asian width of N or A, so the return value of op_is_east_asian() is correct with or without them. I've added a note.

adah1972 commented 3 months ago

I like the optimization, but it is better to keep the code future-safe, and also be sure to follow the Python code style.

adah1972 commented 3 months ago

> Looking up the East Asian width of every character is overkill and more than doubles the processing time. For LB30, the width is only needed when the line breaking class (LBP) is OP or CP, and the lookup can be made cheaper still by precomputing the ranges.

Judging by the speed of the tests (make check), the slowdown is not that large. Is the performance hit more severe in your use case?

bbshelper commented 3 months ago

> Judging by the speed of the tests (make check), the slowdown is not that large. Is the performance hit more severe in your use case?

I noticed the performance issue while profiling koreader (an ebook reader), which calls lb_process_next_char() on every character. My report is based on the time spent in that function.

  1. make check on my machine takes ~7.0 ms vs ~7.3 ms, but that timing also includes other work such as fgets(). If I wrap set_linebreaks_utf32 in a 1000-iteration loop, the result becomes ~570 ms vs ~720 ms.
  2. My test files are in English, so their characters sit at the top of eaw_prop, which is a worst-case scenario for binary search.
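To illustrate the second point, here is a small sketch that counts binary-search probes over a synthetic sorted range table. The table layout and the names `width_range`, `fill_table`, and `count_probes` are made up for this example and are not the real eaw_prop data. It only shows why a code point near the start of a sorted table needs the full search depth, while one near the middle can be found at once.

```c
#include <stddef.h>
#include <stdint.h>

struct width_range { uint32_t start, end; };

/* Fill a synthetic table: entry i covers [16*i, 16*i + 15]. */
void fill_table(struct width_range *t, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        t[i].start = (uint32_t)(16 * i);
        t[i].end = t[i].start + 15;
    }
}

/* Binary search for ch; returns how many table entries were probed. */
size_t count_probes(const struct width_range *t, size_t n, uint32_t ch)
{
    size_t lo = 0, hi = n, probes = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        probes++;
        if (ch < t[mid].start)
            hi = mid;                /* continue in the lower half */
        else if (ch > t[mid].end)
            lo = mid + 1;            /* continue in the upper half */
        else
            break;                   /* ch falls inside entry mid */
    }
    return probes;
}
```

With a 1023-entry table, a code point in the middle entry is found on the first probe, while 'A' (U+0041), which falls in one of the first few entries, takes 10 probes. That matches the roughly log2(n) cost paid for every character of English text.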