alvinlindstam / grapheme

A python package for grapheme aware string handling
MIT License

Performance enhancements #1

Open alvinlindstam opened 7 years ago

alvinlindstam commented 7 years ago

The current code is a quick write-up that is not very optimized; I just wanted it to be correct and simple to begin with.

alvinlindstam commented 7 years ago

Some improvements landed in v0.2.1. This could be given more attention.

weirman commented 3 years ago

Dear alvinlindstam: Thank you very much for this library! However, I found some problems when I used it, so I updated the code.

You can see the changes at weirman/grapheme. I mainly updated the slice() method to make the algorithm faster: your code uses loops, so the longer the string being processed, the more time it takes. I also noticed that grapheme_lengths() takes a long time, so I added it as a parameter to slice(). Finally, I added reverse slicing in this version. You can view the changes on my branch and pull them into the main branch here.
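
For reference, here is a minimal sketch of the eager approach described above, built on the public grapheme.grapheme_lengths() API; eager_slice and its lengths parameter are illustrative names, not the actual code in weirman/grapheme.

import itertools
import grapheme

def eager_slice(s, start, end, lengths=None):
    # Reuse precomputed per-grapheme code point counts if supplied,
    # otherwise compute them all up front.
    if lengths is None:
        lengths = list(grapheme.grapheme_lengths(s))
    # Cumulative sums give the code point offset of every grapheme boundary.
    offsets = [0] + list(itertools.accumulate(lengths))
    n = len(lengths)
    # Normalize negative indices the way native slicing does.
    start = max(0, start + n) if start < 0 else min(start, n)
    end = max(0, end + n) if end < 0 else min(end, n)
    if start >= end:
        return ""
    return s[offsets[start]:offsets[end]]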

In the 70 million character test, the time to extract all the characters using slice() was reduced from a few weeks to less than 10 minutes, and the whole test suite still passes.

(graphemeTest) λ py.test
======================================================= test session starts ========================================================
platform win32 -- Python 3.7.0, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir: E:\PyCode\grapheme
collected 2418 items

tests\test_mod.py ........................................................................................................... [ 4%]
............................................................................................................................. [ 9%]
............................................................................................................................. [ 14%]
............................................................................................................................. [ 19%]
............................................................................................................................. [ 25%]
............................................................................................................................. [ 30%]
............................................................................................................................. [ 35%]
............................................................................................................................. [ 40%]
............................................................................................................................. [ 45%]
............................................................................................................................. [ 50%]
............................................................................................................................. [ 56%]
............................................................................................................................. [ 61%]
............................................................................................................................. [ 66%]
............................................................................................................................. [ 71%]
............................................................................................................................. [ 76%]
............................................................................................................................. [ 81%]
............................................................................................................................. [ 87%]
............................................................................................................................. [ 92%]
............................................................................................................................. [ 97%]
............................................................. [100%]

======================================================= 2418 passed in 3.21s =======================================================

alvinlindstam commented 3 years ago

Hey @weirman

Those are some nice improvements! A PR to this library would be very welcome if we can settle on some trade-offs (or avoid them):

Consuming all grapheme lengths vs loop solution

The reason it uses the loop is that it's lazy, so it doesn't compute the grapheme lengths of the full string. If you have a very long string and want to extract a fairly short prefix, I would expect the current solution to be more efficient than your version. I would expect it to scale linearly with the end index (from 0, even when using a non-zero start), while an eager calculation of the lengths scales linearly with the string length, irrespective of the slice size.
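
For illustration, a minimal sketch of that lazy behaviour using the public grapheme.graphemes() generator (lazy_slice is a hypothetical helper, not the library's internal implementation):

import itertools
import grapheme

def lazy_slice(s, start, end):
    # graphemes() is a lazy generator, so islice() stops the scan after
    # `end` graphemes even if the input string is much longer.
    return "".join(itertools.islice(grapheme.graphemes(s), start, end))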

It would be nice to get some data on the overhead of the loop vs. the eager list building, and maybe fall back to an eager scan when some heuristic (e.g. the slice is not a short prefix) suggests it would be faster than the lazy approach.
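
One rough way to gather that data (a hypothetical benchmark, not part of the library) is to time a short prefix slice against an eager full scan on a long input:

import timeit
import grapheme

# Long test string containing multi code point graphemes.
text = "hej 🇸🇪👨‍👩‍👧 " * 200000

lazy = timeit.timeit(lambda: grapheme.slice(text, 0, 100), number=10)
eager = timeit.timeit(lambda: list(grapheme.grapheme_lengths(text)), number=10)
print(f"lazy prefix slice: {lazy:.3f}s  eager full scan: {eager:.3f}s")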

Negative indexing

I left this out because of the lazy scanning. For negative indices, we'd need to either scan the whole string, or have some way to find the last X graphemes of a string without scanning it all from the very beginning.

Finding graphemes in reverse is difficult, as one needs to backtrack to a position that is guaranteed to be a grapheme break boundary irrespective of previous state. For some strings (such as a long sequence of flag emoji) that's impossible, and would require backtracking through the full string. I wouldn't mind introducing negative indexing with an eager scan of the full string, with appropriate docs, and opening it up for optimizations later.
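
As a sketch of that eager fallback (illustrative only; last_n_graphemes is not an existing API):

import grapheme

def last_n_graphemes(s, n):
    # A full scan is unavoidable here: collect every grapheme length,
    # then count code points backwards over the last n graphemes.
    lengths = list(grapheme.grapheme_lengths(s))
    return s[len(s) - sum(lengths[-n:]):] if n > 0 else ""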

Your solution throws on start > end. I'd prefer it to behave like native slicing, which returns an empty string (as "test"[5:-10] does).

Precomputed grapheme lengths

Do you think it's common to want to do several slices on the same string? I'm a bit hesitant to introduce multiple arguments for the same data, since we would depend on the caller to ensure that the grapheme lengths are correct and not mixed up with some other string.

I'm wondering if we'd like to introduce a custom type representing a string with precomputed grapheme data, which could support all the operations more efficiently (not just slicing). So maybe something like:

>>> input_g = grapheme.precompute(input)
>>> input_g.length()
65432
>>> input_g.slice(0, 100)
"..."
>>> input_g.slice(5000, -100)
"..."

One could even use the native slicing operator then.
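
A rough sketch of what such a type could look like, assuming a hypothetical GraphemeString name that is not part of the library; after the one-time scan, length and slicing become simple offset lookups:

import itertools
import grapheme

class GraphemeString:
    def __init__(self, s):
        self._s = s
        # Code point offset of every grapheme boundary, computed eagerly once.
        self._offsets = [0] + list(itertools.accumulate(grapheme.grapheme_lengths(s)))

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, key):
        # Support the native slicing operator over grapheme indices.
        if isinstance(key, slice):
            start, stop, step = key.indices(len(self))
            if step != 1:
                raise ValueError("only a step of 1 is supported in this sketch")
            return self._s[self._offsets[start]:self._offsets[stop]]
        index = key if key >= 0 else key + len(self)
        return self._s[self._offsets[index]:self._offsets[index + 1]]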

That one would need some more planning though: what it should do, when it should return instances of its own type vs. when to return str, and so on.

differentprogramming commented 2 years ago

It's not maximally optimized, but I wrote a similar C++ library based on utf8proc. You could try that.

https://github.com/differentprogramming/Grapheme-String