avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0

Add new unidecode_translate method #79

Open marcoffee opened 2 years ago

marcoffee commented 2 years ago

This method behaves similarly to unidecode_expect_nonascii, but it uses a preloaded translation dict, built from the xNNN.py files in the unidecode folder, which is then fed to str.translate. It raises the same errors as unidecode, but only checks for surrogates when the check_surrogates parameter is True. Since it requires loading the dictionary on every initialization (I could not generate a cache for this case), it is slower than unidecode_expect_nonascii for one-off use in the command-line utility, but faster in applications that convert many strings.
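Roughly, the idea is something like the following sketch (the loading loop and names are illustrative, not the pull request's actual code; it assumes each xNNN module exposes a data table of up to 256 replacement strings):

# Build one {codepoint: replacement} dict from every available xNNN
# data table, then let str.translate() do a single C-level pass.
table = {}
for section in range(0xf00):  # sections cover codepoints up to 0xeffff
    try:
        mod = __import__('unidecode.x%03x' % section, globals(), locals(), ['data'])
    except ImportError:
        continue  # no data table for this section
    for offset, repl in enumerate(mod.data):
        table[(section << 8) | offset] = repl

print('ãbç'.translate(table))  # -> 'abc'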

Here are the results of benchmark.py when run with each configuration (I just replaced the internal calls to each of those methods):

It is also faster for big strings, as can be seen in the following benchmark:

In [1]: import unidecode as udec

In [2]: big_str = "ãbç" * 100000

In [3]: %timeit udec.unidecode_expect_nonascii(big_str)
78 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit udec.unidecode_translate(big_str, check_surrogates=False)
7.67 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit udec.unidecode_translate(big_str, check_surrogates=True)
21.5 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note that the tests in the tests folder also pass for the unidecode_translate method, provided that check_surrogates=True. The cases that compare the exception context to None fail (even when using raise ... from None), but this is easily solved by storing the exception object in a variable and raising it outside the try-except block, as sketched below.
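The workaround would look roughly like this (a sketch with illustrative names, not the pull request's exact code):

def strict_decode(string: str) -> str:
    error = None
    try:
        return string.encode('ascii').decode('ascii')
    except UnicodeEncodeError as exc:
        # raise ... from None here would set __cause__ to None but still
        # record the active exception in __context__, which the tests check.
        error = ValueError('non-ASCII input: %r' % exc.object)
    # Raised outside the handler, so error.__context__ stays None.
    raise error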

avian2 commented 2 years ago

Thanks for this pull request. I like the performance increase and I think using str.translate might be interesting for use in Unidecode. I see some minor issues in the code, but they look easy to fix.

However, the main issue I have with this change is that it essentially duplicates all of Unidecode's functionality in another function. I don't like having two separate implementations.

I would be interested in exploring the possibility of just replacing the current implementation with one based on str.translate.

For a long-running program, preloading the tables shouldn't add much overhead, since the current implementation already caches the tables and, in the long run, the cache ends up loading all the translations anyway. I'm not sure how many people only use Unidecode for short runs, though.

Maybe the translation table passed to str.translate() can act as a cache/wrapper around the current _get_repl_str()? Perhaps something based on collections.defaultdict? That could end up being very close to the current implementation as far as memory usage is concerned.
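One subtlety: collections.defaultdict calls its default_factory with no arguments, so the factory never sees the missing codepoint; a dict subclass defining __missing__(key) receives the key directly. A minimal sketch of that idea (assuming _get_repl_str takes a single character):

class ReplCache(dict):
    # dict.__missing__ gets the missing key, unlike defaultdict's
    # default_factory, which is called with no arguments.
    def __missing__(self, codepoint):
        repl = _get_repl_str(chr(codepoint))
        self[codepoint] = repl  # memoize for subsequent lookups
        return repl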

horsemankukka commented 1 year ago

I tried using collections.UserDict with __missing__() basically being _get_repl_str(), but adding the missing section directly to self.data when it is loaded, and caching missing sections separately in a set so None can be returned quickly for those. The performance increase was impressive (it also roughly doubles performance in benchmark.py), but I'm not sure what the most elegant way to handle errors and replace_str would be here. I haven't done any further testing either, but doing this dynamically seems entirely feasible.

import warnings
from collections import UserDict
from itertools import zip_longest

class UnidecodeCache(UserDict):
    missing_sections = set()

    def __missing__(self, codepoint):
        if codepoint < 0x80:
            # Already ASCII: a LookupError makes str.translate() leave
            # the character unchanged.
            raise LookupError()

        if codepoint > 0xeffff:
            # No data on characters in Private Use Area and above.
            return None

        if 0xd800 <= codepoint <= 0xdfff:
            warnings.warn("Surrogate character %r will be ignored. "
                          "You might be using a narrow Python build." % (chr(codepoint),),
                          RuntimeWarning, 2)

        section = codepoint >> 8   # Chop off the last two hex digits

        if section in self.missing_sections:
            return None

        try:
            mod = __import__('unidecode.x%03x'%(section), globals(), locals(), ['data'])
        except ImportError:
            # No data on this character
            self.missing_sections.add(section)
            return None

        # Cache the whole 256-codepoint section at once; zip_longest pads
        # short tables with None, and None makes str.translate() drop the
        # character.
        for k, v in zip_longest(range(256), mod.data):
            self.data[(section << 8) | k] = v

        return self.data[codepoint]

Cache = UnidecodeCache()

# ...

def _unidecode(string: str, errors: str, replace_str: str) -> str:
    return string.translate(Cache)
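As a quick usage check (assuming the xNNN data tables are importable), str.translate consults the cache lazily:

# ASCII characters hit the LookupError branch and pass through untouched;
# non-ASCII characters trigger a one-time section import.
print('ãbç'.translate(Cache))  # -> 'abc'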

Furthermore, it initially looks like the performance of unidecode_expect_ascii might improve by adding if string.isascii(): return string as an early exit. At least it logically shouldn't make things worse.
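Concretely, the guard would be a sketch like this (the signature with errors and replace_str defaults is assumed, and _unidecode stands in for the existing slow path):

def unidecode_expect_ascii(string: str, errors: str = 'ignore', replace_str: str = '?') -> str:
    # str.isascii() (Python 3.7+) is a cheap C-level scan, so pure-ASCII
    # input returns immediately without touching any tables.
    if string.isascii():
        return string
    return _unidecode(string, errors, replace_str)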