arthurdejong / python-stdnum

A Python library to provide functions to handle, parse and validate standard numbers.
https://arthurdejong.org/python-stdnum/
GNU Lesser General Public License v2.1
495 stars 205 forks source link

Improve performance using dicts #264

Open marcoffee opened 3 years ago

marcoffee commented 3 years ago

Is there any reason why the default checkers use str.index() instead of a dict?

In this example (taken from here)

# characters used for checksum calculation,
_alphabet = '0123456789ABCDEFGHIJKLMN&OPQRSTUVWXYZ'

def calc_check_digit(number):
    """Calculate the check digit."""
    check = sum(_alphabet.index(c) * (18 - i) for i, c in enumerate(number[:17]))
    return str((10 - check % 10) % 10)

You could avoid the O(N) _alphabet.index(c) cost per character search by creating a dict (or defaultdict to handle -1 cases, if they are necessary), that can search in O(1) in amortized time. Here is the adapted code:

# characters used for checksum calculation,
_alphabet = '0123456789ABCDEFGHIJKLMN&OPQRSTUVWXYZ'
_alphadict = { c : i for i, c in enumerate(_alphabet) }

def calc_check_digit(number):
    """Calculate the check digit."""
    check = sum(_alphadict.get(c, -1) * (18 - i) for i, c in enumerate(number[:17]))
    return str((10 - check % 10) % 10)

In this case, assuming the N is the alphabet size and M is the field size, we reduced the overall complexity of calc_check_digit from O(N * M) to O(M).

This also applies to functions that receive the alphabet as parameter, since building the dictionary once O(N + M) is better than doing a O(N) search for each character O(N * M).

Another (minor) improvement could be using enumerate(itertools.islice(number, 17)) (or zip(range(17), number)) instead of string slicing at number[:17], since python string slicing makes a copy, wasting a little memory and time.

If there is no particular reason to keep using str.index, I can take some time and make a pull request that replaces them with dict.get.

arthurdejong commented 3 years ago

Hi @marcoffee ,

Thanks for your observation. The main reason for the current code structure is simplicity and readability. Also, python-stdnum has supported Python 2.6 for a long time (now only officially supports 2.7+ because testing with 2.6 is too hard to do) which means that some constructs aren't available.

Do you have an idea of the benefit of the suggested change in numbers? Originally python-stdnum was built around the use case of a web application where a handful of numbers were validated and not really optimised for bulk checking of large number of numbers.

Btw, I would personally prefer something along the lines of:

# characters used for checksum calculation,
_alphabet = '0123456789ABCDEFGHIJKLMN&OPQRSTUVWXYZ'
_alphabet = dict((c, i) for i, c in enumerate(_alphabet))

def calc_check_digit(number):
    """Calculate the check digit."""
    check = sum(_alphabet[c] * (18 - i) for i, c in enumerate(number[:17]))
    return str((10 - check % 10) % 10)

I prefer dict[idx] over dict.get(idx, -1) because the first will raise an exception on incorrect numbers instead of silently returning a wrong value (which the current code also does). Since alphabet is a private variable in this case I'm not too worried with changing its type, but alternative names could be alpha_values or alpha_weight or something. For some reason I dislike putting type names in the variable name, probably because it emphasises structure over function or purpose.

marcoffee commented 3 years ago

Hi @arthurdejong

About the performance benefit, I can change some methods and measure them using timeit or other simple tool for timing in python. For some raw initial results, I ran timeit over a random character search on each option (the first line after %%timeit is the setup code and is not considered when measuring runtime):

For the str.index version:

In [1]: import random

In [2]: _alphabet = '0123456789ABCDEFGHIJKLMN&OPQRSTUVWXYZ'

In [3]: %%timeit char = _alphabet[random.randint(0, len(_alphabet))]
   ...: _alphabet.index(char)
115 ns ± 0.671 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

and for the dict.__getitem__ version.

In [1]: import random

In [2]: _alphabet = '0123456789ABCDEFGHIJKLMN&OPQRSTUVWXYZ'

In [3]: _alpha_weight = dict((c, i) for i, c in enumerate(_alphabet))

In [4]: %%timeit char = _alphabet[random.randint(0, len(_alphabet))]
   ...: _alpha_weight[char]
32.8 ns ± 0.658 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

It looks like the dict version is about 3x faster than the str version. Surprisingly enough, I tried this same code, but for the character "0" (the best case of the str version), and the results were close to the ones above!

Since it will not change the code enough to reduce readability and it will address the problem of wrong characters, I think it is worth using the dict. About the naming, I prefer the _alpha_weight version, because naming it as _alphabet would create problems if the original _alphabet is used somewhere else on the code (e.g., to reconstruct the digits from weights).

guyskk commented 2 years ago

@marcoffee How about implement an Alphabet type for better readability?

class Alphabet(object):
    def __init__(self, chars):
        self._map = { c : i for i, c in enumerate(chars) }
    def index(self, key):
        return self._map[key]

_alphabet = Alphabet('0123456789ABCDEFGHIJKLMN&OPQRSTUVWXYZ')
marcoffee commented 2 years ago

@guyskk great idea!

It could have a canonical Alphabet class that support both chars and weights.