Open marcoffee opened 3 years ago
Hi @marcoffee ,
Thanks for your observation. The main reason for the current code structure is simplicity and readability. Also, python-stdnum has supported Python 2.6 for a long time (now only officially supports 2.7+ because testing with 2.6 is too hard to do) which means that some constructs aren't available.
Do you have an idea of the benefit of the suggested change in numbers? Originally python-stdnum was built around the use case of a web application where a handful of numbers were validated and not really optimised for bulk checking of large number of numbers.
Btw, I would personally prefer something along the lines of:
# characters used for checksum calculation,
_alphabet = '0123456789ABCDEFGHIJKLMN&OPQRSTUVWXYZ'
_alphabet = dict((c, i) for i, c in enumerate(_alphabet))
def calc_check_digit(number):
"""Calculate the check digit."""
check = sum(_alphabet[c] * (18 - i) for i, c in enumerate(number[:17]))
return str((10 - check % 10) % 10)
I prefer dict[idx]
over dict.get(idx, -1)
because the first will raise an exception on incorrect numbers instead of silently returning a wrong value (which the current code also does). Since alphabet is a private variable in this case I'm not too worried with changing its type, but alternative names could be alpha_values
or alpha_weight
or something. For some reason I dislike putting type names in the variable name, probably because it emphasises structure over function or purpose.
Hi @arthurdejong
About the performance benefit, I can change some methods and measure them using timeit
or other simple tool for timing in python.
For some raw initial results, I ran timeit
over a random character search on each option (the first line after %%timeit
is the setup code and is not considered when measuring runtime):
For the str.index
version:
In [1]: import random
In [2]: _alphabet = '0123456789ABCDEFGHIJKLMN&OPQRSTUVWXYZ'
In [3]: %%timeit char = _alphabet[random.randint(0, len(_alphabet))]
...: _alphabet.index(char)
115 ns ± 0.671 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
and for the dict.__getitem__
version.
In [1]: import random
In [2]: _alphabet = '0123456789ABCDEFGHIJKLMN&OPQRSTUVWXYZ'
In [3]: _alpha_weight = dict((c, i) for i, c in enumerate(_alphabet))
In [4]: %%timeit char = _alphabet[random.randint(0, len(_alphabet))]
...: _alpha_weight[char]
32.8 ns ± 0.658 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
It looks like the dict
version is about 3x faster than the str
version. Surprisingly enough, I tried this same code, but for the character "0"
(the best case of the str
version), and the results were close to the ones above!
119 ns ± 0.269 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
for the str
34 ns ± 0.317 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
for the dict
Since it will not change the code enough to reduce readability and it will address the problem of wrong characters, I think it is worth using the dict
.
About the naming, I prefer the _alpha_weight
version, because naming it as _alphabet
would create problems if the original _alphabet
is used somewhere else on the code (e.g., to reconstruct the digits from weights).
@marcoffee How about implement an Alphabet type for better readability?
class Alphabet(object):
def __init__(self, chars):
self._map = { c : i for i, c in enumerate(chars) }
def index(self, key):
return self._map[key]
_alphabet = Alphabet('0123456789ABCDEFGHIJKLMN&OPQRSTUVWXYZ')
@guyskk great idea!
It could have a canonical Alphabet class that support both chars and weights.
Is there any reason why the default checkers use
str.index()
instead of adict
?In this example (taken from here)
You could avoid the
O(N)
_alphabet.index(c)
cost per character search by creating adict
(ordefaultdict
to handle-1
cases, if they are necessary), that can search inO(1)
in amortized time. Here is the adapted code:In this case, assuming the
N
is the alphabet size andM
is the field size, we reduced the overall complexity ofcalc_check_digit
fromO(N * M)
toO(M)
.This also applies to functions that receive the alphabet as parameter, since building the dictionary once
O(N + M)
is better than doing aO(N)
search for each characterO(N * M)
.Another (minor) improvement could be using
enumerate(itertools.islice(number, 17))
(orzip(range(17), number)
) instead of string slicing atnumber[:17]
, since python string slicing makes a copy, wasting a little memory and time.If there is no particular reason to keep using
str.index
, I can take some time and make a pull request that replaces them withdict.get
.