arthurdejong / python-stdnum

A Python library to provide functions to handle, parse and validate standard numbers.
https://arthurdejong.org/python-stdnum/
GNU Lesser General Public License v2.1

Add support for Singapore TIN Number #203

Closed unho closed 4 years ago

unho commented 4 years ago

Fixes #111.

unho commented 4 years ago

@arthurdejong Ready for review.

arthurdejong commented 4 years ago

Thanks for the PR. It is a pity that no documentation on the check digit algorithm could be obtained. However, because such a large dataset of valid numbers has been published, it is actually not that hard to reverse-engineer the algorithm.

First I looked at the distribution of the check digits across all numbers and found that they were not evenly distributed. However, when filtering the numbers by type I found:

This seems to suggest a mod 11 algorithm where the check digit alphabet differs per type. Assuming a simple weighted sum, we can try to guess the weights.
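As a rough sketch of that first step, the check letters can be tallied with a `Counter`. The helper name `check_letter_counts` is hypothetical, and the sample is not the real dataset: it is built from two of the number groups printed further down in this comment. Eleven distinct letters within one number type is presumably what points towards mod 11.

```python
from collections import Counter


def check_letter_counts(numbers):
    """Tally how often each trailing check letter occurs (hypothetical helper)."""
    return Counter(number[-1] for number in numbers)


# Illustrative sample only, reconstructed from the groups printed in this
# thread: 5286165x with check letters AWLJDBXMKE, plus 53365000C.
sample = ['5286165%d%s' % (i, c) for i, c in enumerate('AWLJDBXMKE')]
sample.append('53365000C')

counts = check_letter_counts(sample)
print(len(counts), ''.join(sorted(counts)))
```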

Looking only at the business numbers for now, we can group numbers that differ only in the last digit before the check digit and see how that digit changes the check digit:

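from collections import defaultdict
import random

# numbers is assumed to hold the published valid business numbers, sorted
sames = defaultdict(list)
for number in numbers:
    # group numbers that differ only in the last digit before the check letter
    sames[number[:7] + 'x'].append(number)
# keep only the groups in which all ten values of that digit occur
complete = [number for number, values in sames.items() if len(values) == 10]
for i in range(5):
    number = random.choice(complete)
    print('%s %s' % (number, ''.join(x[-1] for x in sames[number])))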
5286165x AWLJDBXMKE
5336500x CAWLJDBXMK
5310314x CAWLJDBXMK
5313062x LJDBXMKECA
5322613x ECAWLJDBXM

This shows that the check digit alphabet is MKECAWLJDBX or some rotation of it. Continuing to the second digit from right:
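A quick way to confirm the rotation claim is to check that every printed row is a contiguous run of the cyclic alphabet, i.e. a substring of the alphabet written twice. This is only a sketch over the five rows shown above; `is_rotation_window` is a hypothetical helper name.

```python
def is_rotation_window(observed, alphabet):
    """True if `observed` is a contiguous run of the cyclic alphabet."""
    return observed in alphabet * 2


alphabet = 'MKECAWLJDBX'
# The five check-letter rows printed above (x running from 0 to 9).
rows = ['AWLJDBXMKE', 'CAWLJDBXMK', 'CAWLJDBXMK', 'LJDBXMKECA', 'ECAWLJDBXM']

print(all(is_rotation_window(row, alphabet) for row in rows))
```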

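# defaultdict, random and numbers as in the previous snippet
alphabet = 'MKECAWLJDBX'  # the rotation named above; it matches the indices printed below
sames = defaultdict(list)
for number in numbers:
    # group numbers that differ only in the second digit from the right
    sames[number[:6] + 'x' + number[7:8]].append(number)
complete = [number for number, values in sames.items() if len(values) == 10]
for i in range(5):
    number = random.choice(complete)
    # print the alphabet index of each check letter in the group
    print('%s %s' % (number, ', '.join(str(alphabet.index(x[-1])) for x in sames[number])))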
530705x5 9, 5, 1, 8, 4, 0, 7, 3, 10, 6
531093x1 6, 2, 9, 5, 1, 8, 4, 0, 7, 3
533736x9 0, 7, 3, 10, 6, 2, 9, 5, 1, 8
528194x1 4, 0, 7, 3, 10, 6, 2, 9, 5, 1
532139x3 6, 2, 9, 5, 1, 8, 4, 0, 7, 3

This shows that each time x goes up by one, the index of the check digit in the alphabet goes down by 4 (mod 11), which implies that the weight should be 7 (-4 mod 11).
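The step can be read off mechanically from the printed index rows: take the difference between neighbouring indices modulo 11. In this sketch `index_steps` is a hypothetical helper, and the rows are copied from the output above.

```python
def index_steps(indices, modulus=11):
    """Set of step sizes between consecutive indices, reduced mod `modulus`."""
    return {(b - a) % modulus for a, b in zip(indices, indices[1:])}


# Index rows copied from the output above (x running from 0 to 9).
rows = [
    [9, 5, 1, 8, 4, 0, 7, 3, 10, 6],
    [6, 2, 9, 5, 1, 8, 4, 0, 7, 3],
    [0, 7, 3, 10, 6, 2, 9, 5, 1, 8],
    [4, 0, 7, 3, 10, 6, 2, 9, 5, 1],
]

for row in rows:
    print(index_steps(row))  # a single value means a constant step, i.e. one weight
```

Every row yields the single step 7, which is -4 reduced mod 11.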

Doing this for every digit (the first digit requires a bit of tweaking because only values from 0 to 5 occur) and rotating the alphabet to get the correct offset, we get:

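def calc_business_check_digit(number):
    """Calculate the check letter for the 8 digits of a business number."""
    # compact() is assumed to be the module's helper that strips separators
    number = compact(number)
    weights = (10, 4, 9, 3, 8, 2, 7, 1)
    # weighted sum of the digits mod 11, used as an index into the alphabet
    return 'XMKECAWLJDB'[sum(int(n) * w for n, w in zip(number, weights)) % 11]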
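To make the arithmetic concrete, here is a term-by-term breakdown for 52861650A, the first number of the 5286165x group printed earlier. The helper `explain` is hypothetical and skips `compact()`, so it expects input without separators.

```python
WEIGHTS = (10, 4, 9, 3, 8, 2, 7, 1)
ALPHABET = 'XMKECAWLJDB'


def explain(number):
    """Return the per-digit products, their sum, the sum mod 11 and the letter."""
    digits = [int(d) for d in number[:8]]
    products = [d * w for d, w in zip(digits, WEIGHTS)]
    total = sum(products)
    return products, total, total % 11, ALPHABET[total % 11]


products, total, remainder, letter = explain('52861650A')
print(products)                  # [50, 8, 72, 18, 8, 12, 35, 0]
print(total, remainder, letter)  # 203 5 A
```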

Unleashing this function on the data set I found only 11 numbers where the check digit does not match:

50856857D
52737212B
52803596X
52804404A
52805118K
52813100D
52853385J
52856860B
52870338A
52882019E
52923950C
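The filter that produces such a list could look roughly like the sketch below. It is self-contained rather than the actual module code: `business_check_letter` and `mismatches` are hypothetical names, and the input is the 11 numbers above plus two numbers taken from the groups printed earlier, not the full dataset.

```python
WEIGHTS = (10, 4, 9, 3, 8, 2, 7, 1)
ALPHABET = 'XMKECAWLJDB'


def business_check_letter(digits):
    """Check letter for the 8 digits of a business number."""
    return ALPHABET[sum(int(n) * w for n, w in zip(digits, WEIGHTS)) % 11]


def mismatches(numbers):
    """Numbers whose last character differs from the calculated check letter."""
    return [n for n in numbers if business_check_letter(n[:-1]) != n[-1]]


reported = [
    '50856857D', '52737212B', '52803596X', '52804404A', '52805118K',
    '52813100D', '52853385J', '52856860B', '52870338A', '52882019E',
    '52923950C',
]
consistent = ['52861650A', '53365000C']  # taken from the groups printed earlier

print(mismatches(reported + consistent) == reported)
```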

I have not tried the online validator for these numbers and I haven't looked at the other number types yet, but I expect the analysis to be fairly simple to repeat with the approach above (perhaps with some tweaks for the numbers that have letters in them).

unho commented 4 years ago

Wow!!!

unho commented 4 years ago

@arthurdejong I have checked all those numbers that do not match and they all seem to be either terminated or cancelled in 2017. Maybe we should go with this algorithm?

unho commented 4 years ago

Yep, verified with another website and all those are deregistered.

arthurdejong commented 4 years ago

I managed to reverse-engineer the local company and other checksums as well (the last one was a much bigger puzzle because of the letters). That only leaves "Foreign Company" numbers (the ones starting with F000).

Do you have some examples of valid numbers for these? The one in the tests doesn't pass the [online validator](https://www.iras.gov.sg/irashome/GST/GST-registered-businesses/Other-services/Checking-if-a-Business-is-GST-Registered/) and there do not seem to be many references to this flavour. Also note that the "other" flavour has a code (FC) for foreign companies, so perhaps it has been replaced?

Do you have some more background and/or examples of valid "Foreign Company" numbers?

Thanks.

unho commented 4 years ago

Sadly, I have not found any foreign company UEN numbers. If I recall correctly, the foreign company examples used in the tests were made up based on the documentation I referenced in the ticket, while the examples for all the other types of UEN numbers are real.

arthurdejong commented 4 years ago

Are you OK if I merge it without the foreign company UEN numbers? If it is used and some valid numbers are rejected, someone will likely complain, whereas no one is likely to complain if an invalid number is considered valid.

unho commented 4 years ago

@arthurdejong I am 100% OK with that. I would suggest keeping that particular code, but commented out.