arthurdejong / python-stdnum

A Python library to provide functions to handle, parse and validate standard numbers.
https://arthurdejong.org/python-stdnum/
GNU Lesser General Public License v2.1
484 stars 203 forks source link

Wrong VATIN validation on duplicate country code #420

Open simonphilips opened 8 months ago

simonphilips commented 8 months ago

There seems to be a bug in the VATIN validation when you duplicate the country code:

In [3]: stdnum.vatin.is_valid("BE 0308.357.159")
Out[3]: True

In [4]: stdnum.vatin.is_valid("BE BE 0308.357.159")
Out[4]: True

In [5]: stdnum.eu.vat.is_valid("BE BE 0308.357.159")
Out[5]: False

[4] should have the same result as [5].

This is also happening for Dutch (NL) numbers but not for Mexican (MX) numbers for example.

arthurdejong commented 7 months ago

Hi @simonphilips,

Thanks for reporting this. I've been looking into this and I can't seem to find an elegant solution for this. The problem is that for some countries the country code is considered a valid part of the number. For most of these it is optional while for some others it is required (e.g. the Swiss VAT number). In a lot of other countries the country code is not considered part of the number prefix.

The vatin code currently first tries to validate the number without the country code prefix and if that fails falls back to validating the number with the prefix. This means that the country code may end up being stripped twice.

The only thing I can think of to fix this is to classify each VAT number into categories that define whether they strip a country code prefix. Since we have 70+ numbers keeping this classification up-to-date will be tricky.

Also explicitly checking if the prefix is there twice will not work because some numbers (e.g. the Mexican tax number) can start with letters that match the country code.

I'm willing to look at other suggestions for a solution to this.

simonphilips commented 7 months ago

Hello @arthurdejong,

I understand. I had also looked into the code and it was indeed pretty difficult to come up with a good solution.

My original thought was to do something like this:

def validate(number):
    number = clean(number, '').strip()
    module = _get_cc_module(number[:2])

    # Check if the full number is valid, if so then it's valid.
    try:
        return module.validate(number)
    except ValidationError:
        pass

    # In order for the number to still be considered valid, it must be valid
    # without the country code.
    country = number[:2].upper()
    validated = country + module.validate(number[2:])

    # The module considers number[2:] valid but that could be because of two
    # mutually exclusive reasons:
    #
    #   1. number[2:] is a fully valid number without country code: GOOD
    #   2. number[4:] is a fully valid number without country code and
    #      number[2:4] just duplicates the country code: BAD
    #
    # Therefore, let's check if number[4:] is valid. If so, then number[2:] is
    # only valid because it contains the country code and therefore number[0:]
    # contains the country code twice and is not valid.

    sub = clean(number[2:], '').strip()
    if sub[:2].upper() == country and module.is_valid(sub[2:]):
        raise InvalidFormat()

    return validated

Which works in these cases:

# BE 0308.357.159
validate("BE 0308.357.159")  # valid -> OK

# BE BE 0308.357.159 is not a valid VAT number
validate("BE BE 0308.357.159")  # invalid
validate("BE 0308.357.159")  # valid
validate("0308.357.159")  # valid -> original is invalid -> ERROR

# MXA001101007 is a valid VAT number
validate("MXA001101007")  # valid -> OK

# MX MXA001101007 is a valid VAT number
validate("MX MXA001101007")  # invalid
validate("MXA001101007")  # valid
validate("A001101007")  # invalid -> original is valid -> OK

But unfortunately, while it correctly identifies "BE BE 0308.357.159" as invalid, it still thinks BE.BE 0308.357.159 is valid because periods are stripped in the country's validator, not in vatin itself so it thinks the second country could would be .B instead. I'm not sure how to proceed with this unfortunately.