Update identifier validation in _fields.py

Hc747 commented 3 years ago

Addresses the use case where the first line of a valid TD3 MRZ is structured as follows:

P<XXXAA<<BBBBBB<<CCCCC<DD<<<<<<<<<<<<<<<<<<<

Here the 'A' component is the primary identifier (surname) and the 'B', 'C' and 'D' components together form the secondary identifier (name).

Hc747 commented 3 years ago

Please let me know if you're okay with merging this @Arg0s1080. Hope all is well on your end!

Arg0s1080 commented 3 years ago

Hi!

Hello

I promise to try to review it this weekend

Hc747 commented 3 years ago

Thank you - no rush! Let me know if there's anything you'd like changed!

Hc747 commented 3 years ago

Bump @Arg0s1080 :p

Arg0s1080 commented 3 years ago

Hi again:

Sorry, I should have said "next weekend".

I don't know if I understand your problem correctly, but...

The approach you propose is not valid.

ICAO specs say:

9303-3

4.6 Convention for Writing the Name of the Holder

[...]

The primary identifier, using the Latin character transliteration (if applicable), shall be written in the MRZ as specified in the form factor specific Parts 4 to 7 of Doc 9303. The primary identifier shall be followed by two filler characters (<<). The secondary identifier, using the Latin character transliteration (if applicable), shall be written starting in the character position immediately following the two filler characters.

If the primary or secondary identifiers have more than one name component, each component shall be separated by a single filler character (<).

Filler characters (<) should be inserted immediately following the final secondary identifier (or following the primary identifier in the case of a name having only a primary identifier) through to the last character position in the machine readable line.
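
The convention quoted above can be sketched in a few lines of Python. This is only an illustration and not code from the mrz library: the helper names `split_name_field` and `is_conformant` are hypothetical, and the slice `[5:]` assumes a TD3 first line, where the name field occupies character positions 6 to 44.

```python
FILLER = "<"


def split_name_field(name_field: str) -> tuple[list[str], list[str]]:
    """Split an MRZ name field into primary and secondary name components.

    Hypothetical helper, written only to illustrate the quoted convention.
    """
    # Trailing fillers only pad the field to its fixed width.
    stripped = name_field.rstrip(FILLER)
    # The first "<<" separates the primary from the secondary identifier.
    primary, _, secondary = stripped.partition(FILLER * 2)
    primary_parts = primary.split(FILLER) if primary else []
    secondary_parts = secondary.split(FILLER) if secondary else []
    return primary_parts, secondary_parts


def is_conformant(name_field: str) -> bool:
    """Return True if components are separated by single fillers only."""
    primary, secondary = split_name_field(name_field)
    # An empty component means two fillers stood where only one is allowed.
    return bool(primary) and all(primary) and all(secondary)


conformant = "P<XXXAA<<BBBBBB<CCCCCC<DD<<<<<<<<<<<<<<<<<<<"
deviating = "P<XXXAA<<BBBBBB<<CCCCC<DD<<<<<<<<<<<<<<<<<<<"

print(split_name_field(conformant[5:]))  # (['AA'], ['BBBBBB', 'CCCCCC', 'DD'])
print(is_conformant(conformant[5:]))     # True
print(is_conformant(deviating[5:]))      # False
```

The second double filler in the deviating line produces an empty component in the secondary identifier, and that is what the sketch flags.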

So, following your sample, its structure should be:

P<XXXAA<<BBBBBB<CCCCCC<DD<<<<<<<<<<<<<<<<<<<

instead of:

P<XXXAA<<BBBBBB<<CCCCC<DD<<<<<<<<<<<<<<<<<<<

For example:

With more than 2 identifiers (Primary: AA, Secondary: BBBBBB, Tertiary: CCCCC DD):

#!/usr/bin/python3
# -*- coding: UTF-8 -*-

from mrz.checker.td3 import TD3CodeChecker

# First line contains a double filler (<<) inside the secondary identifier
check = TD3CodeChecker("P<XXXAA<<BBBBBB<<CCCCC<DD<<<<<<<<<<<<<<<<<<<\n"
                       "ZE000509<9XXX8501019F2301147<<<<<<<<<<<<<<08")

print("Result:")
print(bool(check))

print()
print("Detected errors:")
errors = check.report.errors
if errors:
    print(errors)
else:
    print("None")

Output:

Result:
False

Detected errors:
['more than two identifiers', 'false identifier']

If we repair the full name so that it uses only 2 identifiers:

Primary: AA, Secondary: BBBBBB CCCCCC DD

from mrz.checker.td3 import TD3CodeChecker

# Same document, with single fillers (<) between the name components
check = TD3CodeChecker("P<XXXAA<<BBBBBB<CCCCCC<DD<<<<<<<<<<<<<<<<<<<\n"
                       "ZE000509<9XXX8501019F2301147<<<<<<<<<<<<<<08")

print("Result:")
print(bool(check))

print()
print("Detected errors:")
errors = check.report.errors
if errors:
    print(errors)
else:
    print("None")

Output:

Result:
True

Detected errors:
None

Sorry for the delay, and best regards.

PS: If I have misunderstood something, tell me.

Hc747 commented 3 years ago

@Arg0s1080 I don't believe you've misunderstood anything! :) It's strange, however, because I've received an official passport document that does not adhere to this standard and therefore cannot be parsed by this library. The document was in the format specified in the original post of this PR.
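
One possible workaround for a document like this, sketched under the assumption that the extra filler inside the secondary identifier is the only deviation, is to normalise the first line before handing it to a checker. The function `normalize_td3_name_line` below is hypothetical and is not part of the mrz library or of this PR's diff.

```python
import re

TD3_LINE_LENGTH = 44  # a TD3 MRZ line is 44 characters long


def normalize_td3_name_line(line: str) -> str:
    """Collapse repeated fillers inside the secondary identifier.

    Hypothetical pre-processing step, assuming the only deviation is an
    extra filler between components of the secondary identifier.
    """
    # Positions 1-5 hold the document code and issuing state; the rest is the name.
    prefix, name_field = line[:5], line[5:]
    stripped = name_field.rstrip("<")
    # Keep the first "<<" as the primary/secondary separator.
    primary, separator, secondary = stripped.partition("<<")
    # Any run of fillers inside the secondary identifier becomes a single one.
    secondary = re.sub(r"<+", "<", secondary).strip("<")
    rebuilt = prefix + primary + separator + secondary
    # Re-pad with fillers up to the fixed line length.
    return rebuilt.ljust(TD3_LINE_LENGTH, "<")


print(normalize_td3_name_line("P<XXXAA<<BBBBBB<<CCCCC<DD<<<<<<<<<<<<<<<<<<<"))
```

The normalised line no longer matches what is printed on the document, so a step like this only suits cases where reading the name matters more than reproducing the MRZ exactly.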

Arg0s1080 commented 3 years ago

It's pretty weird. ICAO specs are quite flexible and leave many things to the discretion of the issuing State, but others are very strict. It's also not rare to find organizations that do not meet the specs.

Out of curiosity, may I know what country it is?

You can modify the code however you want, but the correct approach would be to add a "special case" by creating a class that overrides the official ones. (India had a similar problem; if I remember correctly, there were identifiers that started with << or something like that.)

Due to a design problem, the class name must:

Start with TD1, TD2, TD3 or Passport
End with CodeChecker or CodeGenerator

For example, TD1MyNewClassCodeGenerator or PassportOtherClassNameCodeChecker
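
The naming rule above can be checked with a short regular expression. This is a minimal sketch of the rule as stated in this thread, not the library's own check; the function name `follows_naming_rule` is hypothetical.

```python
import re

# Allowed prefixes and suffixes, exactly as listed in the rule above.
_NAME_RULE = re.compile(r"(TD1|TD2|TD3|Passport)\w*(CodeChecker|CodeGenerator)")


def follows_naming_rule(class_name: str) -> bool:
    """Return True if the class name has an allowed prefix and suffix."""
    return _NAME_RULE.fullmatch(class_name) is not None


for name in ("TD1MyNewClassCodeGenerator",
             "PassportOtherClassNameCodeChecker",
             "IndonesiaTD3CodeChecker"):
    print(name, follows_naming_rule(name))
```

The last name fails because it does not start with one of the four allowed prefixes, even though it ends correctly.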

Hc747 commented 3 years ago

Thanks very much for the feedback and solution; will go with that approach. The document was an Indonesian passport document.

Arg0s1080 / mrz

Update identifier validation in _fields.py #29