Arg0s1080 / mrz

Machine Readable Zone generator and checker for official travel documents sizes 1, 2, 3, MRVA and MRVB (Passports, Visas, national id cards and other travel documents)
GNU General Public License v3.0
339 stars 124 forks

Belgian ID Cards: Issue with Document Numbers Exceeding 9 Characters in TD1 #4

Open Removed-5an opened 5 years ago

Removed-5an commented 5 years ago

Hi there, thank you for making this library!

I have an issue with TD1, specifically scanning Belgian ID cards. If the document_number_hash digit is "<" the document will not verify.

I have checked this with 3 different Belgian ID cards and they all have "<" on index 14 of line 0.

After a ton of googling and reading specs I found an issue with the way you check document_number_hash...

Normally a document number starts at position 5 and ends at position 13, but sometimes a document number exceeds the size of its slot and the optional data field is used for the overflow. Let's take a look at this example:

IDBEL123456789<1233<<<<<<<<<<<

In this case the document number check digit is "<". When that happens, we need to look at the optional data (1233): its last non-empty character, 3, is the actual hash digit. After that we simply verify the hash of:

Document Number: 123456789<123 Hash: 3

And this should verify as True using your verify function.

from string import ascii_uppercase, digits

def hash_string(string: str) -> str:
    """
    >>> hash_string("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    '7'
    >>> hash_string("0123456789")
    '7'
    >>> hash_string("0123456789ABCDEF")
    '0'
    >>> hash_string("0")
    '0'
    """
    printable = digits + ascii_uppercase
    string = string.upper().replace("<", "0")
    weight = [7, 3, 1]
    summation = 0
    for i in range(len(string)):
        c = string[i]
        if c not in printable:
            raise ValueError("%s contains invalid characters" % string, c)
        summation += printable.index(c) * weight[i % 3]
    return str(summation % 10)

print(hash_string("123456789<123"))
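
The whole rule described above can be sketched end to end as follows. This is only an illustration and not part of the mrz library: `check_digit` and `split_td1_document_number` are hypothetical helper names, and the sketch assumes (as in the example above) that the "<" at index 14 stays in the string and counts as 0 in the checksum.

```python
from string import ascii_uppercase, digits

ALPHABET = digits + ascii_uppercase  # '0'-'9' -> 0-9, 'A'-'Z' -> 10-35


def check_digit(data: str) -> str:
    """7-3-1 weighted checksum; the filler '<' counts as 0."""
    weights = (7, 3, 1)
    total = 0
    for i, char in enumerate(data):
        value = 0 if char == "<" else ALPHABET.index(char)
        total += value * weights[i % 3]
    return str(total % 10)


def split_td1_document_number(line1: str) -> tuple:
    """Return (document_number, check_digit) from the first TD1 line."""
    number = line1[5:14]  # indexes 5-13: regular document number slot
    check = line1[14]     # index 14: check digit, or '<' for a long number
    extra = line1[15:].rstrip("<")
    if check == "<" and extra:
        # Long number: the overflow sits in the optional data and its
        # last character is the real check digit.
        number = number + "<" + extra[:-1]
        check = extra[-1]
    return number, check


number, check = split_td1_document_number("IDBEL123456789<1233".ljust(30, "<"))
print(number, check, check_digit(number) == check)
```

For the example line this prints the long number 123456789<123, the digit 3, and True.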
Arg0s1080 commented 5 years ago

Hi @5an1ty !

It seems that Belgium has 3 types of ID cards: [image]

The first one complies with ICAO specs (Belgian Citizens). The other two (Kids and Foreigners) work as you have explained, so it seems that Belgium is "twisting" the ICAO specifications.

According to ICAO 9303-5 (TD1) 4.2.2.1, the document number occupies character positions 6 to 14 of line 1 and the document number hash is at position 15, so this case is outside the scope of mrz. However, these problems can usually be solved simply by overriding a property (see issue #3). In this case, by overriding the document_number_hash property.

A possible solution could be:

#!/usr/bin/python3
# -*- coding: utf-8 -*-

from mrz.checker.td1 import *
from mrz.base.functions import hash_is_ok

class TD1BELCodeChecker(TD1CodeChecker):
    @property
    def document_number_hash(self) -> bool:
        """Return True if the hash of the document number is validated, False otherwise."""
        if self._document_number_hash == "<":
            doc_number_fin = self._optional_data.rstrip("<")
            self._document_number = self._document_number + "<" + doc_number_fin[:-1]
            self._document_number_hash = doc_number_fin[-1]
        return self._report("document number hash", hash_is_ok(self._document_number, self._document_number_hash))

Usage:

from mrz.checker.td1_belgian import TD1BELCodeChecker

# CASE 1
code_citizens = ("IDBEL590330101085020100200<<<<\n"
                 "8502016F0901015BEL<<<<<<<<<<<8\n"
                 "VAN<DER<VELDEN<<GREET<HILDE<<<")

# CASE 2
mrz_code_kids = ("IDBEL000610035<7017<<<<<<<<<<<\n"
                 "0002015F0910190BEL000201002003\n"
                 "MAES<<SOPHIE<ANN<G<<<<<<<<<<<<")

td1_check_citz = TD1BELCodeChecker(code_citizens)
print("CASE 1:%s" % td1_check_citz)

td1_check_kids = TD1BELCodeChecker(mrz_code_kids)
print("CASE 2:%s" % td1_check_kids)

# CASE 3: Let's change document number hash
mrz_code_kids = ("IDBEL000610035<7010<<<<<<<<<<<\n"
                 "0002015F0910190BEL000201002003\n"
                 "MAES<<SOPHIE<ANN<G<<<<<<<<<<<<")

td1_check_kids = TD1BELCodeChecker(mrz_code_kids)
print("CASE 3:%s" % td1_check_kids)
print("FALSES CASE 3:")
print(td1_check_kids.report_falses)

Output:

CASE 1:True
CASE 2:True
CASE 3:False
FALSES CASE 3:
[('final hash', False), ('document number hash', False)]

This solution is valid for the 3 types of Belgian ID cards. It's a very quick solution, so I'm sure it can be improved. For example, if you want to report children's and foreigners' ID cards as a warning:

    @property
    def document_number_hash1(self) -> bool:
        """Return True if the hash of the document number is validated, False otherwise."""
        ok = True
        if self._document_number_hash == "<":
            doc_number_fin = self._optional_data.rstrip("<")
            self._document_number = self._document_number + "<" + doc_number_fin[:-1]
            self._document_number_hash = doc_number_fin[-1]
            self._report("Possible Kid or Foreigner ID Card", kind=1)
            ok = not self._compute_warnings
        return self._report("document number hash",
                            ok and hash_is_ok(self._document_number, self._document_number_hash))

Output:

CASE 2:True
WARNINGS CASE 2:
['Possible Kid or Foreigner ID Card']

I hope I've helped.

Regards.

PS: I'm thinking that maybe it could be a good idea to create a folder to store all these special cases outside of ICAO specs

Removed-5an commented 5 years ago

Hi, thank you for replying!

It helps me a lot! However, your explanation is not fully correct.

I have verified 3 full Belgian eID cards (not kids or foreigners) and they also don't follow the actual ICAO 9303-5 (TD1) 4.2.2.1 spec. They have the same exception as the kids and foreigners cards like you describe above. I guess it's mostly newer cards that have a high enough document number.

It would be nice indeed to also support special cases and have them in another folder.

By the way: Unrelated to this issue, but it would be great if there was a function in your library that returns a dict of the parsed mrz.
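Until the library offers something like that, a dict can be built from the fixed TD1 character positions. A minimal sketch, assuming the standard ICAO TD1 layout (three lines of 30 characters); `parse_td1` and its key names are hypothetical and are not part of mrz, and the values are returned raw, without any check digit validation:

```python
def parse_td1(mrz_code: str) -> dict:
    """Split a TD1 MRZ (3 lines x 30 chars) into its raw fields."""
    line1, line2, line3 = mrz_code.strip().splitlines()
    if not all(len(line) == 30 for line in (line1, line2, line3)):
        raise ValueError("TD1 lines must be 30 characters long")
    # Line 3: surname and given names are separated by '<<'.
    surname, _, given_names = line3.partition("<<")
    return {
        "document_type": line1[0:2].rstrip("<"),
        "country": line1[2:5],
        "document_number": line1[5:14].rstrip("<"),
        "document_number_hash": line1[14],
        "optional_data": line1[15:30].rstrip("<"),
        "birth_date": line2[0:6],
        "birth_date_hash": line2[6],
        "sex": line2[7],
        "expiry_date": line2[8:14],
        "expiry_date_hash": line2[14],
        "nationality": line2[15:18],
        "optional_data_2": line2[18:29].rstrip("<"),
        "final_hash": line2[29],
        "surname": surname.replace("<", " ").strip(),
        "given_names": given_names.replace("<", " ").strip(),
    }


fields = parse_td1("\n".join([
    "IDBEL000610035<7017".ljust(30, "<"),
    "0002015F0910190BEL000201002003",
    "MAES<<SOPHIE<ANN<G".ljust(30, "<"),
]))
for key, value in fields.items():
    print(key.ljust(22, "."), value)
```

Slicing by position keeps the sketch independent of the checker classes, so it also works for cards that fail validation.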

Arg0s1080 commented 5 years ago

Hi again!

I understand. I'm from Spain and we also have 2 types of cards. In the old cards the national identification number is in the document_number field; in the new cards that number is assigned to the optional_data field and the document_number field holds the number of the physical support of the card (a real mess!)

I don't know if it's what you're looking for, but the library has several ways to report the result. For example, continuing with the previous example:

# CASE 3: Let's change document number hash
mrz_code = ("IDBEL000610035<7010<<<<<<<<<<<\n"
            "0002015F0910190BEL000201002003\n"
            "MAES<<SOPHIE<ANN<G<<<<<<<<<<<<")
td1_check = TD1BELCodeChecker(mrz_code)

print("CASE 3:%s" % td1_check)
print("\nList of tuples with all the fields analyzed:")
print(td1_check.report)

if not td1_check:
    print("\nList of tuples (same as above but only returns Falses):")
    print(td1_check.report_falses)
    print("\nList with errors:")  # I've never liked it (it's possible that I can change or eliminate it)
    print(td1_check.report_errors)
    print("\nList with warnings:")  # same as above
    print(td1_check.report_warnings)

for field, result in td1_check.report:
    print(field.title().ljust(30, "."), result)

Output:

CASE 3:False

List of tuples with all the fields analyzed:
[('final hash', False), ('document number hash', False), ('birth date hash', True), ('expiry date hash', True), ('document type format', True), ('valid country code', True), ('valid nationality code', True), ('birth date', True), ('expiry date', True), ('valid genre format', True), ('identifier', True), ('document number format', True), ('optional data format', True), ('optional data 2 format', True)]

List of tuples (same as above but only returns Falses):
[('final hash', False), ('document number hash', False)]

List with errors:
['false final hash', 'false document number hash']

List with warnings:
['Possible Kid or Foreigner ID Card']

Final Hash.................... False
Document Number Hash.......... False
Birth Date Hash............... True
Expiry Date Hash.............. True
Document Type Format.......... True
Valid Country Code............ True
Valid Nationality Code........ True
Birth Date.................... True
Expiry Date................... True
Valid Genre Format............ True
Identifier.................... True
Document Number Format........ True
Optional Data Format.......... True
Optional Data 2 Format........ True
imanenter commented 4 years ago

Hi Arg0s1080, thank you for your nice code. I have the same problem when generating a Belgian ID card MRZ. As you know, the document number has 12 digits and this app doesn't accept it, for example 000590448 301. I tried to put ">301" in the first optional data field, but then the first check digit goes next to the 8 (it should be next to the 1): IDBEL0005904480<301<<<<<<<<<<<. What is your idea about how to generate the Belgian ID card MRZ code?

Arg0s1080 commented 4 years ago

Hi, whats up!

This issue was solved with a "special case". It's possible to check Belgian ID cards with this class, but I think there is nothing yet to generate their MRZ code.

I'm very busy right now. However, let me study this issue again, and when I have a little free time I will try to find a solution.

BR

Arg0s1080 commented 4 years ago

Advance:

Hi again @imanenter

Although the problem is not solved yet, I now know how the Belgian ID card 'mechanism' works.

Taking your picture and two from above:

from mrz.generator.td1 import TD1CodeGenerator

# 000590448 301
print(TD1CodeGenerator("ID",              # Document type
                       "Belgium",         # Country
                       "000590448",       # Document number
                       "850101",          # Birth date
                       "F",               # Genre
                       "170203",          # Expiry date
                       "Belgium",         # Nationality
                       "Le Meunier",      # Surname
                       "Jennifer Anne",   # Given name(s)
                       "3016",            # Optional data 1
                       "85010100200"))    # Optional data 2

# 000610035 7017                          
print(TD1CodeGenerator("ID",              # Document type
                       "Belgium",         # Country
                       "000610035",       # Document number
                       "000201",          # Birth date
                       "F",               # Genre
                       "091019",          # Expiry date
                       "Belgium",         # Nationality
                       "Maes",            # Surname
                       "Sophie Ann G",    # Given name(s)
                       "7017",            # Optional data 1
                       "00020100200"))    # Optional data 2

# B10032650 08                            
print(TD1CodeGenerator("ID",              # Document type
                       "BEL",             # Country
                       "B10032650",       # Document number
                       "821020",          # Birth date
                       "F",               # Genre
                       "060131",          # Expiry date
                       "New Zealand",     # Nationality
                       "Flores",          # Surname
                       "Gema Caroline J", # Given name(s)
                       "08",              # Optional data 1
                       "82102008472"))    # Optional data 2

I got this output

IDBEL00059044803016<<<<<<<<<<<
8501019F1702035BEL850101002007
LE<MEUNIER<<JENNIFER<ANNE<<<<<

IDBEL00061003507017<<<<<<<<<<<
0002015F0910190BEL000201002003
MAES<<SOPHIE<ANN<G<<<<<<<<<<<<

IDBELB10032650008<<<<<<<<<<<<<
8210209F0601315NZL821020084722
FLORES<<GEMA<CAROLINE<J<<<<<<<

The result is (almost) correct:

As you can see, it was only necessary to set document_number to the first part, set optional_number_1 to the second part, and force document_number_hash to the string "0".

All that remains is to disable document_number_hash by using the "<" string instead.

All of this takes a long time. I will continue working on it when I next have some free time.

Regards

imanenter commented 4 years ago

hi Arg0s1080 thank u so much for ur help yes it works thank u and best regards <3

Arg0s1080 commented 4 years ago

Hi again @imanenter

Another weekend 😊...

There are many ways to solve the problem. I chose the way that I think is most correct.

Some small details still need "polishing", but it is functional.

Draft:

Taking your picture and two from above:

from mrz.special_cases.generator.belgium_id_card import TD1BELCodeGenerator

# 000590448 301
print(TD1BELCodeGenerator("ID",              # Document type
                          "Belgium",         # Country
                          "000590448 301",   # Document number
                          "850101",          # Birth date
                          "F",               # Genre
                          "170203",          # Expiry date
                          "Belgium",         # Nationality
                          "Le Meunier",      # Surname
                          "Jennifer Anne",   # Given name(s)
                          "",                # Optional data 1: This field is null. I still have to think what to do with it
                          "85010100200"))    # Optional data 2
print()
# 000610035 701 7
print(TD1BELCodeGenerator("ID",              # Document type
                          "Belgium",         # Country
                          "000610035 701",   # Document number
                          "000201",          # Birth date
                          "F",               # Genre
                          "091019",          # Expiry date
                          "Belgium",         # Nationality
                          "Maes",            # Surname
                          "Sophie Ann G",    # Given name(s)
                          "blahblah",        # Optional data 1. Canceled
                          "00020100200"))    # Optional data 2
print()
# B10032650 0 8
print(TD1BELCodeGenerator("ID",              # Document type
                          "BEL",             # Country
                          "B100326500",      # Document number
                          "821020",          # Birth date
                          "F",               # Genre
                          "060131",          # Expiry date
                          "New Zealand",     # Nationality
                          "Flores",          # Surname
                          "Gema Caroline J", # Given name(s)
                          "",                # Optional data 1. CANCELLED
                          "82102008472"))    # Optional data 2

Output:

IDBEL000590448<3016<<<<<<<<<<<
8501019F1702035BEL850101002007
LE<MEUNIER<<JENNIFER<ANNE<<<<<

IDBEL000610035<7017<<<<<<<<<<<
0002015F0910190BEL000201002003
MAES<<SOPHIE<ANN<G<<<<<<<<<<<<

IDBELB10032650<08<<<<<<<<<<<<<
8210209F0601315NZL821020084722
FLORES<<GEMA<CAROLINE<J<<<<<<<

Result is (totally) correct

As you can see above, the document_number field accepts 3 formats (using your sample):

  1. "000590448301"
  2. "000590448<301"
  3. "000590448 301"

The hash is calculated automatically.

I also want to include the ability to add the hash manually:

  1. "0005904483016"
  2. "000590448<3016"
  3. "000590448 3016"
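
One way the three automatic formats could be normalized is sketched below. This is not the code of TD1BELCodeGenerator: `check_digit` and `normalize_belgian_number` are hypothetical helpers, and the sketch assumes that the first 9 characters fill the document number slot, the remainder follows a "<" filler, and the filler counts as 0 in the checksum (which reproduces the outputs shown above).

```python
from string import ascii_uppercase, digits

ALPHABET = digits + ascii_uppercase  # '0'-'9' -> 0-9, 'A'-'Z' -> 10-35


def check_digit(data: str) -> str:
    """7-3-1 weighted checksum; the filler '<' counts as 0."""
    weights = (7, 3, 1)
    total = 0
    for i, char in enumerate(data):
        value = 0 if char == "<" else ALPHABET.index(char)
        total += value * weights[i % 3]
    return str(total % 10)


def normalize_belgian_number(document_number: str) -> str:
    """Return 'first9<rest' followed by its check digit."""
    # Accept "000590448301", "000590448<301" and "000590448 301" alike.
    compact = document_number.upper().replace(" ", "").replace("<", "")
    if len(compact) <= 9:
        raise ValueError("expected a document number longer than 9 characters")
    long_number = compact[:9] + "<" + compact[9:]
    return long_number + check_digit(long_number)


for sample in ("000590448301", "000590448<301", "000590448 301"):
    # Prefix document type and country, then pad line 1 to 30 characters.
    print(("IDBEL" + normalize_belgian_number(sample)).ljust(30, "<"))
```

All three inputs yield the same first line as the generator output above, IDBEL000590448<3016 padded with "<".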

BR

imanenter commented 4 years ago

wooooowww thank u vvvveeeeery much, its cool i really appreciate it mannn

vamshi-7 commented 3 years ago

@Arg0s1080 Is there any country other than Belgium that does not follow the TD1 format?

Arg0s1080 commented 3 years ago

Hi there, @vamshi-7

The problem with the TD1 format is that countries use it for national ID cards, driver's licenses and other non-international documents, so it's very probable that many countries do not strictly comply with ICAO specs. That's why there are usually fewer problems with passports and visas.

Someone long ago reported a problem with German ID cards and a special case was created, but surely Belgium and Germany are not the only countries that "break or twist" specs.

Why do you ask?

BR

vamshi-7 commented 3 years ago

Hey @Arg0s1080 ,

Firstly, thank you for the reply. I am a student at uni-koblenz, currently working as an intern. My research is on extracting text from travel documents. So far, in my limited experience, all TD3 type documents follow the specs properly except Germany's. I am confused about the TD1 type after seeing these Belgian cards.

But do other countries only break or twist the specs for the first line of the MRZ? As far as I can see, apart from Germany, other countries are not twisting the specs for the second and third lines. Please correct me if I am wrong.

Moreover, apart from Google, are there any other open sources for obtaining a dataset of such images?

BR

Arg0s1080 commented 3 years ago

Hi again,

I'm glad and I hope everything goes well for you!! In reality problems should not exist. The specs are unobjectionable (strict enough and flexible enough). Problems usually appear when "national data" is moved to document_number, optional_data and optional_data_2 fields (as in this issue of Belgium), but it is rare to find such problems in passports (TD3's) and visas (TD2's).

I highly doubt that you will find a good dataset to train a neural network or to test a project at scale. Keep in mind that this is private and very sensitive data (that's why this project has been in beta for years). I know some students have used mrz.generator to train a NN, so I guess they didn't find a better option.

Why do you say that Germany does not maintain proper ICAO specs? Is it because of its country code ("D": only one letter) or another reason?

BR

vamshi-7 commented 3 years ago

Hi,

Yes, it's difficult to find data even to test the algorithm, especially for Belgian cards. And yes, I was referring to Germany's country code.

Thanks and BR.

typelogic commented 3 years ago

Given these two TD1 MRZ values:

IDSLV0012345678<<<<<<<<<<<<<<<
9306026F2708252SLV<<<<<<<<<<<4
JOHN<SMEAGOL<<WENDY<LIESSETTEF

And then another one,

IDSLVOO12345678<<<<<<<<<<<<<<<
9306026F2708252SLV<<<<<<<<<<<4
JOHN<SMEAGOL<<WENDY<LIESSETTEF

When scanned via OCR, the result can read either 0012345678 or OO12345678 and still pass all check digit checks. Now, which is which?
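
The ambiguity comes from the checksum itself. With the 7-3-1 weights, the letter "O" has the value 24, so "OO" in the first two positions contributes 24*7 + 24*3 = 240, which is 0 modulo 10, exactly what "00" contributes. A short sketch of this (the `check_digit` helper is a hypothetical stand-in, not the mrz API):

```python
from string import ascii_uppercase, digits

ALPHABET = digits + ascii_uppercase  # '0'-'9' -> 0-9, 'A'-'Z' -> 10-35


def check_digit(data: str) -> str:
    """7-3-1 weighted checksum; the filler '<' counts as 0."""
    weights = (7, 3, 1)
    total = 0
    for i, char in enumerate(data):
        value = 0 if char == "<" else ALPHABET.index(char)
        total += value * weights[i % 3]
    return str(total % 10)


# Document number field (positions 6-14) of the two MRZ values above.
for number in ("001234567", "OO1234567"):
    print(number, "->", check_digit(number))
```

Both readings give the check digit 8 that is printed at position 15, so the check digits alone cannot tell the two readings apart.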

accensi commented 2 years ago

According to the latest edition of ICAO 9303, Part 5 explains how to compute the DV when the document number exceeds the original field size: https://www.icao.int/publications/Documents/9303_p5_cons_en.pdf

Part 5. Specifications for TD1 Size Machine Readable Official Travel Documents (MROTDs)

4.2.4 Check digits in the MRZ

The method of calculating check digits is given in Doc 9303-3. For the TD1, the data structure of the machine readable lines in Paragraph 4.2.2 provides for the inclusion of four check digits as follows:

Document number check digit
  Character positions (upper MRZ line) used to calculate check digit: 6 – 14
  Check digit position (upper MRZ line): 15

or Long document number check digit
  Character positions (upper MRZ line) used to calculate check digit: 6 – 14, 16 – 28
    Note: Position 15 contains '<' and is excluded from the check digit calculation. The position of the last digit of a long document number is in the range of 16 – 28.
  Check digit position (upper MRZ line): 17 – 18 (one digit only)
    Note: Since the check digit follows the last digit of the document number, its position is in the range of 17 – 29. The check digit is followed by '<'.
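
Read literally, this rule differs from the workaround earlier in this thread: ICAO leaves the "<" at position 15 out of the calculation, whereas the Belgian checker above keeps it in the string, where it counts as 0 but still shifts the weights of the characters after it. A hedged sketch of both readings, with a hypothetical `check_digit` helper and "excluded" interpreted as simply skipping the character:

```python
from string import ascii_uppercase, digits

ALPHABET = digits + ascii_uppercase  # '0'-'9' -> 0-9, 'A'-'Z' -> 10-35


def check_digit(data: str) -> str:
    """7-3-1 weighted checksum; the filler '<' counts as 0."""
    weights = (7, 3, 1)
    total = 0
    for i, char in enumerate(data):
        value = 0 if char == "<" else ALPHABET.index(char)
        total += value * weights[i % 3]
    return str(total % 10)


# First line of a sample generated earlier in this thread.
line1 = "IDBEL000590448<3016".ljust(30, "<")
first = line1[5:14]             # positions 6-14
tail = line1[15:].rstrip("<")   # positions 16 onward, ending with the check digit
extra, printed = tail[:-1], tail[-1]

skipped = check_digit(first + extra)      # position 15 left out entirely
kept = check_digit(first + "<" + extra)   # position 15 kept, valued 0

print("printed:", printed, "skipped:", skipped, "kept:", kept)
```

For this sample the printed digit is 6; skipping position 15 gives 2, while keeping it gives 6. So only the second reading reproduces the digit generated in this thread. Whether real cards behave the same way is not something this sketch can establish.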