Add callback to optionally "repair" fields

mjl commented 3 years ago

I'm not really sure this functionality belongs here, but as the knowledge of the MRZ internal structure is only present in this module, why not... let me know what you think!

I work with scanned MRZ and, as is inherent to the process, the OCR sometimes misreads similar characters. For example, I have seen countries read as "R0U" or a name read as "SZ0BO5ZLAI". The MRZ checker correctly warns that the nationality or the identifier is not valid. However, if you could add a repair() method to the checkers:

def __init__(self, mrz_code: str, check_expiry=False, compute_warnings=False, precheck=True):
    precheck and check.precheck("TD1", mrz_code, 92)
    lines = mrz_code.splitlines()
    self._document_type = self.repair('document type', lines[0][0:2])
    self._country = self.repair('country', lines[0][2:5])
    [...]

def repair(self, field_name: str, field_content: str):
    return field_content

that would allow me to do things like:

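class MyChecker(TD1CodeChecker):
    def repair(self, name, content):
        if name in ('country', 'identifier', ...):
            # I know those can only contain alphas
            return self.replace_often_mistaken_numbers_by_alphas(content)

        if name in ('expiry date', 'birth date'):
            # Dates can only contain digits
            return self.replace_often_mistaken_alphas_by_numbers(content)

        # Every other field is passed through unchanged
        return content

    def replace_often_mistaken_numbers_by_alphas(self, s):
        return s.replace('5', 'S').replace('1', 'I').replace('0', 'O')

    def replace_often_mistaken_alphas_by_numbers(self, s):
        # Inverse of the mapping above
        return s.replace('S', '5').replace('I', '1').replace('O', '0')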

This would make the checker more useful when presented with badly scanned data.

The alternative would be that I somehow preprocess the MRZ, but then I would have to re-implement the MRZ structure definition in my code too. As said above, I'm not a big fan of shoehorning that functionality into this module, but I don't see any other place that has enough knowledge of the MRZ structure.
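To make the proposal concrete, here is a minimal, self-contained sketch of the hook. `FieldSplitter` and `OcrFieldSplitter` are hypothetical stand-ins, not the real `TD1CodeChecker`, and only the two slices shown above are reproduced. The point is only that a subclass can correct a field without knowing where that field sits in the MRZ.

```python
class FieldSplitter:
    """Hypothetical stand-in for a checker; not part of the mrz package."""

    def __init__(self, mrz_code: str):
        lines = mrz_code.splitlines()
        # Same slices as in the snippet above: both fields live on line 1.
        self.document_type = self.repair('document type', lines[0][0:2])
        self.country = self.repair('country', lines[0][2:5])

    def repair(self, field_name: str, field_content: str) -> str:
        # Default hook: leave the field untouched.
        return field_content


class OcrFieldSplitter(FieldSplitter):
    # Example mapping only: 5 -> S, 1 -> I, 0 -> O.
    DIGIT_TO_ALPHA = str.maketrans('510', 'SIO')

    def repair(self, field_name: str, field_content: str) -> str:
        if field_name == 'country':
            return field_content.translate(self.DIGIT_TO_ALPHA)
        return field_content


line1 = "IDR0U" + "<" * 25  # made-up first line, 30 characters
print(OcrFieldSplitter(line1).country)  # ROU
```

The base class keeps its default behaviour, so existing callers would see no change unless they override `repair`.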

Arg0s1080 commented 3 years ago

Hi!

Yeah... that functionality should be out of the scope of the project, but heck! Why not? In fact, almost everything in mrz.checker is already off target xDD

Almost all of the project (especially checker) was built from other people's requests and some ideas of my own (some very bad), and I now realize that I should have planned many things differently. I'm actually trying to fix some of those bad ideas now, specifically the horrible _Report class.

Please give me a few days to finish what I'm doing with checker and we'll see what we can do.

I don't know what you will think, but one option could be to transliterate the desired characters with a dict, in the same way mrz.generator does with surnames and given names.

Something like this:

def __init__(self, mrz_code: str, check_expiry=False, compute_warnings=False, ocr_transliteration=None):
    """
    Params:
        mrz_code             (str):  MRZ string of TD1's. Must be 90 uppercase characters long
        check_expiry        (bool):  If it's set to True, it is verified and reported as warning that the
                                     document is not expired and that expiry_date is not greater than 10 years
        compute_warnings    (bool):  If it's set True, warnings compute as False
        ocr_transliteration (dict):  Transliteration dictionary for OCR purposes. None by default
    """
    [...]

I have some doubts:

~~Should some specific fields be repaired or could it be applied to all mrz code?~~
- Oops, sorry, I didn't think about it too much. Obviously the repairs must depend on the type of field: purely numeric fields such as dates must convert letters into numbers, and fields such as identifier must convert detected numbers into letters.

Are corrections usually the same for everyone, or does each person have their own specific corrections? I ask this to add a dictionary to the project (or several if there are not many). There would always be the possibility of using your own external dictionary, though.

EDIT: Oops, sorry! I didn't think it through. Obviously repairs must depend on the field type: purely numeric fields such as dates must convert letters into numbers, and fields such as identifier, document_type or country must convert detected numbers into letters. What I don't know is what kind of solution you use to repair alphanumeric fields.
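A rough sketch of what field-dependent repairs with a plain dictionary could look like. The shape of the dictionary, the kind names "alphabetic" and "numeric", and the `transliterate` helper are all assumptions made for illustration; none of them exist in the mrz package. Alphanumeric fields are deliberately left alone here, since it is an open question how to repair them.

```python
# Hypothetical shape for an OCR transliteration dictionary: one plain
# character mapping per kind of field. The mappings are examples only.
OCR_TRANSLITERATION = {
    "alphabetic": {"0": "O", "1": "I", "5": "S"},
    "numeric": {"O": "0", "I": "1", "S": "5"},
}


def transliterate(value: str, kind: str, tables=OCR_TRANSLITERATION) -> str:
    """Apply the mapping for `kind`; unknown kinds are returned unchanged."""
    mapping = tables.get(kind)
    if not mapping:
        return value
    return value.translate(str.maketrans(mapping))


print(transliterate("R0U", "alphabetic"))  # ROU
print(transliterate("8O0I15", "numeric"))  # 800115
```

A caller with different OCR error patterns could pass its own `tables` argument instead of the built-in one.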

It's too late here. Please let me think about it a little more calmly. If I can't think of anything better, yours might be a good solution.

By the way, one of the rules for using classes that inherit from TD1CodeChecker, TD2CodeChecker, TD1CodeGenerator and all the others is that the class name must start with the document type, for example TD1MyCodeChecker or TD2OCRChecker. Only the following strings are allowed: "TD1", "TD2", "TD3", "Passport", "MRVA", "MRVB". Otherwise document_type will be False (another thing that I don't like and must change).

mjl commented 3 years ago

> Should some specific fields be repaired or could it be applied to all mrz code?

I guess it makes sense to apply it to all the fields that have constraints on them as to what data they can contain. It probably is not useful to have a callback for "this field may contain anything", but if one knows it is characters only, or digits only, or a date...

> Are corrections usually the same for everyone, or does each person have their own specific corrections? I ask this to add a dictionary to the project (or several if there are not many)

It would probably be the same for everybody, if the MRZ source is the same (i.e. if I scan 1000 ID cards, then they will probably all have the same classes of errors).

Your ocr_transliteration dict could be something along the lines:

   {
   'alpha': callback_for_replacing_numbers,
   'digit': callback_for_replacing_chars,
   }

Perhaps having specialisations for 'date' might make sense, and fall back to 'digit' if not present?

I'm partial to having callbacks instead of just a static mapping dictionary (1->L, 5->S), but I can live with the static mapping too. The transliteration should run before the hash checks and the other sanity checks.
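A small sketch of how such a callback dictionary with a 'date' fallback might be dispatched. The function names, the `FALLBACK` table and `apply_transliteration` are hypothetical; only the 'alpha' / 'digit' / 'date' keys come from the suggestion above.

```python
def digits_to_letters(s: str) -> str:
    # Example callback for alpha-only fields.
    return s.replace('5', 'S').replace('1', 'I').replace('0', 'O')


def letters_to_digits(s: str) -> str:
    # Example callback for digit-only fields.
    return s.replace('S', '5').replace('I', '1').replace('O', '0')


# A more specific kind falls back to a more general one when it has no callback.
FALLBACK = {'date': 'digit'}


def apply_transliteration(kind: str, content: str, callbacks: dict) -> str:
    callback = callbacks.get(kind)
    if callback is None:
        # FALLBACK.get() yields None for kinds without a fallback, and
        # callbacks.get(None) is then None as well.
        callback = callbacks.get(FALLBACK.get(kind))
    return callback(content) if callback else content


callbacks = {'alpha': digits_to_letters, 'digit': letters_to_digits}
print(apply_transliteration('alpha', 'SZ0BO5ZLAI', callbacks))  # SZOBOSZLAI
print(apply_transliteration('date', '9O0I01', callbacks))       # 900101
```

In a checker, this step would have to run on each field before the check digits are verified, so that the checks see the repaired values.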

TanjaBayer commented 3 years ago

This sounds really great. Right now, my workaround for that issue is:

- use TD1CodeChecker to get the fields
- apply some specific functions (a bit more sophisticated than just replacing values, because there is often more than one possible replacement character)
- use the updated fields dict as kwargs for the TD1CodeGenerator to generate the MRZ again (the problem here is that the output fields have different names than the input fields, e.g. given_names vs names, country vs country_code, which is not that nice; would you accept a merge request for that?)
- use the TD1CodeChecker again, now on the updated MRZ

For sure this also applies to TD2 and TD3 and the others.
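When a character has more than one possible replacement, one way to choose between candidates is to keep only the readings whose check digit matches. The sketch below does not use the mrz package. The check digit rule (weights 7, 3, 1; digits as themselves, A to Z as 10 to 35, '<' as 0; sum modulo 10) is the one from ICAO Doc 9303, while the `CONFUSIONS` table and the function names are assumptions for illustration. It also assumes the check digit itself was read correctly.

```python
from itertools import product

# Hypothetical confusion table: each character maps to the readings worth
# trying, with the character as scanned listed first.
CONFUSIONS = {'O': 'O0', '0': '0O', 'I': 'I1', '1': '1I', 'S': 'S5', '5': '5S'}


def check_digit(field: str) -> str:
    """Check digit with weights 7, 3, 1 as described in ICAO Doc 9303."""
    total = 0
    for i, ch in enumerate(field):
        if '0' <= ch <= '9':
            value = int(ch)
        elif ch == '<':
            value = 0
        else:
            value = ord(ch) - ord('A') + 10  # A=10 ... Z=35
        total += value * (7, 3, 1)[i % 3]
    return str(total % 10)


def candidates(field: str):
    """Yield every reading obtained by swapping confusable characters."""
    options = [CONFUSIONS.get(ch, ch) for ch in field]
    for combo in product(*options):
        yield ''.join(combo)


def plausible_readings(field: str, expected_check_digit: str) -> list:
    return [c for c in candidates(field) if check_digit(c) == expected_check_digit]


# 'O' was scanned where the document number has a zero.
print(plausible_readings("L8989O2C3", "6"))  # ['L898902C3']
```

The number of candidates grows exponentially with the number of confusable characters, so this only suits short fields such as document numbers and dates.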

Still, I am wondering whether there are any plans to work on that?

Arg0s1080 commented 3 years ago

I made a commitment to add this feature a long time ago and have not kept my word. I'm not normally like that, but my current circumstances have robbed me of most of my time.

When @mjl created this issue I thought about giving "a twist to his idea", but the truth is that I do not have the time or the experience in CV to do it.

YES OF COURSE, YOUR PR WILL BE WELCOME and you will have my eternal gratitude :1st_place_medal:. Ideally, it would work for all documents. If you propose a PR we can look at it (and if possible, and if @mjl is not very angry, he could also get involved or at least give his opinion).

Thank you very much in advance

mjl commented 3 years ago

@arg0s Don't worry, we all fall off the train sometimes when life happens.

The feature is on my back burner too at this moment in time, but if anyone has ideas/comments/code, feel free to discuss here!

Arg0s1080 / mrz

Add callback to optionally "repair" fields #24