ZsBT / mrz-java

Machine-Readable Zone parser for Java

A MrzParseException is thrown when the date fields are not parseable #15

Closed GoUpNorth closed 6 years ago

GoUpNorth commented 6 years ago

When parsing the following MRZs, mrz-java throws an MrzParseException and stops parsing.

Unparseable date of birth:

"P<GBRUK<SPECIMEN<<ANGELA<ZOE<<<<<<<<<<<<<<<<"
"9250764733GBRBB09157F2007162<<<<<<<<<<<<<<08"

Unparseable date of expiry:

"P<GBRUK<SPECIMEN<<ANGELA<ZOE<<<<<<<<<<<<<<<<"
"9250764733GBR8809117F20HH162<<<<<<<<<<<<<<08"

I suggest that the library continue parsing, keep track of the raw text that was unparseable, and return an MrzModel to the caller.

ZsBT commented 6 years ago

Well, I don't think this issue is within the scope of this project. The root cause is a faulty OCR step earlier in the pipeline, e.g. the digit "8" is often recognized as "B". I recommend the following procedure, which works in my environment: when parsing fails, replace unstable characters (8<=>B, M<=>H, 2<=>Z etc.) and retry for as long as an exception is thrown or check digit validation fails. For certain reasons, precise name parsing is also important: imagine the situation where the above-mentioned name appears as SPEO1HEN ANGEL4 20E. Of course, this could be handled within this project. However, it requires massive work: we'd need a "corrector" class that tries to parse the input string several times until the check digits of all fields are valid (keeping in mind that a check digit itself can also be misrecognized).
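The retry procedure described above can be sketched for a single numeric field. This is only an illustration under stated assumptions, not code from mrz-java: the class `NumericFieldCorrector` and its methods are hypothetical, and of the lookalike pairs only B->8 and Z->2 come from the comment (the others are added for illustration). The check digit follows the ICAO 9303 scheme (weights 7, 3, 1, sum modulo 10).

```java
import java.util.Map;

/** Illustrative sketch only; this class is hypothetical and not part of mrz-java. */
final class NumericFieldCorrector {

    // Letter-to-digit lookalikes. B->8 and Z->2 are mentioned in the thread;
    // O->0, I->1 and S->5 are assumptions added for illustration.
    private static final Map<Character, Character> LOOKALIKES = Map.of(
            'B', '8', 'Z', '2', 'O', '0', 'I', '1', 'S', '5');

    /** ICAO 9303 check digit: weights 7, 3, 1 repeating, sum modulo 10. */
    static int checkDigit(String field) {
        int[] weights = {7, 3, 1};
        int sum = 0;
        for (int i = 0; i < field.length(); i++) {
            sum += charValue(field.charAt(i)) * weights[i % 3];
        }
        return sum % 10;
    }

    private static int charValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
        if (c == '<') return 0;
        throw new IllegalArgumentException("Invalid MRZ character: " + c);
    }

    /**
     * Replaces lookalike letters with digits. Returns the corrected field if it
     * is then all digits and matches the expected check digit, otherwise null.
     */
    static String correct(String field, char expectedCheckDigit) {
        StringBuilder sb = new StringBuilder(field.length());
        for (char c : field.toCharArray()) {
            char mapped = LOOKALIKES.getOrDefault(c, c);
            sb.append(mapped);
        }
        String candidate = sb.toString();
        boolean allDigits = candidate.chars().allMatch(ch -> ch >= '0' && ch <= '9');
        boolean checkOk = allDigits && checkDigit(candidate) == expectedCheckDigit - '0';
        return checkOk ? candidate : null;
    }
}
```

With this sketch, `NumericFieldCorrector.correct("BB0911", '7')` yields "880911", the date of birth field of the second example in the report. It does not handle the case where the check digit itself was misread, which a real corrector would also have to consider.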

jaaufauvre commented 6 years ago

The example above ("BB0915") is a particular case where indeed the caller can implement some rules as you explained (8 <=> B, ...), and call the library again with a "better" MRZ.

But there will still remain cases where you can't fix the MRZ completely.

In those cases you won't have any information at all, because the library has stopped everything and thrown an exception without returning any result.

You could just return null objects instead.

Looking at the code, it seems this is already the case sometimes (partial results instead of an exception), so it could be extended to all fields and all causes of parsing errors.

Unless you want the library to be a parser of valid MRZ only.

Alex.

ZsBT commented 6 years ago

Indeed, an "Exception without any result" does not help to uncover the problematic part of the MRZ line. What are the ideas for the behavior when the date is totally unparseable? Leaving that property null? That sounds reasonable. Should you have a solution already, I am always open to pull requests (please use the dev branch).

In the meantime, as we are talking about error handling, I am getting closer to planning that "corrector" class.

GoUpNorth commented 6 years ago

I will make a pull request that leaves the property null if the date parsing fails.

We could also set the day, month or year property of the date to -1 when parsing fails specifically on that element, and still try to parse the other elements of the date. That way the MrzModel could be returned with a partially parsed date. For the MRZ-formatted date "BB0915", it would give something like this:

model.dateOfBirth.year = -1
model.dateOfBirth.month = 9
model.dateOfBirth.day = 15
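A minimal sketch of this component-by-component parsing could look as follows. The class `PartialMrzDate` and its members are hypothetical names chosen for illustration and are not the MrzDate API of mrz-java; the range checks (month 1-12, day 1-31) are an assumption, and the day is not validated against the month.

```java
/** Illustrative sketch only; this class is hypothetical and not part of mrz-java. */
final class PartialMrzDate {
    final int year;   // 0-99, or -1 if the "yy" part is unparseable
    final int month;  // 1-12, or -1 if the "mm" part is unparseable
    final int day;    // 1-31, or -1 if the "dd" part is unparseable

    private PartialMrzDate(int year, int month, int day) {
        this.year = year;
        this.month = month;
        this.day = day;
    }

    /** Parses a six-character "yymmdd" field, one component at a time. */
    static PartialMrzDate parse(String yymmdd) {
        if (yymmdd == null || yymmdd.length() != 6) {
            throw new IllegalArgumentException("Expected 6 characters: " + yymmdd);
        }
        return new PartialMrzDate(
                component(yymmdd.substring(0, 2), 0, 99),
                component(yymmdd.substring(2, 4), 1, 12),
                component(yymmdd.substring(4, 6), 1, 31));
    }

    /** Returns the numeric value, or -1 if non-numeric or out of range. */
    private static int component(String s, int min, int max) {
        for (char c : s.toCharArray()) {
            if (c < '0' || c > '9') return -1;
        }
        int value = Integer.parseInt(s);
        return (value >= min && value <= max) ? value : -1;
    }

    /** True only if all three components were parsed. */
    boolean isComplete() {
        return year >= 0 && month > 0 && day > 0;
    }
}
```

For the field "BB0915", `PartialMrzDate.parse` returns year -1, month 9 and day 15, and `isComplete()` is false, so the caller still receives the month and day.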

ZsBT commented 6 years ago

Good workaround. With this, we can get as much data as possible. Can I ask you to also take care of the validity flags, i.e. to set all applicable ones (even the overall one) to false? That helps users of the code know that something is wrong with the MRZ lines.

GoUpNorth commented 6 years ago

What do you mean by "all the applicable validity flags"? If the date is not parseable because of some OCR failure, the check digit calculation will fail anyway, won't it?

ZsBT commented 6 years ago

Exactly, so the code should ensure that in MrzRecord the relevant "valid*" booleans are set to false.

jaaufauvre commented 6 years ago

Hi,

Would it be possible to distinguish between the check digit verification results and whether the fields can actually be parsed?

Indeed, the check digit can be correct while the date is still unparseable (I have some examples like this where the MRZ comes from fraudulent passports).

The names of the 4 booleans in the MrzRecord class are too ambiguous in my opinion. They should be named along the lines of "validDateOfBirthCheckDigit" instead of "validDateOfBirth".

The MrzDate could also have a boolean coming along with the year, month and day to indicate if the string was actually valid or not. [Edit] I just saw it's already the case in the development branch ("isValidDate").
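To illustrate the distinction, here is a small sketch that reports the two results separately. `DateFieldStatus` and its fields are hypothetical names, not the MrzRecord or MrzDate API; the check digit uses the ICAO 9303 weights 7, 3, 1, and "parseable" is assumed to mean all digits with month 1-12 and day 1-31.

```java
/** Illustrative sketch only; this class is hypothetical and not part of mrz-java. */
final class DateFieldStatus {
    final boolean checkDigitValid; // the check digit matches the raw field
    final boolean parseable;       // the raw field is a plausible yymmdd date

    private DateFieldStatus(boolean checkDigitValid, boolean parseable) {
        this.checkDigitValid = checkDigitValid;
        this.parseable = parseable;
    }

    static DateFieldStatus of(String yymmdd, char checkDigit) {
        return new DateFieldStatus(
                computeCheckDigit(yymmdd) == checkDigit - '0',
                isParseable(yymmdd));
    }

    /** ICAO 9303 check digit, or -1 if the field has a non-MRZ character. */
    private static int computeCheckDigit(String field) {
        int[] weights = {7, 3, 1};
        int sum = 0;
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            int value;
            if (c >= '0' && c <= '9') value = c - '0';
            else if (c >= 'A' && c <= 'Z') value = c - 'A' + 10;
            else if (c == '<') value = 0;
            else return -1;
            sum += value * weights[i % 3];
        }
        return sum % 10;
    }

    /** Assumed rule: six digits, month in 1-12, day in 1-31. */
    private static boolean isParseable(String yymmdd) {
        if (yymmdd.length() != 6) return false;
        for (char c : yymmdd.toCharArray()) {
            if (c < '0' || c > '9') return false;
        }
        int month = Integer.parseInt(yymmdd.substring(2, 4));
        int day = Integer.parseInt(yymmdd.substring(4, 6));
        return month >= 1 && month <= 12 && day >= 1 && day <= 31;
    }
}
```

For example, the field "651502" followed by check digit 5 has a matching check digit but month 15, so `checkDigitValid` is true while `parseable` is false.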

Best regards, Alex.

GoUpNorth commented 6 years ago

@Alex-D14 There already is a flag in the MrzDate to indicate if the date is valid or not. The problem is that the date validity booleans have an effect on the checkDigits boolean.

GoUpNorth commented 6 years ago

@ZsBT Since the MrzDate year, month and day can be set to -1 in case of an unparseable date, we have to change the behaviour of the MrzDate.toMrz() function.

Right now it just formats the date properties as an MRZ date ("yymmdd"). But if the original date was not parseable, for example "651502", MrzDate.toMrz() will return "65-102", which doesn't really make sense. It should be able to return the original text even if it is not a valid MRZ date. What do you think?
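One way to sketch this is to keep the raw text next to the parsed components and prefer it when formatting. `RawPreservingDate` is a hypothetical class for illustration and does not claim to match the actual MrzDate constructor or fields.

```java
/** Illustrative sketch only; this class is hypothetical and not part of mrz-java. */
final class RawPreservingDate {
    final int year;   // -1 if unparseable
    final int month;  // -1 if unparseable
    final int day;    // -1 if unparseable
    private final String raw; // original six characters from the MRZ, may be null

    RawPreservingDate(int year, int month, int day, String raw) {
        this.year = year;
        this.month = month;
        this.day = day;
        this.raw = raw;
    }

    /** Returns the original MRZ text when known, otherwise formats yymmdd. */
    String toMrz() {
        if (raw != null) {
            return raw;
        }
        return String.format("%02d%02d%02d", year, month, day);
    }
}
```

With the raw text stored, `new RawPreservingDate(65, -1, 2, "651502").toMrz()` returns "651502". Without it, the same components format to "65-102", which is the behaviour described above.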

ZsBT commented 6 years ago

The purpose of the booleans validDateOfBirth and its expiry counterpart is to quickly indicate that something is wrong with that field. Further inspection of the MrzDate object (isValidDate, and maybe a new boolean validCheckdigit?) could show what the exact issue is. @GoUpNorth yes, keeping the original date value would also help the planned "Corrector" class guess what the proper value on the MRZ line was.

ZsBT commented 6 years ago

I am closing this issue, as the main subject is fixed by pull requests #17 and #19. We can continue the discussion about the check digit booleans in a new thread.