Open cilynx opened 2 years ago
Note to self that I got a couple pieces of junk mail today that would be good test cases for this threshold -- they're almost identical other than the header text in one subsection. Looks like they sent me both the A and B for an A/B mailer test.
Need to think about how to differentiate between "both A and B are logical so the docs are probably different even though the change is small" and "the difference looks like an OCR error and the docs are probably the same".
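One rough way to approximate that distinction, sketched here under assumptions: if every word that differs is a real word on both sides, the change is probably intentional; if one side contains a non-word, it is more likely OCR noise. The function name `differences_look_like_ocr_noise` and the `vocabulary` argument are hypothetical, and a real word list would have to come from somewhere else (a dictionary file, for example).

```python
from difflib import SequenceMatcher


def differences_look_like_ocr_noise(a: str, b: str, vocabulary: set[str]) -> bool:
    """Return False as soon as one differing span is readable on both sides.

    `vocabulary` is a hypothetical set of known lowercase words.
    Identical texts have no differing spans, so they return True.
    """
    words_a, words_b = a.lower().split(), b.lower().split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        changed_a, changed_b = words_a[i1:i2], words_b[j1:j2]
        # Both variants consist only of known words: probably a deliberate
        # edit (e.g. an A/B mailer), not a misread character.
        if all(w in vocabulary for w in changed_a) and all(
            w in vocabulary for w in changed_b
        ):
            return False
    return True
```

This will misjudge numbers, names, and anything else missing from the word list, so it is only a starting point for the heuristic, not a solution.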
We could potentially have an "escalate to the user" function as well after import -- have a "potential duplicate" attribute that, when drilled into, would bring up the potential duplicates side by side for human confirmation with the differences highlighted. Let the user decide if we should keep one, the other, or both.
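A minimal sketch of how the differences for that side-by-side view could be collected, assuming the extracted text of both documents is available as plain strings. `word_differences` is a hypothetical helper; it only finds the differing word spans and leaves the actual highlighting to whatever UI displays them.

```python
from difflib import SequenceMatcher


def word_differences(a: str, b: str) -> list[tuple[str, str, str]]:
    """List (operation, text_in_a, text_in_b) for every non-matching span."""
    words_a, words_b = a.split(), b.split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is "replace", "delete", or "insert"
            differences.append(
                (tag, " ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
            )
    return differences


print(word_differences("Save 10% today only", "Save 20% today only"))
```

An empty list means the word sequences match exactly, so there would be nothing to escalate.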
15 takes care of this for the "exact same checksum" use-case, but I think it would be fun to try to detect duplicate scanned documents as well. As the images will be different, checksums won't work. OCR won't be perfect either, so it's going to be some sort of "this extracted text looks a whole lot like that extracted text" sort of thing. Curious if I'll be able to threshold that to prevent dupes without blocking similar but not identical documents.
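As a starting point for the "looks a whole lot like" comparison, here is a sketch using `difflib.SequenceMatcher` from the Python standard library. The function names and both threshold values are assumptions picked for illustration and would need tuning against real scans; the middle band is there so that near-matches can be flagged instead of blocked.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout noise does not count."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two extracted texts."""
    return SequenceMatcher(None, normalize(a), normalize(b), autojunk=False).ratio()


def classify(a: str, b: str, dup_threshold: float = 0.98,
             review_threshold: float = 0.90) -> str:
    """Bucket a pair of texts; both thresholds are illustrative guesses."""
    score = similarity(a, b)
    if score >= dup_threshold:
        return "duplicate"
    if score >= review_threshold:
        return "potential duplicate"
    return "distinct"
```

`SequenceMatcher` is quadratic in the worst case, so for long documents or large libraries something cheaper would probably be needed as a first pass before comparing candidates this way.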