Open cilynx opened 2 years ago
Date scanned should be another input to the decision. It's highly unlikely that the official document date should be later than the date it was scanned. For junk mail (yes, I collect metrics on my junk mail...), I often get things with no send date, but something like "offer valid through $future_date". Right now said future date is selected as the document date whereas date of scan is probably more reasonable as it will be closer to when said document was received.
"Generated" is another good keyword for when a document was....generated.
Right now the algorithm simply selects the latest date it find in the document, which isn't always the one we want.
It works for things like receipts and most bank statements, unless they talk about future dates in the fine print.
Lots of documents have magic labels like "Issue Date" or "Statement Date" physically close to a date that is probably the one we want.
If we don't have obvious labels, ranges in context could be interesting as well. If it looks like a bank statement and it has a date range, it's likely that the date we want is the end of that range. Vehicle registrations are the opposite, however, so if it looks like a registration document, the opening date of the range is probably the one we want.
Of course "looks like" is a fun concept on its own. I'll probably start with a small hand-curated set of words to look for and if I get really ambitious, I'll play with reinforcement learning.