facebook / duckling

Language, engine, and tooling for expressing, testing, and evaluating composable language rules on input strings.

Ambiguous years in `dd/mm/yy` format #667

Open emlautarom1 opened 2 years ago

emlautarom1 commented 2 years ago

One of the most common formats for dates is dd/mm/yy (and variants like dd.mm.yy) which sadly can introduce a lot of ambiguities. For example, consider the input 3/4/10. Valid interpretations are:

But also, consider what happens with the year:

As you can see, one particular ambiguity is the year, mainly because we accept only 2 digits as a valid year. Duckling then converts 2-digit numbers to a year between 1950 and 2050 through the `year` helper:

https://github.com/facebook/duckling/blob/dd70d80dc1c9b47a8b68ce36ba12c0da4e376d3e/Duckling/Time/Helpers.hs#L446-L450

For example, the number 3 gets converted to the year 2003, 20 to 2020 and 40 to 2040.
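To make the fixed window concrete, here is a minimal standalone sketch of that kind of conversion. It is not Duckling's actual code: `fixedWindowYear` is a hypothetical name, and how the boundary value 50 is mapped (1950 here) is an assumption. It only reproduces the examples above.

```haskell
-- Hypothetical helper illustrating a fixed 1950..2049 window.
-- This is NOT Duckling's `year` helper, only a sketch of the idea.
fixedWindowYear :: Int -> Int
fixedWindowYear n
  | n < 0 || n > 99 = n          -- not a two-digit value: leave untouched
  | n < 50          = 2000 + n   -- 0..49  -> 2000..2049
  | otherwise       = 1900 + n   -- 50..99 -> 1950..1999 (boundary is an assumption)
```

Note that the reference time never appears in this function, which is what the rest of this issue is about.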

The issue here is that we are ignoring perfectly valid interpretations of inputs like "1/1/40" as January 1, 1940. Currently, January 1, 2040 is the only answer we get. It's true that in practice, when dealing with 2-digit years, we assume that the year must be close to the reference time.

Here is the first issue: the current `year` helper converts 2 digits to years between 1950 and 2050 independently of the reference time. If we used 1/1/1821 as the reference time, then dates like `"1/1/25"` would return `January 1, 2025`, when `January 1, 1825` is probably the desired answer.
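One way to make the conversion depend on the reference time is to pick the matching year closest to the reference year. The sketch below is only an illustration under that assumption: `closestYear` is a hypothetical name, it works on plain `Int` years rather than Duckling's time types, and it does not try to settle how ties (exactly 50 years on either side) should be resolved.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Hypothetical helper: expand a two-digit year `n` (0..99) to the full year
-- nearest to `refYear`. Not part of Duckling.
closestYear :: Int -> Int -> Int
closestYear refYear n = minimumBy (comparing distance) candidates
  where
    century    = refYear - refYear `mod` 100            -- e.g. 1821 -> 1800
    candidates = [century - 100 + n, century + n, century + 100 + n]
    distance y = abs (y - refYear)
```

With 1821 as the reference year, `closestYear 1821 25` gives 1825, and with 2013 as the reference year, `closestYear 2013 40` gives 2040.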


My idea is to recreate the current month behavior for years. If we say April, Duckling returns a list of the next three Aprils, relative to the year of the reference time. For years, this would mean that if we say 20, Duckling would return a list of valid years, like 1920, 2020 and 2120. The main difference is that the resulting months are always "in the future", while for years at least one answer should be in the past. We could, for example, take the last two matching years and the next one, so for 20 we would return 1820, 1920 and 2020 (with 2013 as the reference year).
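The "last two and the next one" rule can be sketched on plain integers as follows. This is only an illustration of the arithmetic, not a Duckling predicate: `candidateYears` is a hypothetical name, and counting a year equal to the reference year as "past" is an assumption.

```haskell
-- Hypothetical helper: for a two-digit year `n` (0..99), return the two most
-- recent matching years at or before `refYear`, followed by the next one.
candidateYears :: Int -> Int -> [Int]
candidateYears refYear n = [latestPast - 100, latestPast, latestPast + 100]
  where
    -- Largest year <= refYear whose last two digits are `n`.
    latestPast = refYear - (refYear - n) `mod` 100
```

With 2013 as the reference year, `candidateYears 2013 20` yields `[1820, 1920, 2020]`, matching the example above.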

As a first step, I would remove the conversion from the year predicate and leave the input number untouched. Then I would look into `runYearPredicate` and try to mimic `runMonthPredicate`:

https://github.com/facebook/duckling/blob/dd70d80dc1c9b47a8b68ce36ba12c0da4e376d3e/Duckling/Time/Types.hs#L555-L577

Sadly, I got stuck at this part since I don't really understand how Duckling runs predicates. I would appreciate any hints on how this could work, whether it's feasible, and what regressions this change might introduce.