comtravo / ctparse

Parse natural language time expressions in python
https://www.comtravo.com
MIT License

Year gets parsed as hours and minutes #132

Open yunus-decathlon opened 1 year ago

yunus-decathlon commented 1 year ago

Description

Hi, thanks for making this! I'm trying to parse a simple string like 17. August 2020, but the year gets interpreted as hours and minutes. Am I missing something?

What I Did

In [6]: for res in ctparse('Datum: 17. August 2020', ts=datetime.now(), debug=True):
   ...:     print(res)
   ...:
2023-08-17 20:20 (X/X) s=-367.858 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDateTOD')
2023-06-17 X:X (X/X) s=-1989.968 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleLatentDOM')
X-08-X X:X (X/X) s=-1296.821 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleLatentDOM')
2023-06-01 20:20 (X/X) s=-1702.286 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleLatentDOM')
2023-08-17 X:X (X/X) s=-781.179 p=(108, 103, 130, 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY')
2023-08-17 20:20 (X/X) s=-357.082 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDateTOD')
2023-06-17 X:X (X/X) s=-1984.401 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM')
X-08-X X:X (X/X) s=-1291.254 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM')
2023-06-01 20:20 (X/X) s=-1696.719 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM')
2023-08-17 X:X (X/X) s=-773.528 p=(108, 103, 130, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY')
2020-X-X X:X (X/X) s=-1701.773 p=(126, 111, 'ruleDDMM', 'ruleLatentDOY', 'ruleYear')
2020-08-17 X:X (X/X) s=-379.974 p=(126, 111, 'ruleDDMM', 'ruleYear', 'ruleDOYYear')
X-08-17 X:X (X/X) s=-790.539 p=(126, 111, 'ruleYear', 'ruleDDMM')
2020-X-X X:X (X/X) s=-1701.432 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleYear')
2020-08-17 X:X (X/X) s=-379.009 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleYear', 'ruleDOYYear')
X-X-17 X:X (X/X) s=-1994.620 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleYear', 'ruleMonthYear')
2020-08-X X:X (X/X) s=-695.337 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleYear', 'ruleMonthYear')
2020-08-17 X:X (X/X) s=-372.308 p=(108, 103, 111, 'ruleYear', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleDOYYear')
2020-X-X X:X (X/X) s=-1694.826 p=(108, 103, 111, 'ruleYear', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY')
sebastianmika commented 1 year ago

Hi, the problem you are facing is that ctparse has a built-in (strong) bias towards parsing current or future dates, relative to the reference time. Hence parsing this date in 2023 will strongly favour anything in 2023. And off the top of my head I see no easy way to solve that (it was a design decision matching the application this was built for). Of course, if you can adjust your reference time to something in the past, it will work, but at the price of other features (e.g. "tomorrow") no longer working as you would expect.
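
One way to limit that trade-off is to move the reference time into the past only when the text itself names a past year, so that inputs like "tomorrow" keep resolving against the real current time. The following is a minimal sketch of that idea. `pick_reference_time` is a hypothetical helper, not part of ctparse, and the four-digit-year regex is an assumption that will misfire when a token like "2020" is really meant as a time of day.

```python
import re
from datetime import datetime

# Hypothetical helper, not part of ctparse: chooses the value to pass as `ts`.
# Assumption: a standalone 19xx/20xx token in the text is a year.
_YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def pick_reference_time(text, now):
    """Return 1 January of the earliest past year named in `text`, else `now`."""
    years = [int(y) for y in _YEAR.findall(text)]
    past_years = [y for y in years if y < now.year]
    if not past_years:
        # No explicit past year: keep the real reference time so that
        # relative expressions such as "tomorrow" behave as usual.
        return now
    return datetime(min(past_years), 1, 1)


# Intended use (requires ctparse to be installed, not executed here):
#   from ctparse import ctparse
#   text = "Datum: 17. August 2020"
#   result = ctparse(text, ts=pick_reference_time(text, datetime.now()))
```

Whether the top-scoring resolution then drops the spurious 20:20 time is not something this sketch guarantees; it only changes which year the bias favours.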

yunus-decathlon commented 1 year ago

I see, thanks for clarifying! Would it make sense to have a rule like ruleDDNamedMonthYYYY, which is a common way of writing dates? There already is ruleDDMMYYYY.

sebastianmika commented 1 year ago

Sure, go ahead and give it a try (and then maybe in a way that also covers things like "May 20th, 2023"). What might happen is that this breaks other productions. Maybe you can just extend the ruleDDMMYYYY to accept a named month - ruleDDMM does it already.
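
To make the two target formats concrete before touching the rule set, here is a standalone sketch of what such a rule would need to match. It uses plain `re` and does not use ctparse's rule API; the names `DAY_FIRST`, `MONTH_FIRST` and `find_named_month_date` are made up for illustration, and only full English month names are covered (which happens to include "August").

```python
import re
from datetime import date

# Illustration only: plain regexes, not ctparse rules.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}
_MONTH = "|".join(MONTHS)
_ORDINAL = r"(?:st|nd|rd|th)?"

# "17. August 2020", "17 August 2020", "17th August 2020"
DAY_FIRST = re.compile(
    rf"\b(?P<day>\d{{1,2}}){_ORDINAL}\.?\s+(?P<month>{_MONTH})\s+(?P<year>\d{{4}})\b",
    re.IGNORECASE,
)
# "May 20th, 2023", "May 20 2023"
MONTH_FIRST = re.compile(
    rf"\b(?P<month>{_MONTH})\s+(?P<day>\d{{1,2}}){_ORDINAL},?\s+(?P<year>\d{{4}})\b",
    re.IGNORECASE,
)


def find_named_month_date(text):
    """Return the first day/named-month/year date found in `text`, or None."""
    for pattern in (DAY_FIRST, MONTH_FIRST):
        match = pattern.search(text)
        if match is None:
            continue
        try:
            return date(
                int(match["year"]),
                MONTHS[match["month"].lower()],
                int(match["day"]),
            )
        except ValueError:
            # Matched the shape but not a real calendar date (e.g. day 31 in June).
            continue
    return None
```

Running this over a handful of real inputs is a cheap way to collect the positive and negative cases a new or extended rule would have to handle.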

But actually, looking at your example, it is mainly a scoring problem - here I ordered the resolutions by score and there isn't much missing:

2023-08-17 20:20 (X/X) s=-357.082 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDateTOD')
2023-08-17 20:20 (X/X) s=-367.858 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDateTOD')
2020-08-17 X:X (X/X) s=-372.308 p=(108, 103, 111, 'ruleYear', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleDOYYear')
2020-08-17 X:X (X/X) s=-379.009 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleYear', 'ruleDOYYear')
2020-08-17 X:X (X/X) s=-379.974 p=(126, 111, 'ruleDDMM', 'ruleYear', 'ruleDOYYear')
2020-08-X X:X (X/X) s=-695.337 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleYear', 'ruleMonthYear')
2023-08-17 X:X (X/X) s=-773.528 p=(108, 103, 130, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY')
2023-08-17 X:X (X/X) s=-781.179 p=(108, 103, 130, 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY')
X-08-17 X:X (X/X) s=-790.539 p=(126, 111, 'ruleYear', 'ruleDDMM')
X-08-X X:X (X/X) s=-1291.254 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM')
X-08-X X:X (X/X) s=-1296.821 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleLatentDOM')
2020-X-X X:X (X/X) s=-1694.826 p=(108, 103, 111, 'ruleYear', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY')
2023-06-01 20:20 (X/X) s=-1696.719 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM')
2020-X-X X:X (X/X) s=-1701.432 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleYear')
2020-X-X X:X (X/X) s=-1701.773 p=(126, 111, 'ruleDDMM', 'ruleLatentDOY', 'ruleYear')
2023-06-01 20:20 (X/X) s=-1702.286 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleLatentDOM')
2023-06-17 X:X (X/X) s=-1984.401 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM')
2023-06-17 X:X (X/X) s=-1989.968 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleLatentDOM')
X-X-17 X:X (X/X) s=-1994.620 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleYear', 'ruleMonthYear')

So maybe adding more examples of this kind to the data set used to train the scorer is the easier way to solve this problem.
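
For anyone who wants to reproduce the score-ordered view from their own debug output, here is a small sketch that works on the printed lines only. It assumes each line contains a `s=<number>` field as in the output above; `score_of` and `sort_by_score` are illustrative names, not ctparse functions.

```python
import re

# Assumption: every debug line carries its score as "s=<float>".
_SCORE = re.compile(r"\bs=(-?\d+(?:\.\d+)?)")


def score_of(line):
    """Extract the score from one printed debug line."""
    match = _SCORE.search(line)
    if match is None:
        raise ValueError(f"no score found in line: {line!r}")
    return float(match.group(1))


def sort_by_score(lines):
    """Return the lines ordered from best (highest) to worst score."""
    return sorted(lines, key=score_of, reverse=True)


# Three lines copied from the debug output in this issue.
debug_lines = [
    "2023-08-17 20:20 (X/X) s=-367.858 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDateTOD')",
    "2020-08-17 X:X (X/X) s=-372.308 p=(108, 103, 111, 'ruleYear', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleDOYYear')",
    "2023-08-17 20:20 (X/X) s=-357.082 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDateTOD')",
]

ranked = sort_by_score(debug_lines)
best_2020 = next(line for line in ranked if line.startswith("2020-"))
# How far the best 2020 resolution trails the overall winner.
gap = score_of(ranked[0]) - score_of(best_2020)
print(f"gap between top result and best 2020 result: {gap:.3f}")
```

On these lines the gap is 15.226 score units, which is the "isn't much missing" margin that additional training examples would need to close.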