arrow-py / arrow

🏹 Better dates & times for Python
https://arrow.readthedocs.io
Apache License 2.0

Consider handling incomplete dates and times #829

Closed workingenius closed 4 years ago

workingenius commented 4 years ago

Feature Request

In our project, which works on plain text, we encounter many dates like "2017-05" or "August 3rd" that contain only partial information. When I call arrow.get("2017-05"), the result is <Arrow [2017-05-01T00:00:00+00:00]>. The missing day is filled in with a default of "01", and afterwards there is no way to tell that the date was originally incomplete.

But we want the "lack of information" to be preserved so that we can decide how to deal with it later. Maybe we simply discard the incomplete dates, maybe we set a different default such as the 15th, or maybe we do some calculation to work out the correct day and complete the date.

I wonder if arrow could handle such problems. Thank you!

jadchaar commented 4 years ago

Hi @workingenius, thanks for reaching out. You could specify a custom format string; if data is missing, a ParserMatchError will be raised:

env ❯ python3
Python 3.8.3 (default, Jul  8 2020, 14:27:55)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import arrow
>>> arrow.get("2017-05")
<Arrow [2017-05-01T00:00:00+00:00]>
>>> arrow.get("2017-05", "YYYY-MM-DD")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jadchaar/Downloads/env/lib/python3.8/site-packages/arrow/api.py", line 21, in get
    return _factory.get(*args, **kwargs)
  File "/Users/jadchaar/Downloads/env/lib/python3.8/site-packages/arrow/factory.py", line 243, in get
    dt = parser.DateTimeParser(locale).parse(args[0], args[1])
  File "/Users/jadchaar/Downloads/env/lib/python3.8/site-packages/arrow/parser.py", line 226, in parse
    raise ParserMatchError(
arrow.parser.ParserMatchError: Failed to match 'YYYY-MM-DD' when parsing '2017-05'

You can even pass a list of formats to arrow.get; if none of them match, a ParserError is raised:

env ❯ python3
Python 3.8.3 (default, Jul  8 2020, 14:27:55)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import arrow
>>> arrow.get("2017-05")
<Arrow [2017-05-01T00:00:00+00:00]>
>>> arrow.get("2017-05", ["YYYY-MM-DD", "YYYY-MM-DDTHH"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jadchaar/Downloads/env/lib/python3.8/site-packages/arrow/api.py", line 21, in get
    return _factory.get(*args, **kwargs)
  File "/Users/jadchaar/Downloads/env/lib/python3.8/site-packages/arrow/factory.py", line 243, in get
    dt = parser.DateTimeParser(locale).parse(args[0], args[1])
  File "/Users/jadchaar/Downloads/env/lib/python3.8/site-packages/arrow/parser.py", line 219, in parse
    return self._parse_multiformat(datetime_string, fmt)
  File "/Users/jadchaar/Downloads/env/lib/python3.8/site-packages/arrow/parser.py", line 530, in _parse_multiformat
    raise ParserError(
arrow.parser.ParserError: Could not match input '2017-05' to any of the following formats: YYYY-MM-DD, YYYY-MM-DDTHH

You can then wrap this call in a try/except or keep these "faulty" date strings in a data structure to handle them later (e.g. manually mutate the Arrow object to add a new default value).
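A minimal sketch of that try/except approach, assuming the formats are tried one at a time from most to least specific so the caller learns which one matched. The helper name parse_with_precision and the precision labels are hypothetical and not part of Arrow; only arrow.get and arrow.parser.ParserError come from the library.

```python
import arrow
from arrow.parser import ParserError

# Formats ordered from most to least specific; the labels are illustrative only.
FORMATS = [("YYYY-MM-DD", "day"), ("YYYY-MM", "month"), ("YYYY", "year")]


def parse_with_precision(text):
    """Return (Arrow, precision) for the first matching format, else (None, None)."""
    for fmt, precision in FORMATS:
        try:
            return arrow.get(text, fmt), precision
        except ParserError:
            continue  # this format did not match; try the next, coarser one
    return None, None


# Sort raw strings into buckets so incomplete dates can be handled later.
complete, incomplete, unparsed = [], [], []
for raw in ["2017-05-03", "2017-05", "sometime in spring"]:
    parsed, precision = parse_with_precision(raw)
    if parsed is None:
        unparsed.append(raw)
    elif precision == "day":
        complete.append(parsed)
    else:
        incomplete.append((raw, parsed, precision))
```

The Arrow objects in the incomplete bucket still carry the default "01" values, so the recorded precision is the only trace of what was missing.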

workingenius commented 4 years ago

@jadchaar Thank you for your advice. I'll try specifying format strings and catching ParserError.

But once the Arrow object has been created successfully, the information about incompleteness is lost: whether the date was incomplete, and which parts were unknown. Of course, I can create my own class that wraps this information together with an Arrow object, but then it loses all the methods that arrow supports. For example, such objects could not be compared with each other (partial dates can sometimes be compared, like '2019-02' > '2018-02'), and they could no longer be shifted (three months after '2018-02' should be '2018-05').

In our project, we wrote our own Date class (we don't care about time for now) in which both month and day are allowed to be None, meaning unknown. It also has some simple methods (far fewer than arrow implements) for comparing, setting default values, checking whether a date is complete, and diffing two dates to get a timedelta. So I wonder whether the need for incomplete date handling is general enough that we could build something independent of any particular project. It's just a rough idea; what do you think about it?
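To illustrate the shifting case, here is a small sketch that moves a year-month pair by a number of months without ever inventing a day. The function name shift_months is hypothetical; it uses plain integer arithmetic and is not an Arrow API.

```python
def shift_months(year, month, months):
    """Shift a (year, month) pair by `months`, which may be negative."""
    # Count months from year 0 so that divmod handles year rollover.
    total = year * 12 + (month - 1) + months
    new_year, month_index = divmod(total, 12)
    return new_year, month_index + 1


three_months_later = shift_months(2018, 2, 3)   # February 2018 plus three months
across_new_year = shift_months(2018, 11, 3)     # rolls over into the next year
```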
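A rough sketch of what such a class might look like, using only the standard library. The name PartialDate and its methods are assumptions made for illustration; this is not the project's actual class and not part of Arrow.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PartialDate:
    """Hypothetical date whose month and day may be unknown (None)."""

    year: int
    month: Optional[int] = None
    day: Optional[int] = None

    @property
    def is_complete(self) -> bool:
        # Complete only when both optional parts are known.
        return self.month is not None and self.day is not None

    def with_defaults(self, month: int = 1, day: int = 1) -> date:
        """Fill unknown parts with caller-chosen defaults and return a real date."""
        return date(
            self.year,
            self.month if self.month is not None else month,
            self.day if self.day is not None else day,
        )


may_2017 = PartialDate(2017, 5)             # day unknown
mid_month = may_2017.with_defaults(day=15)  # caller decides to use the 15th
```

Keeping the defaults as arguments leaves the decision about how to complete a date with the caller, which is the behaviour requested at the top of this thread.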

jadchaar commented 4 years ago

I talked with @krisfremen and @systemcatch, and we think this functionality is best left to the user, since it is a niche use case and official support would require big changes to the API that Arrow provides. We think subclassing Arrow to expand its functionality may be the best course of action here.

This kind of ties in with the idea of fuzzy parsing (https://github.com/arrow-py/arrow/issues/409) and internally marking the parsing that we are unsure about. If we decide to go through with the fuzzy parsing, we will revisit this proposal.

jadchaar commented 4 years ago

If you would like to contribute this feature to Arrow though, we'd be happy to review the PR.

workingenius commented 4 years ago

Yeah, I also think it would mean a big change to the current API. The code might no longer even "look like arrow".

A simple example: binary comparison operators are not going to work. ">" and "<" can only return True or False, which leaves no room for "Unknown". So we would have to change them to something like:

# a = arrow.incomplete_get('...')  # suppose we have that "incomplete_get"
# b = arrow.incomplete_get('...')
is_greater_than = a.gt(b)
if is_greater_than is True:
    pass
elif is_greater_than is False:
    pass
elif is_greater_than is arrow.UNKNOWN:
    pass

Or, slightly more refined:

if a.gt(b):
    pass

if not a.gt(b):
    pass

if a.gt(b).is_unknown:
    pass
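As a sketch of how such a three-valued comparison could behave, the following standard-library example compares (year, month, day) tuples in which None marks an unknown part. The Tri enum and the gt function are hypothetical names; arrow.incomplete_get and arrow.UNKNOWN above are likewise imagined, not existing Arrow APIs.

```python
from enum import Enum
from typing import Optional, Tuple

# (year, month, day); None marks an unknown part.
Partial = Tuple[int, Optional[int], Optional[int]]


class Tri(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


def gt(a: Partial, b: Partial) -> Tri:
    """Three-valued 'a is later than b'."""
    for x, y in zip(a, b):
        if x is None or y is None:
            # Every more significant part was equal, so the answer
            # depends on information we do not have.
            return Tri.UNKNOWN
        if x != y:
            return Tri.TRUE if x > y else Tri.FALSE
    return Tri.FALSE  # all parts known and equal


different_years = gt((2019, 2, None), (2018, 2, None))  # decided by the year
same_month = gt((2016, 4, None), (2016, 4, None))       # the day would decide
```

The comparison is decided by the first known part that differs and only falls back to the unknown result when it reaches a missing part first.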

Even __eq__ would no longer be trivial. What if incomplete_get("2016-04-xx") == incomplete_get("2016-04-xx")? Should it be True or False? The two values probably refer to different dates, but on the other hand they contain exactly the same information, and sometimes we want to treat them as equal.

There are other things to consider carefully. I'll recheck every part and see if it can be done better: something useful enough to cover most cases, elegant (to some extent), and without a big impact on the current API. Or it may just turn out to be an ad hoc need. Let's see.
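One way to make the two readings of equality explicit is to give them separate names instead of overloading "==". This is only a sketch on (year, month, day) tuples with None for unknown parts; both function names are hypothetical.

```python
from typing import Optional, Tuple

# (year, month, day); None marks an unknown part.
Partial = Tuple[int, Optional[int], Optional[int]]


def same_information(a: Partial, b: Partial) -> bool:
    """True when both values record exactly the same known and unknown parts."""
    return a == b


def could_be_same_day(a: Partial, b: Partial) -> bool:
    """True when no known part contradicts the other value."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))


april = (2016, 4, None)
identical_info = same_information(april, (2016, 4, None))
compatible = could_be_same_day(april, (2016, 4, 17))
```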