closeio / ciso8601

Fast ISO8601 date time parser for Python written in C
MIT License
565 stars 45 forks source link

Invalid datetime can produce partial parse. Should return None instead #26

Closed movermeyer closed 6 years ago

movermeyer commented 7 years ago

I came across this silent failure when parsing a third party datetime with an unusual separator:

>>> import ciso8601

>>> ciso8601.parse_datetime('2017-05-29 17:36:00Z')
datetime.datetime(2017, 5, 29, 17, 36, tzinfo=<UTC>)

#Now for a completely invalid string
>>> print(ciso8601.parse_datetime('Completely invalid!'))
None

#Now for an out of spec string (invalid separator)
>>> ciso8601.parse_datetime('2017-05-29_17:36:00Z')
datetime.datetime(2017, 5, 29, 0, 0)

It seems that the parser sees a character that it doesn't recognize and stops parsing?

I would have expected that if the entire string cannot be parsed, it would return None as parse_datetime does for other parse failures, not silently return a datetime without time or tzinfo.

This lead to a sneaky bug in the code.

thomasst commented 7 years ago

Thanks for reporting this. Right now the library returns None for some (but not all) invalid input. PRs are welcome :)

movermeyer commented 7 years ago

I'm not sure what sort of PR you'd be expecting. If it were my decision (ie. my library), I think I would change it to return None for all invalid input.

If that's what you'd like to see, I could take a shot at that. But it would be a change of behaviour, so anyone who has written code that relies on this behaviour would need to change their code.

As an example of usage that would change, this test would change to expect None

movermeyer commented 7 years ago

Further, my opinion to return None in all invalid cases, is not shared by all other users. See #6, where @aJanuary wants it to stop parsing and return when it finds junk.

thomasst commented 7 years ago

I don't think changing the behavior is bad as long as we increase the major/minor version number, or maybe introduce a strict kwarg (if it's set, we return None if the string is out-of-spec, if it's not set we stop parsing once we encounter an out-of-spec character but return any parsed part, if valid). Personally I haven't had a use case where I needed to parse invalid dates, so I'm okay with always being strict.

This library was originally designed to take valid (trusted) input and its main focus is on parsing it as efficiently as possible into a datetime, which is why parsing invalid strings may sometimes result in unexpected behavior (and there are several issues open about this).

@suhailpatel Do you have any thoughts on this?

suhailpatel commented 7 years ago

I think being strict is the best approach (eg return None if you see anything 'invalid' whilst parsing the string), otherwise you do end up with bugs like the ones mentioned where you get partial dates or partial times if you are too liberal or return partial times.

Also as we found in https://github.com/closeio/ciso8601/pull/30, the cPython PyDatetime interface does accept invalid input in some cases which then causes errors when you manipulate badly parsed datetimes so we definitely need to ensure we have good checking in our C parsing code on our side.

peterbe commented 6 years ago

I would argue that returning None is wrong entirely. Raising an exception feels a lot more pythonic:

>>> import datetime
>>> datetime.datetime.strptime('junk', '%Y-%m-%dT%H:%M:%S.%fZ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/_strptime.py", line 362, in _strptime
    (data_string, format))
ValueError: time data 'junk' does not match format '%Y-%m-%dT%H:%M:%S.%fZ'
peterbe commented 6 years ago

Me taking a terrible stab at changing the behavior entirely: https://github.com/closeio/ciso8601/pull/38 Help needed. And votes too :)

movermeyer commented 6 years ago

I've made an attempt at making this change. I'll be making a PR as soon as I've resolved anything that comes out of this discussion. You can see my initial comments and a [largely outdated] draft PR here.

v1.x problem

The problem with the current version of cio8601 is its error handling. In its current (v1.0.7) state:

As a developer with a given timestamp string, I cannot predict a priori what ciso8601 is going to return without looking at the code. Fundamentally, this is the problem that I'm hoping to address.

My v2 Invariant

In my mind, the best way to resolve this ambiguity (in the case of invalid timestamps) is to premise version 2 of ciso8601 on the following invariant:

parse_datetime(dt: String): datetime is a function that takes a string and either:

My current work seems to indicate that this imposes a <10% performance penalty over v1.0.7 (but I'll continue profiling to make sure there's no more optimization to do).

Conflict with parse_datetime_unaware

parse_datetime_unaware existed for the case where your input timestamp had tzinfo, but you wanted to ignore the tzinfo and therefore could save some cycles by not parsing the tzinfo characters. parse_datetime already handles both tz-aware and tz-naive timestamps.

However, shortcircuiting the parse conflicts with the v2 invariant, as it is no longer checking the entirity of the string.

For example, if you had the invalid timestamp 2014-01-05T12:34:56ABCDEFG and ran it through parse_datetime_unaware, it would not raise a ValueError, since it would stop parsing after the :56. However, this timestamp is invalid and should raise a ValueError under the v2 invariant.

My response to this conflict was to drop parse_datetime_unaware. I've never found this use case to be compelling. Ignoring tzinfo on a tz-aware timestamp is almost never what you want to do (after all, there is a reason the tzinfo exists in the first place).

If this were truly what you desired as a developer, you could still acheive this functionality by explicitly dropping the tzinfo after the parse:

    dt = parse_datetime("2018-01-01T00:00:00+05:00").replace(tzinfo=None)

So the capability is still there. It just removes the performance boost for this case in exchange for a consistent interface.

@thomasst had the following comment:

I'd like to keep parse_datetime_unaware. It could either ignore the timezone information, or only accept timezone-less datetime strings. But if we're strict here, should parse_datetime then reject datetime strings that don't have a timezone?

I'll now go through each line and rebut:

It could either ignore the timezone information

This breaks the invariant of matching the entire string, as discussed above.

or only accept timezone-less datetime strings

parse_datetime_unaware offers no value then. parse_datetime_unaware was only there for the case of better performance for timezone-aware timestamps. parse_datetime can already handle both tz-aware and tz-naive timestamps without issue.

But if we're strict here, should parse_datetime then reject datetime strings that don't have a timezone?

This would be a very bad idea. Both tz-aware and tz-naive timestamps produce valid datetimes. There is no distiction between them in the ISO 8601 spec.

Imagine that you made parse_datetime only parse tz-aware timestamps, and parse_datetime_unaware only parse tz-naive timestamps. This would then force the developer to know a priori which they are processing. If they were parsing a mixed stream of valid tz-aware and tz-naive timestamps, they would have to somehow implement a way of detecting the difference (by parsing the timestamps), which would defeat much of the purpose of using ciso8601.

Alternative Invariant

Instead of dropping parse_datetime_unaware, it is conceivable that we could come up with a different invariant that addresses the v1 problem. I have given it some thought and have yet come up with a more useful one than the one presented above.

As a compromise, I would consider renaming parse_datetime_unaware to something like parse_datetime_as_naive_unsafe (naive is the term that Python uses, not unaware) which would have a different invariant:

parse_datetime_as_naive_unsafe(datetime:str): datetime is a function that takes a string and either:

This alternative invariant would then be documented in the README. Something like

Note: that unlike parse_datetime this will not catch all invalid timestamps. It will not raise ValueError in the case of an invalid timestamp having trailing characters (ex. 2014-01-05T12:34:56ABCDEFG). Only use parse_datetime_as_naive_unsafe if you are certain your data doesn't suffer from this data quality problem (or you don't care if it does).

Note that the name change (parse_datetime_as_naive_unsafe) is done intentionally so that:

  1. Developers don't assume that it is the default way to parse naive datetimes (that is parse_datetime)
  2. There is a recognition that this function is breaking the (nicer) invariant offering perfect safety from data quality errors.

Conclusion

I am completely open to suggestions for how to better address the issue.

Finally, obviously you are the maintainers and I'll defer to your ultimate authority.

movermeyer commented 6 years ago

I've gone ahead made the changes to implement parse_datetime_as_naive_unsafe (really, just a rename and some docs changes). I like it.

If you're fine with that solution, I'll get a PR together.

thomasst commented 6 years ago

Raises an Exception (ex. ValueError) with a description of the reason why the string doesn't conform to the supported subset of ISO 8601

FYI I'm fine with that, but I'm also okay if we skip the exact reason if it benefits performance or simplifies the code.

parse_datetime_unaware existed for the case where your input timestamp had tzinfo, but you wanted to ignore the tzinfo and therefore could save some cycles by not parsing the tzinfo characters.

It also exists for the case where your code doesn't (need to) deal with aware datetimes. People have different opinions on whether to use aware or naive datetimes, but e.g. http://lucumr.pocoo.org/2011/7/15/eppur-si-muove/ suggests that "internally always use offset naive datetime objects". Personally, I also prefer to use naive datetimes throughout the code (which are assumed to be in UTC), and most of my use cases actually use parse_datetime_unaware.

It could either ignore the timezone information

This breaks the invariant of matching the entire string, as discussed above.

An idea is to still ensure that the string is valid ISO8601 in its entirety (i.e. you reject strings with an invalid time zone offset), but ignore the timezone information when returning the Python string (or add the offset to the naive datetime so it's always UTC). Would have to check what the actual performance penalty is.

or only accept timezone-less datetime strings parse_datetime can already handle both tz-aware and tz-naive timestamps without issue.

Just to be clear, would parse_datetime still always return a tz-aware datetime (even if there is no TZ information in the datetime string)?

peterbe commented 6 years ago

I really like the idea of replacing "unaware" with "naive".

I guess the use of the term "unsafe" is because a date time like 2018-05-09T17:51:46.134881-04:00 could become datetime.datetime(2018, 5, 9, 17, 51, 46, 134881) which is loss of information, and attempts to convert it to a timestamp will become a different floating point number than 2018-05-09T17:51:46.134881-04:00 would be if done correctly. But perhaps if the function was parse_datetime_as_naive the output from 2018-05-09T17:51:46.134881-04:00 could then be datetime.datetime(2018, 5, 9, 21, 51, 46, 134881). I.e. 17:51pm in America/New_York becomes 21:51pm in UTC but without the timezone info.

movermeyer commented 6 years ago

@thomasst

An idea is to still ensure that the string is valid ISO8601 in its entirety (i.e. you reject strings with an invalid time zone offset), but ignore the timezone information when returning the Python string (or add the offset to the naive datetime so it's always UTC). Would have to check what the actual performance penalty is.

Ah, I see. You are perfectly right. I've profiled it and the performance penalty is almost entirely in the pytz.FixedOffset call. So parse_datetime_unaware still could validate the entire string, including tz characters for conformance.

Just to be clear, would parse_datetime still always return a tz-aware datetime (even if there is no TZ information in the datetime string)?

No. parse_datetime returns an aware datetime if the input string has tzinfo, and a naive datetime otherwise. This is not a change from v1.0.7

I think that clears things up for me. There will be parse_datetime and parse_datetime_as_naive

TZ Aware Input TZ Naive Input
parse_datetime() output tz aware datetime tz naive datetime
parse_datetime_as_naive() output tz naive datetime tz naive datetime

Opinion

Personally, I also prefer to use naive datetimes throughout the code (which are assumed to be in UTC), and most of my use cases actually use parse_datetime_unaware.

That's fine, though I see this as a case where you should be using parse_datetime. If somehow a non-UTC datetime were to be introduced into the system, you would notice right away when a comparision between naive and aware datetimes occurred, instead of silently dropping the tzinfo by using parse_datetime_unaware. In my mind, parse_datetime_unaware should only be used at the "front doors" into your system. By using parse_datetime everywhere else, you make sure that there isn't a door you don't know about (and the performance is the same for naive timestamps). Almost all of your uses of cio8601 should be parse_datetime, not parse_datetime_unaware.

This is why I want to change the name to parse_datetime_as_naive. The as_naive forces you to acknowledge that this is not just a parser for naive timestamps. If you just needed that, you could simply use parse_datetime (with no performance penalty). Instead parse_datetime_as_naive is meant for the case where you want and need to strip the tz information for performance reasons.

Alternative names could be parse_datetime_ignore_tzinfo or parse_datetime_strip_tzinfo

thomasst commented 6 years ago

Thanks for this. Having both parse_datetime and parse_datetime_as_naive as you described sounds good to me!

movermeyer commented 6 years ago

This issues is fixed in version 2.0.0 (#43).

Attempting to parse any out-of-spec timestamp now raises a ValueError.

python-hacked commented 1 year ago

value = parse_datetime(value).date() ValueError: Invalid character while parsing year ('-', Index: 2)