Closed movermeyer closed 6 years ago
Thanks for reporting this. Right now the library returns None
for some (but not all) invalid input. PRs are welcome :)
I'm not sure what sort of PR you'd be expecting. If it were my decision (ie. my library), I think I would change it to return None
for all invalid input.
If that's what you'd like to see, I could take a shot at that. But it would be a change of behaviour, so anyone who has written code that relies on this behaviour would need to change their code.
As an example of usage that would change, this test would change to expect None
Further, my opinion to return None
in all invalid cases, is not shared by all other users. See #6, where @aJanuary wants it to stop parsing and return when it finds junk.
I don't think changing the behavior is bad as long as we increase the major/minor version number, or maybe introduce a strict
kwarg (if it's set, we return None
if the string is out-of-spec, if it's not set we stop parsing once we encounter an out-of-spec character but return any parsed part, if valid). Personally I haven't had a use case where I needed to parse invalid dates, so I'm okay with always being strict.
This library was originally designed to take valid (trusted) input and its main focus is on parsing it as efficiently as possible into a datetime, which is why parsing invalid strings may sometimes result in unexpected behavior (and there are several issues open about this).
@suhailpatel Do you have any thoughts on this?
I think being strict is the best approach (eg return None if you see anything 'invalid' whilst parsing the string), otherwise you do end up with bugs like the ones mentioned where you get partial dates or partial times if you are too liberal or return partial times.
Also as we found in https://github.com/closeio/ciso8601/pull/30, the cPython PyDatetime interface does accept invalid input in some cases which then causes errors when you manipulate badly parsed datetimes so we definitely need to ensure we have good checking in our C parsing code on our side.
I would argue that returning None
is wrong entirely. Raising an exception feels a lot more pythonic:
>>> import datetime
>>> datetime.datetime.strptime('junk', '%Y-%m-%dT%H:%M:%S.%fZ')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
tt, fraction = _strptime(data_string, format)
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/_strptime.py", line 362, in _strptime
(data_string, format))
ValueError: time data 'junk' does not match format '%Y-%m-%dT%H:%M:%S.%fZ'
Me taking a terrible stab at changing the behavior entirely: https://github.com/closeio/ciso8601/pull/38 Help needed. And votes too :)
I've made an attempt at making this change. I'll be making a PR as soon as I've resolved anything that comes out of this discussion. You can see my initial comments and a [largely outdated] draft PR here.
The problem with the current version of cio8601
is its error handling.
In its current (v1.0.7) state:
None
and others might get truncated and return an incorrect Python datetime.As a developer with a given timestamp string, I cannot predict a priori what ciso8601
is going to return without looking at the code.
Fundamentally, this is the problem that I'm hoping to address.
In my mind, the best way to resolve this ambiguity (in the case of invalid timestamps) is to premise version 2 of ciso8601
on the following invariant:
parse_datetime(dt: String): datetime
is a function that takes a string and either:
ValueError
) with a description of the reason why the string doesn't conform to the supported subset of ISO 8601My current work seems to indicate that this imposes a <10% performance penalty over v1.0.7 (but I'll continue profiling to make sure there's no more optimization to do).
parse_datetime_unaware
parse_datetime_unaware
existed for the case where your input timestamp had tzinfo
, but you wanted to ignore the tzinfo and therefore could save some cycles by not parsing the tzinfo characters.
parse_datetime
already handles both tz-aware and tz-naive timestamps.
However, shortcircuiting the parse conflicts with the v2 invariant, as it is no longer checking the entirity of the string.
For example, if you had the invalid timestamp 2014-01-05T12:34:56ABCDEFG
and ran it through parse_datetime_unaware
, it would not raise a ValueError
, since it would stop parsing after the :56
.
However, this timestamp is invalid and should raise a ValueError
under the v2 invariant.
My response to this conflict was to drop parse_datetime_unaware
. I've never found this use case to be compelling. Ignoring tzinfo on a tz-aware timestamp is almost never what you want to do (after all, there is a reason the tzinfo exists in the first place).
If this were truly what you desired as a developer, you could still acheive this functionality by explicitly dropping the tzinfo after the parse:
dt = parse_datetime("2018-01-01T00:00:00+05:00").replace(tzinfo=None)
So the capability is still there. It just removes the performance boost for this case in exchange for a consistent interface.
@thomasst had the following comment:
I'd like to keep parse_datetime_unaware. It could either ignore the timezone information, or only accept timezone-less datetime strings. But if we're strict here, should parse_datetime then reject datetime strings that don't have a timezone?
I'll now go through each line and rebut:
It could either ignore the timezone information
This breaks the invariant of matching the entire string, as discussed above.
or only accept timezone-less datetime strings
parse_datetime_unaware
offers no value then. parse_datetime_unaware
was only there for the case of better performance for timezone-aware timestamps.
parse_datetime
can already handle both tz-aware and tz-naive timestamps without issue.
But if we're strict here, should parse_datetime then reject datetime strings that don't have a timezone?
This would be a very bad idea. Both tz-aware and tz-naive timestamps produce valid datetimes. There is no distiction between them in the ISO 8601 spec.
Imagine that you made parse_datetime
only parse tz-aware timestamps, and parse_datetime_unaware
only parse tz-naive timestamps.
This would then force the developer to know a priori which they are processing. If they were parsing a mixed stream of valid tz-aware and tz-naive timestamps, they would have to somehow implement a way of detecting the difference (by parsing the timestamps), which would defeat much of the purpose of using ciso8601
.
Instead of dropping parse_datetime_unaware
, it is conceivable that we could come up with a different invariant that addresses the v1 problem.
I have given it some thought and have yet come up with a more useful one than the one presented above.
As a compromise, I would consider renaming parse_datetime_unaware
to something like parse_datetime_as_naive_unsafe
(naive is the term that Python uses, not unaware) which would have a different invariant:
parse_datetime_as_naive_unsafe(datetime:str): datetime
is a function that takes a string and either:
ValueError
) with a description of the reason why the string doesn't conform to the supported subset of ISO 8601This alternative invariant would then be documented in the README. Something like
Note: that unlike
parse_datetime
this will not catch all invalid timestamps. It will not raiseValueError
in the case of an invalid timestamp having trailing characters (ex.2014-01-05T12:34:56ABCDEFG
). Only useparse_datetime_as_naive_unsafe
if you are certain your data doesn't suffer from this data quality problem (or you don't care if it does).
Note that the name change (parse_datetime_as_naive_unsafe
) is done intentionally so that:
parse_datetime
)I am completely open to suggestions for how to better address the issue.
Finally, obviously you are the maintainers and I'll defer to your ultimate authority.
I've gone ahead made the changes to implement parse_datetime_as_naive_unsafe
(really, just a rename and some docs changes). I like it.
If you're fine with that solution, I'll get a PR together.
Raises an Exception (ex. ValueError) with a description of the reason why the string doesn't conform to the supported subset of ISO 8601
FYI I'm fine with that, but I'm also okay if we skip the exact reason if it benefits performance or simplifies the code.
parse_datetime_unaware existed for the case where your input timestamp had tzinfo, but you wanted to ignore the tzinfo and therefore could save some cycles by not parsing the tzinfo characters.
It also exists for the case where your code doesn't (need to) deal with aware datetimes. People have different opinions on whether to use aware or naive datetimes, but e.g. http://lucumr.pocoo.org/2011/7/15/eppur-si-muove/ suggests that "internally always use offset naive datetime objects". Personally, I also prefer to use naive datetimes throughout the code (which are assumed to be in UTC), and most of my use cases actually use parse_datetime_unaware
.
It could either ignore the timezone information
This breaks the invariant of matching the entire string, as discussed above.
An idea is to still ensure that the string is valid ISO8601 in its entirety (i.e. you reject strings with an invalid time zone offset), but ignore the timezone information when returning the Python string (or add the offset to the naive datetime so it's always UTC). Would have to check what the actual performance penalty is.
or only accept timezone-less datetime strings parse_datetime can already handle both tz-aware and tz-naive timestamps without issue.
Just to be clear, would parse_datetime
still always return a tz-aware datetime (even if there is no TZ information in the datetime string)?
I really like the idea of replacing "unaware" with "naive".
I guess the use of the term "unsafe" is because a date time like 2018-05-09T17:51:46.134881-04:00
could become datetime.datetime(2018, 5, 9, 17, 51, 46, 134881)
which is loss of information, and attempts to convert it to a timestamp will become a different floating point number than 2018-05-09T17:51:46.134881-04:00
would be if done correctly. But perhaps if the function was parse_datetime_as_naive
the output from 2018-05-09T17:51:46.134881-04:00
could then be datetime.datetime(2018, 5, 9, 21, 51, 46, 134881)
. I.e. 17:51pm in America/New_York becomes 21:51pm in UTC but without the timezone info.
@thomasst
An idea is to still ensure that the string is valid ISO8601 in its entirety (i.e. you reject strings with an invalid time zone offset), but ignore the timezone information when returning the Python string (or add the offset to the naive datetime so it's always UTC). Would have to check what the actual performance penalty is.
Ah, I see. You are perfectly right. I've profiled it and the performance penalty is almost entirely in the pytz.FixedOffset
call. So parse_datetime_unaware
still could validate the entire string, including tz characters for conformance.
Just to be clear, would parse_datetime still always return a tz-aware datetime (even if there is no TZ information in the datetime string)?
No. parse_datetime
returns an aware datetime if the input string has tzinfo, and a naive datetime otherwise. This is not a change from v1.0.7
I think that clears things up for me. There will be parse_datetime
and parse_datetime_as_naive
TZ Aware Input | TZ Naive Input | |
---|---|---|
parse_datetime() output |
tz aware datetime | tz naive datetime |
parse_datetime_as_naive() output |
tz naive datetime | tz naive datetime |
Personally, I also prefer to use naive datetimes throughout the code (which are assumed to be in UTC), and most of my use cases actually use parse_datetime_unaware.
That's fine, though I see this as a case where you should be using parse_datetime
. If somehow a non-UTC datetime were to be introduced into the system, you would notice right away when a comparision between naive and aware datetimes occurred, instead of silently dropping the tzinfo by using parse_datetime_unaware
. In my mind, parse_datetime_unaware
should only be used at the "front doors" into your system. By using parse_datetime
everywhere else, you make sure that there isn't a door you don't know about (and the performance is the same for naive timestamps). Almost all of your uses of cio8601 should be parse_datetime
, not parse_datetime_unaware
.
This is why I want to change the name to parse_datetime_as_naive
. The as_naive
forces you to acknowledge that this is not just a parser for naive timestamps. If you just needed that, you could simply use parse_datetime
(with no performance penalty). Instead parse_datetime_as_naive
is meant for the case where you want and need to strip the tz information for performance reasons.
Alternative names could be parse_datetime_ignore_tzinfo
or parse_datetime_strip_tzinfo
Thanks for this. Having both parse_datetime
and parse_datetime_as_naive
as you described sounds good to me!
This issues is fixed in version 2.0.0
(#43).
Attempting to parse any out-of-spec timestamp now raises a ValueError
.
value = parse_datetime(value).date() ValueError: Invalid character while parsing year ('-', Index: 2)
I came across this silent failure when parsing a third party datetime with an unusual separator:
It seems that the parser sees a character that it doesn't recognize and stops parsing?
I would have expected that if the entire string cannot be parsed, it would return
None
asparse_datetime
does for other parse failures, not silently return a datetime without time ortzinfo
.This lead to a sneaky bug in the code.