Open asfimport opened 2 years ago
Rok Mihevc / @rok: I agree this is not desired behaviour. Do we have a convention for %S? I'm working on two other strptime issues right now (ARROW-15665, ARROW-14477) and can also check this while I'm "there".
Joris Van den Bossche / @jorisvandenbossche: The main problem is that there is no clear "standard" for strptime. Python for example has the "%f" field for fractional seconds(https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes). The R / lubridate parser seems to use "%OS" (https://rdrr.io/r/base/strptime.html, https://lubridate.tidyverse.org/reference/parse_date_time.html). But the C strptime does not support either of those (https://man7.org/linux/man-pages/man3/strptime.3.html; it uses "%OS", but apparently for something different?)
So as long as we use existing strptime
implementation, I don't it is possible or easy to "fix" this.
One option I was thinking about is when the users specifies an ISO-like format string, that we could use our own fast ISO parser, instead of using strptime
. But that would of course also make the support for fractional seconds dependent on the exact format you specified.
Joris Van den Bossche / @jorisvandenbossche: Looking at date.h, it seems they also have a parser, and do support decimal seconds: https://howardhinnant.github.io/date/date.html#from_stream_formatting (and using "%NS" with N the width to parse), and this is (I suppose) consistent with what will be available in C++ 20 (https://en.cppreference.com/w/cpp/chrono/parse)
Rok Mihevc / @rok: Following date.h seems like a good idea. +1 for %NS.
Joris Van den Bossche / @jorisvandenbossche: I just closed two other (older) issues as duplicate of this one, but just noting here that both of them tried to use %f for fractional seconds. For example:
>>> pc.strptime("2015-01-09 00:00:00.000", format="%Y-%m-%d %H:%M:%S.%f", unit="ns")
...
ArrowInvalid: Failed to parse string: '2015-01-09 00:00:00.000' as a scalar of type timestamp[ns]
Given that we are probably not going for "%f" (based on the discussion above), it might still be useful to give a more informative error message, for example explicitly stating that "%f" is not supported.
Carl Boettiger / @cboettig: I agree about the lack of a standard, but note that arrow itself is inconsistent about conversions between dates and strings. e.g. in https://issues.apache.org/jira/browse/ARROW-17905?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel I show an example where arrow coerces a timestamp to a string, choosing to keep OS level precision (us), which then encounters the problem shown here, where we don't have the syntax to parse it correctly. (While I think my issue is essentially a duplicate of the python issue, it's a bit more opaque from the R side where arrow supports lubridate::as_datetime(), but the actual lubridate::as_datetime() is not impacted by this issue).
Currently, we can't parse "our own" string representation of a timestamp array with the timestamp parser
strptime
:The reason for this is the fractional second part, so the following works:
Now, I think the reason that this fails is because
strptime
only supports parsing seconds as an integer (https://man7.org/linux/man-pages/man3/strptime.3.html).But, it creates a strange situation where the timestamp parser cannot parse the representation we use for timestamps.
In addition, for CSV we have a custom ISO parser (used by default), so when parsing the strings while reading a CSV file, the same string with fractional seconds does work:
I realize that you can use the generic "cast" for doing this string parsing:
But this was not the first way I thought about (I think it is quite typical to first think of
strptime
, and it is confusing that that doesn't work; the error message is also not helpful) cc @pitrou @rokReporter: Joris Van den Bossche / @jorisvandenbossche Watchers: Rok Mihevc / @rok
Related issues:
Note: This issue was originally created as ARROW-15883. Please see the migration documentation for further details.