apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.43k stars 3.51k forks source link

[C++] Support for fractional seconds in strptime() #20146

Open asfimport opened 2 years ago

asfimport commented 2 years ago

Currently, we can't parse "our own" string representation of a timestamp array with the timestamp parser strptime:


import datetime
import pyarrow as pa
import pyarrow.compute as pc

>>> pa.array([datetime.datetime(2022, 3, 5, 9)])
<pyarrow.lib.TimestampArray object at 0x7f00c1d53dc0>
[
  2022-03-05 09:00:00.000000
]

# trying to parse the above representation as string
>>> pc.strptime(["2022-03-05 09:00:00.000000"], format="%Y-%m-%d %H:%M:%S", unit="us")
...
ArrowInvalid: Failed to parse string: '2022-03-05 09:00:00.000000' as a scalar of type timestamp[us]

The reason for this is the fractional second part, so the following works:


>>> pc.strptime(["2022-03-05 09:00:00"], format="%Y-%m-%d %H:%M:%S", unit="us")
<pyarrow.lib.TimestampArray object at 0x7f00c1d6f940>
[
  2022-03-05 09:00:00.000000
]

Now, I think the reason that this fails is because strptime only supports parsing seconds as an integer (https://man7.org/linux/man-pages/man3/strptime.3.html).

But, it creates a strange situation where the timestamp parser cannot parse the representation we use for timestamps.

In addition, for CSV we have a custom ISO parser (used by default), so when parsing the strings while reading a CSV file, the same string with fractional seconds does work:


s = b"""a
2022-03-05 09:00:00.000000"""

from pyarrow import csv

>>> csv.read_csv(io.BytesIO(s))
pyarrow.Table
a: timestamp[ns]
----
a: [[2022-03-05 09:00:00.000000000]]

I realize that you can use the generic "cast" for doing this string parsing:


>>> pc.cast(["2022-03-05 09:00:00.000000"], pa.timestamp("us"))
<pyarrow.lib.TimestampArray object at 0x7f00c1d53d60>
[
  2022-03-05 09:00:00.000000
]

But this was not the first way I thought about (I think it is quite typical to first think of strptime, and it is confusing that that doesn't work; the error message is also not helpful) cc @pitrou @rok

Reporter: Joris Van den Bossche / @jorisvandenbossche Watchers: Rok Mihevc / @rok

Related issues:

Note: This issue was originally created as ARROW-15883. Please see the migration documentation for further details.

asfimport commented 2 years ago

Rok Mihevc / @rok: I agree this is not desired behaviour. Do we have a convention for %S? I'm working on two other strptime issues right now (ARROW-15665, ARROW-14477) and can also check this while I'm "there".

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: The main problem is that there is no clear "standard" for strptime. Python for example has the "%f" field for fractional seconds(https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes). The R / lubridate parser seems to use "%OS" (https://rdrr.io/r/base/strptime.html, https://lubridate.tidyverse.org/reference/parse_date_time.html). But the C strptime does not support either of those (https://man7.org/linux/man-pages/man3/strptime.3.html; it uses "%OS", but apparently for something different?)

So as long as we use existing strptime implementation, I don't it is possible or easy to "fix" this.

One option I was thinking about is when the users specifies an ISO-like format string, that we could use our own fast ISO parser, instead of using strptime. But that would of course also make the support for fractional seconds dependent on the exact format you specified.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: Looking at date.h, it seems they also have a parser, and do support decimal seconds: https://howardhinnant.github.io/date/date.html#from_stream_formatting (and using "%NS" with N the width to parse), and this is (I suppose) consistent with what will be available in C++ 20 (https://en.cppreference.com/w/cpp/chrono/parse)

asfimport commented 2 years ago

Rok Mihevc / @rok: Following date.h seems like a good idea. +1 for %NS.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: I just closed two other (older) issues as duplicate of this one, but just noting here that both of them tried to use %f for fractional seconds. For example:


>>> pc.strptime("2015-01-09 00:00:00.000", format="%Y-%m-%d %H:%M:%S.%f", unit="ns")
...
ArrowInvalid: Failed to parse string: '2015-01-09 00:00:00.000' as a scalar of type timestamp[ns]

Given that we are probably not going for "%f" (based on the discussion above), it might still be useful to give a more informative error message, for example explicitly stating that "%f" is not supported.

asfimport commented 2 years ago

Carl Boettiger / @cboettig: I agree about the lack of a standard, but note that arrow itself is inconsistent about conversions between dates and strings.  e.g. in https://issues.apache.org/jira/browse/ARROW-17905?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel I show an example where arrow coerces a timestamp to a string, choosing to keep OS level precision (us), which then encounters the problem shown here, where we don't have the syntax to parse it correctly.  (While I think my issue is essentially a duplicate of the python issue, it's a bit more opaque from the R side where arrow supports lubridate::as_datetime(), but the actual lubridate::as_datetime() is not impacted by this issue).