apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.59k stars 3.54k forks source link

Convert string time to timestamp. #13111

Closed Milad137 closed 2 years ago

Milad137 commented 2 years ago

I have a csv file with datetime formatted as "%d/%m/%Y %I:%M %p". I can convert them to pandas datetime but when I use pyarrow's strptime I get the following error. ArrowInvalid: Failed to parse string: '9/03/2043 12:35 AM' as a scalar of type timestamp[s]

Code to reproduce:

import pandas as pd
import pyarrow.compute as pc

# pd.__version__ == '1.3.4'
# pyarrow.__version__ == '8.0.0'

_ts = ["9/03/2043 12:35 AM"]
_format = "%d/%m/%Y %I:%M %p"

pd.to_datetime(_ts, format=_format)
pc.strptime(_ts, format= _format, unit='s')
AlenkaF commented 2 years ago

Unfortunately I wasn't able to reproduce locally:

>>> import pyarrow
>>> import pandas as pd
>>> pyarrow.__version__
'8.0.0'
>>> pd.__version__
'1.4.2'

>>> import pyarrow.compute as pc
>>> _ts = ["9/03/2043 12:35 AM"]
>>> _format = "%d/%m/%Y %I:%M %p"

>>> pd.to_datetime(_ts, format=_format)
DatetimeIndex(['2043-03-09 00:35:00'], dtype='datetime64[ns]', freq=None)
>>> pc.strptime(_ts, format= _format, unit='s')
<pyarrow.lib.TimestampArray object at 0x7fcae0078e80>
[
  2043-03-09 00:35:00
]
rok commented 2 years ago

@Milad137 what OS are you on?

Milad137 commented 2 years ago

@rok

@Milad137 what OS are you on?

I'm on windows. I just ran it on a Linux machine, and it worked fine. It must be an OS issue, then.

rok commented 2 years ago

@Milad137 Arrow is using arrow_strptime on windows and system strptime on other systems. There might be a feature (available strptime flags) gap. Could you please test a subset of flags and see if a particular one is missing for you? E.g.:

_ts = ["9/03/2043 12:35"]
_format = "%d/%m/%Y %I:%M"

_ts = ["9/03/2043"]
_format = "%d/%m/%Y"

_ts = ["12:35 AM"]
_format = "%I:%M %p"
Milad137 commented 2 years ago

@rok The first 2 are working fine, it's just the last one, when the "%p" tag gets added, it throws an error. I ran it on two windows machines, still getting the same error.

>>> import pandas as pd
>>> import pyarrow.compute as pc

>>> # pd.__version__ == '1.3.4'
>>> # pyarrow.__version__ == '8.0.0'

>>> _ts = ["9/03/2043 12:35"]
>>> _format = "%d/%m/%Y %I:%M"
>>> print(pc.strptime(_ts, format= _format, unit='s'))
[
  2043-03-09 12:35:00
]

>>> _ts = ["9/03/2043"]
>>> _format = "%d/%m/%Y"
>>> print(pc.strptime(_ts, format= _format, unit='s'))
[
  2043-03-09 00:00:00
]

>>> _ts = ["12:35 AM"]
>>> _format = "%I:%M %p"
>>> print(pc.strptime(_ts, format= _format, unit='s'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~\anaconda3\lib\site-packages\pyarrow\compute.py", line 250, in wrapper
    return func.call(args, options, memory_pool)
  File "pyarrow\_compute.pyx", line 341, in pyarrow._compute.Function.call
  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Failed to parse string: '12:35 AM' as a scalar of type timestamp[s]
jorisvandenbossche commented 2 years ago

That seems a limitation of the vendored strptime implementation then. @rok do you want to open a JIRA for this?

pitrou commented 2 years ago

The vendored strptime implementation does support %p, but we have to disable it on Windows because the behavior is locale-dependent and it's not easy to support there.

(of course, we may support it in a non-locale-dependent way, i.e. always use the same values as a "C" Unix locale)

rok commented 2 years ago

Sorry I lost sight on this one. Opened ARROW-16618. Similarly we have ARROW-16443 reported from R. Is there an easy way for users to change locale to "C" as a workaround @pitrou?

RandomFractals commented 2 years ago

I ran into the same issue while trying to convert string dates in Chicago crimes datatset.

@pitrou What was the resolution of this ticket, or a workaround to convert date strings with AM/PM strings on Windows? I am on win10.

Data repo: https://github.com/RandomFractals/chicago-crimes

Error pic with more info:

chi-crimes-pyarrow-strptime-error

Milad137 commented 2 years ago

My workaround was to use VSCode dev container.

rok commented 2 years ago

Hey @RandomFractals! There was no work done on ARROW-16618 as there wasn't much expressed interest. Right now @Milad137's workaround might be the best good idea.

RandomFractals commented 2 years ago

My workaround was to use VSCode dev container.

not an option. seems like a very basic Python date conversion issue. I'll see if maybe Py 3.11 fixes that.

Milad137 commented 2 years ago

My workaround was to use VSCode dev container.

not an option. seems like a very basic Python date conversion issue. I'll see if maybe Py 3.11 fixes that.

It's not a python issue, even if you wait til python 4 comes out still this issue won't be fixed. if you're keen on fixing it, correct this file arrow_strptime and then recompile arrow.

RandomFractals commented 2 years ago

@Milad137 I am not so sure about that. I could not get those timestamps converted with Polars either:

https://twitter.com/TarasNovak/status/1584884282518306817

Probably b/c Polars uses arrow a lot under the hood too. 😄

RandomFractals commented 2 years ago

also, @pitrou, if this is arrow strptime issue, could you please reopen this ticket?

I would like to to use pyarrow on Windows and this seems like very basic timestamp issue that is core to working with datasets with string timestamps.

Related ticket in my repo: https://github.com/RandomFractals/chicago-crimes/issues/21

pitrou commented 2 years ago

@RandomFractals Arrow uses a vendored copy of musl's strptime, where we had to disable all format codes that use langinfo. So %p unfortunately won't work on Windows.

If you know of a more complete implementation of strptime that works on Windows, feel free to suggest it.

RandomFractals commented 2 years ago

@pitrou so, basically, this renders pyarrow useless for devs using Jupyter notebooks on windows with datasets that have string timestamps with AM/PM time parts.

I don't have better suggestions. Sounds like an important internal issue for the core arrow dev team to look into and resolve. Thanks for your explanation of it.

rok commented 2 years ago

@RandomFractals Arrow is juggling several priorities. For instance we have several open strptime issues. However it's good to know there's interest for this particular issue to be fixed so that it can be prioritised when someone has time to work in this space.

davlee1972 commented 1 year ago

Wouldn't it be cleaner to add a locale parameter to strptime?

pyarrow.compute.strptime(strings, /, format, unit, error_is_null=False, *, options=None, memory_pool=None)

strftime does support a locale parameter to convert timestamps to string

pyarrow.compute.strftime(timestamps, /, format='%Y-%m-%dT%H:%M:%S', locale='C', *, options=None, memory_pool=None)¶

This would also need to be added to pyarrow.csv.ConvertOptions. The timestamp_parsers currently is just a list of strptime format strings without any type of locale.

https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow-csv-convertoptions

_timestampparsers A sequence of strptime()-compatible format strings, tried in order when attempting to infer or convert timestamp values (the special value ISO8601() can also be given). By default, a fast built-in ISO-8601 parser is used.

davlee1972 commented 1 year ago

I got a temporary solution which modifies pyarrow.compute.strptime(),

Here's the completed code.. edit - removed pc.and_() expressions.

import pyarrow.compute as pc
import pyarrow as pa

old_strpttime = pc.strptime

def new_strptime(*args, **kwargs):
    if 'format' in kwargs and '%p' in kwargs['format'] and '%I' in kwargs['format']:
        ## is_am = pc.count_substring(args[0], 'AM').cast("bool")
        is_am = pc.count_substring(args[0], 'AM')
        kwargs['format'] = kwargs['format'].replace('%p','')
        strings = pc.replace_substring(pc.replace_substring(args[0], 'AM', ''), 'PM', '')
        new_timestamps = old_strpttime(strings, **kwargs)
        hour_adjustments = pc.case_when(pc.make_struct(
            ## pc.and_(is_am, pc.equal(pc.hour(new_timestamps), 12)),
            ## pc.and_(pc.invert(is_am), pc.equal(pc.hour(new_timestamps), 12)),
            ## is_am,
            ## pc.invert(is_am)
            pc.equal(pc.hour(new_timestamps), 12),
            ),
            ## -12 * 3600,
            ## 0,
            ## 0,
            ## 12 * 3600
            pc.multiply(is_am, -12 * 3600),
            pc.multiply(pc.add(is_am, -1), -12 * 3600)
        )
        ## hour_adjustments = pa.cast(hour_adjustments, "duration[s]")
        hour_adjustments = hour_adjustments.cast("int64").cast("duration[s]")
        final_timestamps = pc.add(new_timestamps, hour_adjustments)
        return final_timestamps
    else:
        return old_strpttime(*args, **kwargs)

pc.strptime = new_strptime

Testing results:

>>> _ts = ["9/03/2023 12:35 AM", "9/03/2023 12:35 PM", "9/03/2023 6:35 AM", "9/03/2023 6:35 PM"]
>>> _format = "%d/%m/%Y %I:%M %p"
>>>
>>> pc.strptime(_ts, format= _format, unit='s')
<pyarrow.lib.TimestampArray object at 0x00000206B9888D00>
[
  2023-03-09 00:35:00,
  2023-03-09 12:35:00,
  2023-03-09 06:35:00,
  2023-03-09 18:35:00
]

removed pc.and_() expressions. For some reasons using this strptime replacement function with the dataset api produces the following error

>>> dataset.to_table(columns=columns)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow\_dataset.pyx", line 537, in pyarrow._dataset.Dataset.to_table
  File "pyarrow\_dataset.pyx", line 383, in pyarrow._dataset.Dataset.scanner
  File "pyarrow\_dataset.pyx", line 3202, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow\_dataset.pyx", line 3120, in pyarrow._dataset.Scanner._make_scan_options
  File "pyarrow\_dataset.pyx", line 3067, in pyarrow._dataset._populate_builder
  File "pyarrow\error.pxi", line 119, in pyarrow.lib.check_status
pyarrow.lib.ArrowKeyError: No function registered with name: and_