Closed Milad137 closed 2 years ago
Unfortunately I wasn't able to reproduce locally:
>>> import pyarrow
>>> import pandas as pd
>>> pyarrow.__version__
'8.0.0'
>>> pd.__version__
'1.4.2'
>>> import pyarrow.compute as pc
>>> _ts = ["9/03/2043 12:35 AM"]
>>> _format = "%d/%m/%Y %I:%M %p"
>>> pd.to_datetime(_ts, format=_format)
DatetimeIndex(['2043-03-09 00:35:00'], dtype='datetime64[ns]', freq=None)
>>> pc.strptime(_ts, format= _format, unit='s')
<pyarrow.lib.TimestampArray object at 0x7fcae0078e80>
[
2043-03-09 00:35:00
]
@Milad137 what OS are you on?
@rok
@Milad137 what OS are you on?
I'm on windows. I just ran it on a Linux machine, and it worked fine. It must be an OS issue, then.
@Milad137 Arrow is using arrow_strptime
on windows and system strptime
on other systems. There might be a feature (available strptime flags) gap.
Could you please test a subset of flags and see if a particular one is missing for you? E.g.:
_ts = ["9/03/2043 12:35"]
_format = "%d/%m/%Y %I:%M"
_ts = ["9/03/2043"]
_format = "%d/%m/%Y"
_ts = ["12:35 AM"]
_format = "%I:%M %p"
@rok The first 2 are working fine, it's just the last one, when the "%p" tag gets added, it throws an error. I ran it on two windows machines, still getting the same error.
>>> import pandas as pd
>>> import pyarrow.compute as pc
>>> # pd.__version__ == '1.3.4'
>>> # pyarrow.__version__ == '8.0.0'
>>> _ts = ["9/03/2043 12:35"]
>>> _format = "%d/%m/%Y %I:%M"
>>> print(pc.strptime(_ts, format= _format, unit='s'))
[
2043-03-09 12:35:00
]
>>> _ts = ["9/03/2043"]
>>> _format = "%d/%m/%Y"
>>> print(pc.strptime(_ts, format= _format, unit='s'))
[
2043-03-09 00:00:00
]
>>> _ts = ["12:35 AM"]
>>> _format = "%I:%M %p"
>>> print(pc.strptime(_ts, format= _format, unit='s'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "~\anaconda3\lib\site-packages\pyarrow\compute.py", line 250, in wrapper
return func.call(args, options, memory_pool)
File "pyarrow\_compute.pyx", line 341, in pyarrow._compute.Function.call
File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Failed to parse string: '12:35 AM' as a scalar of type timestamp[s]
That seems a limitation of the vendored strptime implementation then. @rok do you want to open a JIRA for this?
The vendored strptime implementation does support %p
, but we have to disable it on Windows because the behavior is locale-dependent and it's not easy to support there.
(of course, we may support it in a non-locale-dependent way, i.e. always use the same values as a "C" Unix locale)
Sorry I lost sight on this one. Opened ARROW-16618. Similarly we have ARROW-16443 reported from R. Is there an easy way for users to change locale to "C" as a workaround @pitrou?
I ran into the same issue while trying to convert string dates in Chicago crimes datatset.
@pitrou What was the resolution of this ticket, or a workaround to convert date strings with AM/PM strings on Windows? I am on win10.
Data repo: https://github.com/RandomFractals/chicago-crimes
Error pic with more info:
My workaround was to use VSCode dev container.
Hey @RandomFractals! There was no work done on ARROW-16618 as there wasn't much expressed interest. Right now @Milad137's workaround might be the best good idea.
My workaround was to use VSCode dev container.
not an option. seems like a very basic Python date conversion issue. I'll see if maybe Py 3.11 fixes that.
My workaround was to use VSCode dev container.
not an option. seems like a very basic Python date conversion issue. I'll see if maybe Py 3.11 fixes that.
It's not a python issue, even if you wait til python 4 comes out still this issue won't be fixed. if you're keen on fixing it, correct this file arrow_strptime and then recompile arrow.
@Milad137 I am not so sure about that. I could not get those timestamps converted with Polars either:
https://twitter.com/TarasNovak/status/1584884282518306817
Probably b/c Polars uses arrow a lot under the hood too. 😄
also, @pitrou, if this is arrow strptime
issue, could you please reopen this ticket?
I would like to to use pyarrow on Windows and this seems like very basic timestamp issue that is core to working with datasets with string timestamps.
Related ticket in my repo: https://github.com/RandomFractals/chicago-crimes/issues/21
@RandomFractals Arrow uses a vendored copy of musl's strptime, where we had to disable all format codes that use langinfo
. So %p
unfortunately won't work on Windows.
If you know of a more complete implementation of strptime
that works on Windows, feel free to suggest it.
@pitrou so, basically, this renders pyarrow useless for devs using Jupyter notebooks on windows with datasets that have string timestamps with AM/PM time parts.
I don't have better suggestions. Sounds like an important internal issue for the core arrow dev team to look into and resolve. Thanks for your explanation of it.
@RandomFractals Arrow is juggling several priorities. For instance we have several open strptime issues. However it's good to know there's interest for this particular issue to be fixed so that it can be prioritised when someone has time to work in this space.
Wouldn't it be cleaner to add a locale parameter to strptime?
pyarrow.compute.strptime(strings, /, format, unit, error_is_null=False, *, options=None, memory_pool=None)
strftime does support a locale parameter to convert timestamps to string
pyarrow.compute.strftime(timestamps, /, format='%Y-%m-%dT%H:%M:%S', locale='C', *, options=None, memory_pool=None)¶
This would also need to be added to pyarrow.csv.ConvertOptions. The timestamp_parsers currently is just a list of strptime format strings without any type of locale.
_timestampparsers A sequence of strptime()-compatible format strings, tried in order when attempting to infer or convert timestamp values (the special value ISO8601() can also be given). By default, a fast built-in ISO-8601 parser is used.
I got a temporary solution which modifies pyarrow.compute.strptime(),
Here's the completed code.. edit - removed pc.and_() expressions.
import pyarrow.compute as pc
import pyarrow as pa
old_strpttime = pc.strptime
def new_strptime(*args, **kwargs):
if 'format' in kwargs and '%p' in kwargs['format'] and '%I' in kwargs['format']:
## is_am = pc.count_substring(args[0], 'AM').cast("bool")
is_am = pc.count_substring(args[0], 'AM')
kwargs['format'] = kwargs['format'].replace('%p','')
strings = pc.replace_substring(pc.replace_substring(args[0], 'AM', ''), 'PM', '')
new_timestamps = old_strpttime(strings, **kwargs)
hour_adjustments = pc.case_when(pc.make_struct(
## pc.and_(is_am, pc.equal(pc.hour(new_timestamps), 12)),
## pc.and_(pc.invert(is_am), pc.equal(pc.hour(new_timestamps), 12)),
## is_am,
## pc.invert(is_am)
pc.equal(pc.hour(new_timestamps), 12),
),
## -12 * 3600,
## 0,
## 0,
## 12 * 3600
pc.multiply(is_am, -12 * 3600),
pc.multiply(pc.add(is_am, -1), -12 * 3600)
)
## hour_adjustments = pa.cast(hour_adjustments, "duration[s]")
hour_adjustments = hour_adjustments.cast("int64").cast("duration[s]")
final_timestamps = pc.add(new_timestamps, hour_adjustments)
return final_timestamps
else:
return old_strpttime(*args, **kwargs)
pc.strptime = new_strptime
Testing results:
>>> _ts = ["9/03/2023 12:35 AM", "9/03/2023 12:35 PM", "9/03/2023 6:35 AM", "9/03/2023 6:35 PM"]
>>> _format = "%d/%m/%Y %I:%M %p"
>>>
>>> pc.strptime(_ts, format= _format, unit='s')
<pyarrow.lib.TimestampArray object at 0x00000206B9888D00>
[
2023-03-09 00:35:00,
2023-03-09 12:35:00,
2023-03-09 06:35:00,
2023-03-09 18:35:00
]
removed pc.and_() expressions. For some reasons using this strptime replacement function with the dataset api produces the following error
>>> dataset.to_table(columns=columns)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow\_dataset.pyx", line 537, in pyarrow._dataset.Dataset.to_table
File "pyarrow\_dataset.pyx", line 383, in pyarrow._dataset.Dataset.scanner
File "pyarrow\_dataset.pyx", line 3202, in pyarrow._dataset.Scanner.from_dataset
File "pyarrow\_dataset.pyx", line 3120, in pyarrow._dataset.Scanner._make_scan_options
File "pyarrow\_dataset.pyx", line 3067, in pyarrow._dataset._populate_builder
File "pyarrow\error.pxi", line 119, in pyarrow.lib.check_status
pyarrow.lib.ArrowKeyError: No function registered with name: and_
I have a csv file with datetime formatted as "%d/%m/%Y %I:%M %p". I can convert them to pandas datetime but when I use pyarrow's strptime I get the following error.
ArrowInvalid: Failed to parse string: '9/03/2043 12:35 AM' as a scalar of type timestamp[s]
Code to reproduce: