apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] date64 arrays do not round-trip through pandas conversion #38050

Open spenczar opened 1 year ago

spenczar commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

In PyArrow, Date64Array values do not preserve their underlying millisecond values when loaded back from pandas via pa.array.

For example, let's make a date64 array and convert it to a pandas Series, taking care to avoid using Python datetime objects:

import pyarrow as pa
import pandas as pd
date64_array = pa.array([1, 2, 3], pa.date64())
date64_pd = date64_array.to_pandas(date_as_object=False)

# Now load it back in:
date64_roundtripped = pa.array(date64_pd, pa.date64())

# It ought to be unchanged - but it's not; this assertion fails:
assert date64_roundtripped == date64_array

If one prints pc.subtract(date64_roundtripped, date64_array) (with pyarrow.compute imported as pc), you can see that they are different:

<pyarrow.lib.DurationArray object at 0x10537f160>
[
  -1,
  -2,
  -3
]

Note that this does not occur for date32:

import pyarrow as pa
import pandas as pd
date32_array = pa.array([1, 2, 3], pa.date32())
date32_pd = date32_array.to_pandas(date_as_object=False)

date32_roundtripped = pa.array(date32_pd, pa.date32())

# just fine:
assert date32_roundtripped == date32_array

It appears to me that date64_pd itself is just fine. It prints as:

0   1970-01-01 00:00:00.001
1   1970-01-01 00:00:00.002
2   1970-01-01 00:00:00.003
dtype: datetime64[ns]
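
One can also confirm that the series still carries the original milliseconds (a quick check; the integer values are nanoseconds since the epoch):

import pyarrow as pa

date64_pd = pa.array([1, 2, 3], pa.date64()).to_pandas(date_as_object=False)
# The underlying nanosecond values still encode 1, 2 and 3 milliseconds:
print(date64_pd.astype("int64").tolist())  # [1000000, 2000000, 3000000]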

One hint at what's going on is to use pa.Array.from_pandas. That actually returns a `TimestampArray`:

In [31]: pa.Array.from_pandas(date64_pd)
Out[31]:
<pyarrow.lib.TimestampArray object at 0x12d434d60>
[
  1970-01-01 00:00:00.001000000,
  1970-01-01 00:00:00.002000000,
  1970-01-01 00:00:00.003000000
]

The issue might be that the conversion from TimestampArray back to Date64Array drops precision.
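
A quick way to check that hypothesis is to cast the intermediate TimestampArray back to date64 yourself (a sketch; safe=False is passed only to bypass any lossy-cast check):

import pyarrow as pa

date64_pd = pa.array([1, 2, 3], pa.date64()).to_pandas(date_as_object=False)
ts_array = pa.Array.from_pandas(date64_pd)

# Casting the timestamps back to date64 truncates to day boundaries:
print(ts_array.cast(pa.date64(), safe=False).cast(pa.int64()))  # [0, 0, 0]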

Component(s)

Python

jorisvandenbossche commented 1 year ago

Interesting. This is because the date64 data type stores its data as milliseconds under the hood; however, because it is a "date", it should not actually contain any milliseconds that are not a multiple of an exact day.

After roundtripping, the data has become more "correct":

In [24]: date64_array.cast("int64")
Out[24]: 
<pyarrow.lib.Int64Array object at 0x7f67e06dbee0>
[
  1,
  2,
  3
]

In [25]: date64_roundtripped.cast("int64")
Out[25]: 
<pyarrow.lib.Int64Array object at 0x7f67e009f040>
[
  0,
  0,
  0
]

But as long as we allow storing milliseconds that are not a multiple of a single day, we should also ignore those sub-day milliseconds in operations like equality. For example, date64_roundtripped == date64_array should evaluate to True, even though the underlying values are not equal.
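
As an illustration, such a semantic equality could truncate both sides to day boundaries before comparing (a rough sketch with a hypothetical helper; integer divide truncates toward zero, so pre-epoch values would need floor division instead):

import pyarrow as pa
import pyarrow.compute as pc

MS_PER_DAY = 86_400_000

def dates_semantically_equal(a, b):
    # Hypothetical helper: drop sub-day milliseconds, then compare.
    a_days = pc.divide(a.cast(pa.int64()), MS_PER_DAY)
    b_days = pc.divide(b.cast(pa.int64()), MS_PER_DAY)
    return pc.equal(a_days, b_days)

# With the arrays from above:
# dates_semantically_equal(date64_roundtripped, date64_array)
# -> [true, true, true]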

Our format spec says:

/// Date is either a 32-bit or 64-bit signed integer type representing an
/// elapsed time since UNIX epoch (1970-01-01), stored in either of two units:
///
/// * Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no
///   leap seconds), where the values are evenly divisible by 86400000
/// * Days (32 bits) since the UNIX epoch

So the question is whether we should always truncate the values when creating, or rather deal with sub-day milliseconds later on.
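
For reference, truncating would amount to something like the following (a sketch, not the actual constructor logic; same caveat about pre-epoch values and truncating division):

import pyarrow as pa
import pyarrow.compute as pc

MS_PER_DAY = 86_400_000

def truncate_to_day(arr):
    # Floor every value to a day boundary, as the spec requires
    # (values evenly divisible by 86400000).
    ints = arr.cast(pa.int64())
    floored = pc.multiply(pc.divide(ints, MS_PER_DAY), MS_PER_DAY)
    return floored.cast(pa.date64())

print(truncate_to_day(pa.array([1, 2, 3], pa.date64())).cast(pa.int64()))
# -> [0, 0, 0]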

spenczar commented 1 year ago

I see! Thanks for the thorough explanation.

My opinion is that logical data types like date64 should be a semantic layer on top of the physical data. I think that PyArrow should accept the possibility that the physical data doesn't conform to its semantic expectations, so it should be able to work with data with sub-day milliseconds, especially if they come from some foreign, non-pyarrow source.

I think that means that equality should be changed, like you say, since that's a semantic statement. But always truncating the physical data seems too extreme - I'd prefer that PyArrow preserve whatever it was given. Maybe constructors from "raw" sources (Python lists, maybe Pandas Series) should truncate, though?

Anyway - I think I agree that the compute logic should change. It seems likely that many compute operations would need to change, though. For example, all the hash operations - would we need to always truncate before any compute operator is applied?

jorisvandenbossche commented 1 year ago

> Anyway - I think I agree that the compute logic should change. It seems likely that many compute operations would need to change, though. For example, all the hash operations - would we need to always truncate before any compute operator is applied?

Yeah, and for that reason it might make more sense to always truncate when constructing from external sources (even for numpy arrays), so that within Arrow we can assume the values are always a multiple of a day and don't have to check for this in every kernel.

jorisvandenbossche commented 1 year ago

Constructors from Python objects especially, which wouldn't be zero-copy anyway (like your initial example of pa.array([1, 2, 3], pa.date64())), should definitely truncate correctly on construction, IMO.
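
In the meantime, a user-level workaround is to go through plain integers rather than through the datetime64[ns] series (a sketch; it assumes the goal is to preserve the raw millisecond values exactly, and relies on pa.array storing plain integers verbatim, as in the first example above):

import pyarrow as pa

date64_array = pa.array([1, 2, 3], pa.date64())
date64_pd = date64_array.to_pandas(date_as_object=False)

# Convert the nanosecond timestamps back to milliseconds ourselves and
# construct from plain integers, which pa.array stores as-is:
ms = date64_pd.astype("int64") // 1_000_000
restored = pa.array(ms, pa.date64())
assert restored == date64_array  # round-trips exactly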