Open spenczar opened 1 year ago
Interesting. This is because the date64
data type stores the data as milliseconds under the hood; however, because it is a "date", it should not actually contain any milliseconds that are not a multiple of an exact day.
After roundtripping, the data has become more "correct":
In [24]: date64_array.cast("int64")
Out[24]:
<pyarrow.lib.Int64Array object at 0x7f67e06dbee0>
[
1,
2,
3
]
In [25]: date64_roundtripped.cast("int64")
Out[25]:
<pyarrow.lib.Int64Array object at 0x7f67e009f040>
[
0,
0,
0
]
But as long as we allow storing milliseconds that are not a multiple of a single day, we should also ignore those sub-day milliseconds in operations like equality. For example, date64_roundtripped == date64_array
should evaluate to True, even though the underlying values are not equal.
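A minimal sketch of the current behavior (the construction here is an assumed reconstruction of the arrays under discussion, not the exact code from the report):

import pyarrow as pa
import pyarrow.compute as pc

date64_array = pa.array([1, 2, 3], pa.date64())
# Round-tripping through pandas truncates the sub-day milliseconds away:
date64_roundtripped = pa.array(date64_array.to_pandas(), pa.date64())

# Comparison today is purely physical, so it reports inequality:
pc.equal(date64_roundtripped, date64_array)
# -> [false, false, false]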
Our format spec says:
/// Date is either a 32-bit or 64-bit signed integer type representing an
/// elapsed time since UNIX epoch (1970-01-01), stored in either of two units:
///
/// * Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no
/// leap seconds), where the values are evenly divisible by 86400000
/// * Days (32 bits) since the UNIX epoch
So the question is whether we should always truncate the values when creating, or rather deal with sub-day milliseconds later on.
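For concreteness, "truncate" here means flooring each stored value to the previous midnight; a small sketch of that arithmetic (the helper name is hypothetical):

MS_PER_DAY = 86_400_000

def truncate_to_day(ms: int) -> int:
    # Python's % floors, so pre-epoch (negative) values also land
    # on the start of the correct day.
    return ms - (ms % MS_PER_DAY)

truncate_to_day(1)    # -> 0
truncate_to_day(-1)   # -> -86_400_000, i.e. 1969-12-31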
I see! Thanks for the thorough explanation.
My opinion is that logical data types like date64 should be a semantic layer on top of the physical data. I think that PyArrow should accept the possibility that the physical data doesn't conform to its semantic expectations, so it should be able to work with data with sub-day milliseconds, especially if they come from some foreign, non-pyarrow source.
I think that means that equality should be changed, like you say, since that's a semantic statement. But always truncating the physical data seems too extreme - I'd prefer that PyArrow preserve whatever it was given. Maybe constructors from "raw" sources (Python lists, maybe Pandas Series) should truncate, though?
Anyway - I think I agree that the compute logic should change. It seems likely that many compute operations would need to change, though. For example, all the hash operations - would we need to always truncate before any compute operator is applied?
Yeah, and for that reason it might make more sense to always truncate when constructing from external sources (even for numpy arrays), so that within Arrow we can assume it's always a multiple and don't have to check for this in every kernel.
Constructors from Python objects especially, which wouldn't be zero-copy anyway (like your initial example of pa.array([1, 2, 3], pa.date64())), should definitely truncate correctly on construction IMO.
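Until construction does this, a caller can normalize up front; a hypothetical workaround sketch (date64_from_ms is not a pyarrow API):

import pyarrow as pa

MS_PER_DAY = 86_400_000

def date64_from_ms(values):
    # Floor raw millisecond values to the previous midnight before
    # handing them to the date64 constructor.
    return pa.array([v - (v % MS_PER_DAY) for v in values], pa.date64())

date64_from_ms([1, 2, 3]).cast("int64")   # -> [0, 0, 0]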
Describe the bug, including details regarding any error messages, version, and platform.
In PyArrow, Date64Array values do not maintain precision when being loaded from pandas by pa.array. For example, let's make a date64 array value and convert it to a pandas Series, taking care to avoid using datetime objects:
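The original snippet was lost in formatting; a reconstruction along the lines described (the to_pandas options are assumed):

import pyarrow as pa

date64_array = pa.array([1, 2, 3], pa.date64())
# Convert to pandas without going through datetime.date objects:
date64_pd = date64_array.to_pandas(date_as_object=False)
# ...and load it back with pa.array:
date64_roundtripped = pa.array(date64_pd, pa.date64())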
If one prints pc.subtract(date64_roundtripped, date64_array), you can see that they are different. Note that this does not occur for date32.
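The printed output was also lost; roughly, continuing from the snippet above (the exact output type of the subtraction may vary by version):

import pyarrow.compute as pc

print(pc.subtract(date64_roundtripped, date64_array))
# -> a duration array of -1, -2, -3 milliseconds

# date32 stores whole days, so the same roundtrip is lossless:
date32_array = pa.array([1, 2, 3], pa.date32())
date32_pd = date32_array.to_pandas(date_as_object=False)
date32_roundtripped = pa.array(date32_pd, pa.date32())
print(pc.subtract(date32_roundtripped, date32_array))
# -> 0, 0, 0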
It appears to me that date64_pd is just fine. It prints as this:
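The output block is missing from the report; with the values above it would look roughly like this (the unit may be ns on older pandas/pyarrow versions):

0   1970-01-01 00:00:00.001
1   1970-01-01 00:00:00.002
2   1970-01-01 00:00:00.003
dtype: datetime64[ms]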
One hint at what's going on is to use pa.Array.from_pandas, which actually returns a TimestampArray rather than a Date64Array. The issue might be that the conversion from TimestampArray to Date64Array drops precision.
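A sketch of that check (the exact type repr varies by version):

ts = pa.Array.from_pandas(date64_pd)
print(type(ts))   # <class 'pyarrow.lib.TimestampArray'>
print(ts.type)    # timestamp[ms]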
Component(s)
Python