apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0
320 stars 63 forks source link

Incorrect conversion of pyarrow interval value to datafusion literal #665

Closed timsaucer closed 2 weeks ago

timsaucer commented 1 month ago

Describe the bug When creating a literal interval value from a pyarrow scalar, the values for month, day, and nanoseconds are not correctly assigned in the literal values. The following minimal example will reproduce. This appears to be limited to datafusion-python and not the rust implementation.

To Reproduce

print("Setting 1 month interval:")
pa_interval = pa.scalar((1, 0, 0), type=pa.month_day_nano_interval())
print("pa_interval:", pa_interval)

lit_interval = lit(pa_interval)
print("lit_interval:", lit_interval)

df.select(lit_interval).limit(1).show()

print("Setting 1 day interval:")
pa_interval = pa.scalar((0, 1, 0), type=pa.month_day_nano_interval())
print("pa_interval:", pa_interval)

lit_interval = lit(pa_interval)
print("lit_interval:", lit_interval)

df.select(lit_interval).limit(1).show()

print("Setting 1 nanosecond interval:")
pa_interval = pa.scalar((0, 0, 1), type=pa.month_day_nano_interval())
print("pa_interval:", pa_interval)

lit_interval = lit(pa_interval)
print("lit_interval:", lit_interval)

df.select(lit_interval).limit(1).show()

Produces the following result:

Setting 1 month interval:
pa_interval: MonthDayNano(months=1, days=0, nanoseconds=0)
lit_interval: Expr(IntervalMonthDayNano("1"))
DataFrame()
+-------------------------------------------------------+
| IntervalMonthDayNano("1")                             |
+-------------------------------------------------------+
| 0 years 0 mons 0 days 0 hours 0 mins 0.000000001 secs |
+-------------------------------------------------------+
Setting 1 day interval:
pa_interval: MonthDayNano(months=0, days=1, nanoseconds=0)
lit_interval: Expr(IntervalMonthDayNano("4294967296"))
DataFrame()
+-------------------------------------------------------+
| IntervalMonthDayNano("4294967296")                    |
+-------------------------------------------------------+
| 0 years 0 mons 0 days 0 hours 0 mins 4.294967296 secs |
+-------------------------------------------------------+
Setting 1 nanosecond interval:
pa_interval: MonthDayNano(months=0, days=0, nanoseconds=1)
lit_interval: Expr(IntervalMonthDayNano("18446744073709551616"))
DataFrame()
+-------------------------------------------------------+
| IntervalMonthDayNano("18446744073709551616")          |
+-------------------------------------------------------+
| 0 years 0 mons 1 days 0 hours 0 mins 0.000000000 secs |
+-------------------------------------------------------+

Expected behavior When setting an interval value of 1 month in pyarrow, it should show up as 1 month in the datafusion data frame, and so on for the other values.

Additional context Add any other context about the problem here.

Michael-J-Ward commented 1 month ago

It's probably related to this issue in arrow-rs: Rust Interval definition is incorrect.

Here's a godbolt link demonstrating the "1 month becomes 1 nanosecond" example. (I based that on a comment in a similar thread in duckdb-wasm).

I would suspect that if all code paths use the same impl, then datafusion-python wouldn't notice it, but perhaps that's wrong, or maybe not all code-paths use arrow-rs?

The error occurs in the pyo3 magic as we cross the python -> rust bridge. Notice the python side assert's pyarrow.Scalar, but then the rust-side receives a datafusion::ScalarValue. (aside: is this magic type conversion intentional?)

python side: https://github.com/apache/datafusion-python/blob/67d4cfb847e42724319aeec9014889262e6ea58a/datafusion/__init__.py#L182-L185

rust side: https://github.com/apache/datafusion-python/blob/67d4cfb847e42724319aeec9014889262e6ea58a/src/expr.rs#L229-L232

The error is already present before the rust-method is invoked, adding print statements on both sides of the bridge:

converting: MonthDayNano(months=1, days=0, nanoseconds=0)
converting: IntervalMonthDayNano("1")
timsaucer commented 1 month ago

TODO: If the PR https://github.com/apache/datafusion-python/pull/666 merges in before this issues is corrected, the following examples in the examples/tpch folder will need to be updated