apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

Pyarrow filter pushdowns #735

Closed Michael-J-Ward closed 1 week ago

Michael-J-Ward commented 1 week ago

Which issue does this PR close?

Closes #703.

Rationale for this change

The conversion of IsNull expressions to pyarrow filters had a bug.

datafusion-python users requested pyarrow predicate pushdown support for temporal types.

What changes are included in this PR?

IsNull bug

The conversion incorrectly passed the column expression as an argument to the pyarrow method is_null. This failed silently, and the predicate was excluded from the plan.

The only argument is_null accepts is the nan_is_null flag. There is currently no way for users to pass that in, so suggestions on how to expose it are welcome.

Temporal scalars

Similar to #731, I used ScalarValue::to_pyarrow for the scalar conversion. pyarrow filters can now accept any scalar type that already has an upstream conversion to pyarrow.

Are there any user-facing changes?

Yes: a bugfix for IsNull pushdown, and expanded functionality for temporal scalars in pyarrow filters.

Additional Context

I tested the predicate pushdown in two separate ways.

1) Ensuring that explain plan contains the appropriate string.
2) Ensuring that a query on a partitioned dataset doesn't touch the file.

Both of these seem non-ideal. If you have a suggestion for more efficiently testing this, please share!