Open nevi-me opened 3 years ago
Another possible optimisation: the CAST(mongo_nyc.trip_distance AS Float64)
could be pushed to the source altogether, since it is currently evaluated by both the source and datafusion.
This would work very well with SQL sources.
Most likely because our current filter push down implementation determines which filters to preserve by the columns they access, not by whether the exact filter has been pushed down or not: https://github.com/apache/arrow-datafusion/blob/f24e45fc8ec035e9ec0f6b6a18bb97e5bc0f9a1c/datafusion/src/optimizer/filter_push_down.rs#L474
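The difference between retaining filters by accessed column and retaining them per filter can be sketched as follows. This is a simplified model of the behaviour described in this comment, not the code in filter_push_down.rs; the Filter struct, its fields, and the function names are all hypothetical.

```rust
use std::collections::HashSet;

/// A filter reduced to its SQL text and the columns it reads.
/// Hypothetical type for illustration; this is not DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
struct Filter {
    sql: &'static str,
    columns: Vec<&'static str>,
    /// Whether the (assumed) source can evaluate this filter exactly.
    source_supports: bool,
}

/// Column-based retention: keep every filter that reads a column which is
/// also read by some filter the source could not take.
fn retained_by_column(filters: &[Filter]) -> Vec<&'static str> {
    let unsupported_columns: HashSet<&str> = filters
        .iter()
        .filter(|f| !f.source_supports)
        .flat_map(|f| f.columns.iter().copied())
        .collect();
    filters
        .iter()
        .filter(|f| f.columns.iter().any(|c| unsupported_columns.contains(c)))
        .map(|f| f.sql)
        .collect()
}

/// Per-filter retention: keep only the filters the source could not take.
fn retained_by_filter(filters: &[Filter]) -> Vec<&'static str> {
    filters
        .iter()
        .filter(|f| !f.source_supports)
        .map(|f| f.sql)
        .collect()
}

/// A subset of the filters from this issue, with assumed support flags.
fn example_filters() -> Vec<Filter> {
    vec![
        Filter { sql: "passenger_count > 3", columns: vec!["passenger_count"], source_supports: true },
        Filter { sql: "total_amount < 20.0", columns: vec!["total_amount"], source_supports: true },
        Filter { sql: "passenger_count IS NOT NULL", columns: vec!["passenger_count"], source_supports: true },
        Filter { sql: "-passenger_count < -2", columns: vec!["passenger_count"], source_supports: false },
    ]
}
```

With these inputs, the column-based rule keeps all three passenger_count filters in the plan, while the per-filter rule keeps only -passenger_count < -2.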
I found that the pushdown needs some improvements. Here is the use case I faced: I have a table with some binary data and a column giving the type.
A first view filters the records of type "int". A second view, based on the first one, casts the binary data and applies a WHERE clause to the records.
In the explain plan, the WHERE clause of the second view is pushed down, and it fails because not all records can be cast to "int".
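A minimal sketch of why the evaluation order matters in this use case. The Row type, the sample data, and the use of string parsing as a stand-in for casting binary data to an integer are assumptions made for illustration only.

```rust
use std::num::ParseIntError;

/// One row of the hypothetical table: a type tag and a raw payload.
struct Row {
    kind: &'static str,
    payload: &'static str,
}

/// Stand-in for the cast to "int"; it fails on non-integer payloads.
fn cast_int(payload: &str) -> Result<i64, ParseIntError> {
    payload.parse::<i64>()
}

/// Order implied by the views: filter on the type first, then cast and compare.
fn type_filter_first(rows: &[Row], min: i64) -> Result<Vec<i64>, ParseIntError> {
    let mut out = Vec::new();
    for row in rows.iter().filter(|r| r.kind == "int") {
        let value = cast_int(row.payload)?;
        if value > min {
            out.push(value);
        }
    }
    Ok(out)
}

/// Order after the second view's predicate is pushed below the type filter:
/// the cast now runs on every row and fails on the first non-integer payload.
fn cast_filter_first(rows: &[Row], min: i64) -> Result<Vec<i64>, ParseIntError> {
    let mut out = Vec::new();
    for row in rows {
        let value = cast_int(row.payload)?;
        if value > min && row.kind == "int" {
            out.push(value);
        }
    }
    Ok(out)
}

/// Assumed sample data: two integer rows and one text row.
fn sample_rows() -> Vec<Row> {
    vec![
        Row { kind: "int", payload: "42" },
        Row { kind: "text", payload: "hello" },
        Row { kind: "int", payload: "7" },
    ]
}
```

The two functions apply the same predicates, but only the first respects the ordering the views rely on; the second returns an error for the sample rows because the text row reaches the cast.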
Describe the bug
If using a data source that supports exact filters, we can duplicate filters if not all filters can be pushed down to the source. This happens if we have multiple filters on the same column, but one or more of those filters cannot be pushed to source.
To Reproduce
Using https://github.com/TheDataEngine/datafusion-mongo-connector, and the below SQL query on the NYC dataset:
I am able to pass down the following filters in the where clause:
passenger_count > 3
total_amount < 20.0
cast(trip_distance as float) < 5.00
fare_amount / (total_amount + 0.001) > 0.70
passenger_count is not null
VendorID in ('2', '4')
I don't yet support pushing the below:
-passenger_count < -2
If the query includes the above unsupported filter, the other exact filters are duplicated.
If the query excludes the above negative filter, all filters are pushed down to the source.
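One way to picture the per-filter decision is the sketch below. The PushDown enum is a local stand-in that mirrors the three levels of DataFusion's TableProviderFilterPushDown (Unsupported, Inexact, Exact), redefined here so the example has no dependencies. The supports function is a hypothetical connector policy, not the one used by datafusion-mongo-connector.

```rust
/// Local stand-in mirroring the three support levels a table provider can report.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum PushDown {
    Unsupported,
    Inexact,
    Exact,
}

/// Hypothetical policy: a filter starting with a unary minus is not
/// translated; everything else is assumed to be handled exactly.
fn supports(filter: &str) -> PushDown {
    if filter.trim_start().starts_with('-') {
        PushDown::Unsupported
    } else {
        PushDown::Exact
    }
}

/// Returns (filters sent to the source, filters the engine still evaluates).
fn split_filters<'a>(filters: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    let mut pushed = Vec::new();
    let mut kept = Vec::new();
    for &f in filters {
        match supports(f) {
            // Exact: the source guarantees the result, so the engine can drop it.
            PushDown::Exact => pushed.push(f),
            // Inexact: the source may return extra rows, so evaluate it twice.
            PushDown::Inexact => {
                pushed.push(f);
                kept.push(f);
            }
            // Unsupported: only the engine evaluates it.
            PushDown::Unsupported => kept.push(f),
        }
    }
    (pushed, kept)
}
```

Under this model, an unsupported filter has no effect on whether the exact filters stay in the plan, which is the behaviour this issue asks for.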
The passenger_count filters that are pushed to the source are also evaluated by datafusion.
Expected behavior
Given that the filters are combined with AND, I would expect datafusion to only evaluate the negated condition, as the other conditions (not null, > 3) would be redundant.
Additional context
I'm aware that constant folding will simplify
passenger_count > 3 and -passenger_count < -2
to:
pc > 3 and pc > 2
pc > 3
but before then, we are performing a few redundant calculations because of the duplicated filters.