apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

Optimize filters to remove redundant IsNotNull checks #938

Open andygrove opened 1 week ago

andygrove commented 1 week ago

What is the problem the feature request solves?

I am comparing native query plans between Comet and Ballista for TPC-H q1 and noticed a significant difference between the filter expressions ~and performance~:

Comet (~total filter time 7.2 seconds~):

FilterExec: col_6@6 IS NOT NULL AND col_6@6 <= 1998-09-24

Ballista (~total filter time 3.3 seconds~):

FilterExec: l_shipdate@6 <= 10493

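The differences are the extra IS NOT NULL check (and the AND that joins it) in the Comet filter, and the literal, which Comet displays as a date and Ballista displays as an integer.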

We can likely improve Comet performance by eliding the redundant IsNotNull and And. I am not sure if there is a difference with the date versus int literal, but we should check.
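As a rough illustration of the rewrite being proposed, here is a minimal sketch over a toy expression tree. The classes `Column`, `Literal`, `Compare`, `IsNotNull`, `And` and the function `elide_redundant_is_not_null` are hypothetical names invented for this example; they are not Comet or DataFusion APIs. The sketch assumes the predicate is a top-level conjunction in a filter, where a comparison that evaluates to NULL already drops the row, so an `IsNotNull` on a column that is also compared in the same conjunction adds nothing.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class Compare:
    op: str          # e.g. "<=", "<", "="
    left: Column
    right: Literal


@dataclass(frozen=True)
class IsNotNull:
    child: Column


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


Expr = Union[Column, Literal, Compare, IsNotNull, And]


def conjuncts(expr):
    """Flatten nested And nodes into a list of conjuncts."""
    if isinstance(expr, And):
        return conjuncts(expr.left) + conjuncts(expr.right)
    return [expr]


def elide_redundant_is_not_null(predicate):
    """Drop IsNotNull(col) when col is also compared in the same conjunction.

    Only valid for a top-level filter conjunction (an assumption of this
    sketch); it must not be applied under NOT or OR.
    """
    parts = conjuncts(predicate)
    # Columns whose NULLs are already rejected by a comparison.
    null_rejected = {p.left for p in parts if isinstance(p, Compare)}
    kept = [
        p for p in parts
        if not (isinstance(p, IsNotNull) and p.child in null_rejected)
    ]
    # Rebuild a left-deep And chain from the remaining conjuncts.
    result = kept[0]
    for p in kept[1:]:
        result = And(result, p)
    return result


col = Column("col_6")
before = And(IsNotNull(col), Compare("<=", col, Literal("1998-09-24")))
after = elide_redundant_is_not_null(before)
```

Here `after` is the bare comparison, which mirrors the shape of the Ballista filter shown above. An `IsNotNull` on a column that is not otherwise compared is left in place.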

Describe the potential solution

No response

Additional context

No response

andygrove commented 1 week ago

The Display implementation for ScalarValue changed between DataFusion 37 (the version that Ballista is using) and the version that Comet uses. In the older version, Date32 is shown as an integer literal and now it is shown as a date.
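To check that the two literals are the same value, recall that an Arrow Date32 stores the number of days since 1970-01-01. The small sketch below uses only the Python standard library; the helper names `date_to_date32` and `date32_to_date` are made up for this example.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def date_to_date32(d):
    """Days since the Unix epoch, which is how Date32 encodes a date."""
    return (d - EPOCH).days


def date32_to_date(days):
    """Inverse conversion from a Date32 day count back to a calendar date."""
    return EPOCH + timedelta(days=days)


days = date_to_date32(date(1998, 9, 24))   # 10493
roundtrip = date32_to_date(10493)          # date(1998, 9, 24)
```

So `1998-09-24` in the Comet plan and `10493` in the Ballista plan denote the same Date32 literal, and only the display differs.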

andygrove commented 1 week ago

I tested a prototype of optimizing this filter and saw a 7% improvement in filter time for this query. It seems worth implementing.

parthchandra commented 2 days ago

This might work OK for TPC-H, but TPC-DS data has nulls, so perhaps the null check is required? Does Ballista know about the nullability of the data?
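One way to reason about the question is to model SQL three-valued logic directly. The sketch below is plain Python, not Comet or Ballista code, and the helpers `sql_le`, `sql_and` and `sql_filter` are hypothetical. It assumes the usual SQL filter rule that a row is kept only when the predicate evaluates to TRUE, so both FALSE and NULL drop the row.

```python
from datetime import date


def sql_le(a, b):
    """SQL `a <= b`: NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a <= b


def sql_and(a, b):
    """SQL AND under three-valued logic."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def sql_filter(rows, predicate):
    """Keep a row only when the predicate is exactly TRUE."""
    return [r for r in rows if predicate(r) is True]


cutoff = date(1998, 9, 24)
rows = [date(1998, 9, 1), None, date(1998, 12, 1), date(1998, 9, 24)]

# col IS NOT NULL AND col <= cutoff
with_check = sql_filter(
    rows, lambda d: sql_and(d is not None, sql_le(d, cutoff))
)
# col <= cutoff
without_check = sql_filter(rows, lambda d: sql_le(d, cutoff))
```

Under that assumption both predicates keep the same rows even when the column contains nulls, because the NULL row is rejected by the comparison alone. This equivalence only covers a null check on the same column inside a top-level conjunction; it would not carry over to predicates under NOT or OR, and it says nothing about whether a given engine's filter operator handles NULL results this way.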