apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
826 stars 164 forks source link

Optimize filters to remove redundant IsNotNull checks #938

Open andygrove opened 2 months ago

andygrove commented 2 months ago

What is the problem the feature request solves?

I am comparing native query plans between Comet and Ballista for TPC-H q1 and noticed a significant difference between the filter expressions ~and performance~:

Comet (~total filter time 7.2 seconds~):

FilterExec: col_6@6 IS NOT NULL AND col_6@6 <= 1998-09-24

Ballista (~total filter time 3.3 seconds~):

FilterExec: l_shipdate@6 <= 10493

The differences are:

We can likely improve Comet performance by eliding the redundant IsNotNull and And. I am not sure if there is a difference with the date versus int literal, but we should check.

Describe the potential solution

No response

Additional context

No response

andygrove commented 2 months ago

The Display implementation for ScalarValue changed between DataFusion 37 (the version that Ballista is using) and the version that Comet version. In the older version, Date32 is shown as an integer literal and now it is shown as a date.

andygrove commented 2 months ago

I tested a prototype of optimizing this filter and saw a 7% improvement in filter time for this query. It seems worth implementing.

parthchandra commented 2 months ago

This might work ok for tpc-h but tpc-ds data has nulls and the null check is required perhaps? Does ballista know about the nullability of the data?

andygrove commented 2 months ago

This might work ok for tpc-h but tpc-ds data has nulls and the null check is required perhaps? Does ballista know about the nullability of the data?

Yes, the TPC-H data in this case is known not to contain nulls, as shown in the Parquet schema below, so the IsNotNull check here is redundant. For TPC-DS where the schema allows nulls, we would still need the check.

$ bdt schema lineitem.parquet/
+-----------------+-------------------+-------------+
| column_name     | data_type         | is_nullable |
+-----------------+-------------------+-------------+
| l_orderkey      | Int64             | NO          |
| l_partkey       | Int64             | NO          |
| l_suppkey       | Int64             | NO          |
| l_linenumber    | Int32             | NO          |
| l_quantity      | Decimal128(11, 2) | NO          |
| l_extendedprice | Decimal128(11, 2) | NO          |
| l_discount      | Decimal128(11, 2) | NO          |
| l_tax           | Decimal128(11, 2) | NO          |
| l_returnflag    | Utf8              | NO          |
| l_linestatus    | Utf8              | NO          |
| l_shipdate      | Date32            | NO          |
| l_commitdate    | Date32            | NO          |
| l_receiptdate   | Date32            | NO          |
| l_shipinstruct  | Utf8              | NO          |
| l_shipmode      | Utf8              | NO          |
| l_comment       | Utf8              | NO          |
+-----------------+-------------------+-------------+