Open andygrove opened 2 months ago
The Display
implementation for ScalarValue
changed between DataFusion 37 (the version that Ballista is using) and the version that Comet version. In the older version, Date32 is shown as an integer literal and now it is shown as a date.
I tested a prototype of optimizing this filter and saw a 7% improvement in filter time for this query. It seems worth implementing.
This might work ok for tpc-h but tpc-ds data has nulls and the null check is required perhaps? Does ballista know about the nullability of the data?
This might work ok for tpc-h but tpc-ds data has nulls and the null check is required perhaps? Does ballista know about the nullability of the data?
Yes, the TPC-H data in this case is known not to contain nulls, as shown in the Parquet schema below, so the IsNotNull
check here is redundant. For TPC-DS where the schema allows nulls, we would still need the check.
$ bdt schema lineitem.parquet/
+-----------------+-------------------+-------------+
| column_name | data_type | is_nullable |
+-----------------+-------------------+-------------+
| l_orderkey | Int64 | NO |
| l_partkey | Int64 | NO |
| l_suppkey | Int64 | NO |
| l_linenumber | Int32 | NO |
| l_quantity | Decimal128(11, 2) | NO |
| l_extendedprice | Decimal128(11, 2) | NO |
| l_discount | Decimal128(11, 2) | NO |
| l_tax | Decimal128(11, 2) | NO |
| l_returnflag | Utf8 | NO |
| l_linestatus | Utf8 | NO |
| l_shipdate | Date32 | NO |
| l_commitdate | Date32 | NO |
| l_receiptdate | Date32 | NO |
| l_shipinstruct | Utf8 | NO |
| l_shipmode | Utf8 | NO |
| l_comment | Utf8 | NO |
+-----------------+-------------------+-------------+
What is the problem the feature request solves?
I am comparing native query plans between Comet and Ballista for TPC-H q1 and noticed a significant difference between the filter expressions ~and performance~:
Comet (~total filter time 7.2 seconds~):
Ballista (~total filter time 3.3 seconds~):
The differences are:
We can likely improve Comet performance by eliding the redundant IsNotNull and And. I am not sure if there is a difference with the date versus int literal, but we should check.
Describe the potential solution
No response
Additional context
No response