apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

[EPIC] Support native execution for all TPC-H queries #391

Open andygrove opened 1 month ago

andygrove commented 1 month ago

What is the problem the feature request solves?

We currently fall back to Spark for parts of TPC-H. This epic tracks the work needed to support the missing features for the TPC-H queries so that we can start getting benchmark results.

Status

Updated 5/30/2024 based on PR with BuildRight support and SMJ enabled.

Query Status
q1 Runs natively
q2 Runs natively
q3 Runs natively
q4 Runs natively
q5 #344
q6 Runs natively
q7 #344
q8 #344
q9 Runs natively
q10 Runs natively
q11 Runs natively
q12 Runs natively
q13 Runs natively
q14 Runs natively
q15 Runs natively
q16 https://github.com/apache/datafusion-comet/issues/457
q17 #398
q18 Runs natively
q19 #398
q20 #344
q21 #344, #398
q22 Runs natively
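
As a rough way to read the table above, the following Python sketch tallies how many queries run natively and which issue blocks which query. The `STATUS` dictionary is transcribed by hand from the table (an empty list means "Runs natively"); the helper name `summarize` is only illustrative and not part of Comet.

```python
from collections import defaultdict

# Transcribed from the status table: query -> list of blocking issue numbers.
# An empty list means the query runs natively.
STATUS = {
    "q1": [], "q2": [], "q3": [], "q4": [], "q5": [344], "q6": [],
    "q7": [344], "q8": [344], "q9": [], "q10": [], "q11": [], "q12": [],
    "q13": [], "q14": [], "q15": [], "q16": [457], "q17": [398], "q18": [],
    "q19": [398], "q20": [344], "q21": [344, 398], "q22": [],
}


def summarize(status):
    """Return (native queries, {issue number: [blocked queries]})."""
    native = [query for query, issues in status.items() if not issues]
    blocked_by = defaultdict(list)
    for query, issues in status.items():
        for issue in issues:
            blocked_by[issue].append(query)
    return native, dict(blocked_by)


native, blocked_by = summarize(STATUS)
print(f"{len(native)} of {len(STATUS)} queries run natively")
for issue, queries in sorted(blocked_by.items()):
    print(f"#{issue} blocks: {', '.join(queries)}")
```

Grouping by issue rather than by query makes it easier to see which single fix would unblock the most queries.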

Original configs used (ignore this now)

Used Comet as of commit hash bc6b2cda3efd2b0c6c48f932ce19da46456bcbd5.

Configs used:

--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions
--conf spark.comet.enabled=true
--conf spark.comet.explainFallback.enabled=true
--conf spark.comet.exec.enabled=true
--conf spark.comet.exec.all.enabled=true
--conf spark.comet.exec.all.expr.enabled=true
--conf spark.comet.cast.allowIncompatible=true
--conf spark.comet.exec.broadcast.enabled=true
--conf spark.comet.exec.shuffle.enabled=true
--conf spark.comet.columnar.shuffle.enabled=true
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager
--conf spark.comet.shuffle.enforceMode.enabled=true
--conf spark.sql.adaptive.coalescePartitions.enabled=false
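
For reference, here is a minimal Python sketch of how the flags above could be assembled into a single `spark-submit` command line. The configuration keys and values are copied from the list above (which this issue now says to ignore); the application name `tpch_benchmark.py`, the jar path, and the helper `spark_submit_command` are hypothetical placeholders.

```python
import shlex

# Keys and values copied from the "Configs used" list above.
COMET_CONF = {
    "spark.sql.extensions": "org.apache.comet.CometSparkSessionExtensions",
    "spark.comet.enabled": "true",
    "spark.comet.explainFallback.enabled": "true",
    "spark.comet.exec.enabled": "true",
    "spark.comet.exec.all.enabled": "true",
    "spark.comet.exec.all.expr.enabled": "true",
    "spark.comet.cast.allowIncompatible": "true",
    "spark.comet.exec.broadcast.enabled": "true",
    "spark.comet.exec.shuffle.enabled": "true",
    "spark.comet.columnar.shuffle.enabled": "true",
    "spark.shuffle.manager":
        "org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager",
    "spark.comet.shuffle.enforceMode.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "false",
}


def spark_submit_command(conf, app, jars=None):
    """Build a spark-submit argument list from a dict of Spark configs."""
    cmd = ["spark-submit"]
    if jars:
        cmd += ["--jars", jars]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


# Hypothetical jar path and application script, for illustration only.
cmd = spark_submit_command(
    COMET_CONF, app="tpch_benchmark.py", jars="/path/to/comet-spark.jar"
)
print(shlex.join(cmd))
```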

Describe the potential solution

No response

Additional context

No response

viirya commented 1 month ago

BroadcastExchange should be supported, I think. We have CometBroadcastExchange.

We don't need to support AQEShuffleRead. It is a shuffle reader wrapper in Spark. It calls the wrapped shuffle's execute or executeColumnar, depending on whether it is columnar or not.

viirya commented 1 month ago

We don't need to support Execute CreateViewCommand either. It is a command exec operator.

viirya commented 1 month ago

Also CommandResult, which is only used to hold data from a command. CommandResult and Execute CreateViewCommand are not query execution operators.

andygrove commented 1 month ago

> Also CommandResult, which is only used to hold data from a command. CommandResult and Execute CreateViewCommand are not query execution operators.

Thanks. I saw those from the CREATE VIEW in q15 but I see from the Spark UI that the SELECT part of this query is already fully native. I have removed those from the list.

andygrove commented 1 month ago

> BroadcastExchange should be supported, I think. We have CometBroadcastExchange.

"BroadcastExchange is not supported" is the message that Comet provides for q8. I think part of this epic will be making these messages more informative.

viirya commented 1 month ago

For sort-merge join with a join condition, I added support to DataFusion a while ago, but we have not incorporated the feature into Comet yet. I opened #398 to track it, and I will work on it once #250 is merged and #248 is done.

viirya commented 1 month ago

> "BroadcastExchange is not supported" is the message that Comet provides for q8. I think part of this epic will be making these messages more informative.

I will take a look at q8 and see why it is not enabled there.

andygrove commented 1 month ago

> I will take a look at q8 and see why it is not enabled there.

The error "BroadcastExchange is not supported" really means "BroadcastExchange is not supported because the child operators are not supported".
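
To illustrate what a more informative message could look like, here is a toy Python model of a plan tree that reports the underlying cause instead of only the parent operator. This is not Comet's actual fallback logic: `PlanNode`, `natively_supported`, `root_causes`, `fallback_reason`, and the operator name `HypotheticalUnsupportedScan` are all made up for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlanNode:
    name: str
    natively_supported: bool  # hypothetical flag for this toy model
    children: List["PlanNode"] = field(default_factory=list)


def root_causes(node: PlanNode) -> List[str]:
    """Names of operators in this subtree that are themselves unsupported."""
    causes = [c for child in node.children for c in root_causes(child)]
    if not node.natively_supported:
        causes.append(node.name)
    return causes


def fallback_reason(node: PlanNode) -> Optional[str]:
    """Explain why a node falls back, naming unsupported descendants if any."""
    causes = [c for child in node.children for c in root_causes(child)]
    if causes:
        return (
            f"{node.name} is not supported because child operators "
            f"are not supported: {', '.join(causes)}"
        )
    if not node.natively_supported:
        return f"{node.name} is not supported"
    return None  # the node and everything below it can run natively


# The exchange itself is supported, but something underneath it is not.
plan = PlanNode(
    "BroadcastExchange",
    True,
    [PlanNode("Project", True, [PlanNode("HypotheticalUnsupportedScan", False)])],
)
print(fallback_reason(plan))
```

The point of the sketch is only that the message names the descendant that triggered the fallback, so the reader does not conclude that the parent operator itself is missing.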

viirya commented 1 month ago

Please disable spark.comet.exec.broadcast.enabled, which should not be used in normal queries: https://github.com/apache/datafusion-comet/issues/408#issuecomment-2104818958
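
A small Python sketch of that change, assuming "disable" means setting the flag to false explicitly (simply omitting the flag and relying on the default may be equally valid). The helper `set_conf` is hypothetical; it rewrites one `--conf key=value` pair in a list of spark-submit arguments and leaves the rest untouched.

```python
def set_conf(args, key, value):
    """Return a copy of args with `--conf key=...` replaced, or appended if absent."""
    out = []
    replaced = False
    i = 0
    while i < len(args):
        is_target = (
            args[i] == "--conf"
            and i + 1 < len(args)
            and args[i + 1].split("=", 1)[0] == key
        )
        if is_target:
            out += ["--conf", f"{key}={value}"]
            replaced = True
            i += 2  # skip both the flag and its old key=value
        else:
            out.append(args[i])
            i += 1
    if not replaced:
        out += ["--conf", f"{key}={value}"]
    return out


# A shortened argument list using flags from the original configuration.
args = [
    "--conf", "spark.comet.enabled=true",
    "--conf", "spark.comet.exec.broadcast.enabled=true",
    "--conf", "spark.comet.exec.shuffle.enabled=true",
]
updated = set_conf(args, "spark.comet.exec.broadcast.enabled", "false")
print(" ".join(updated))
```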