Open andygrove opened 4 months ago
Here are some notes comparing the catalog_sales scan+filter+exchange+sort
Spark produces the worst possible query plan for q72 which amplifies the difference in performance. The C2R overhead for comet is amplified because the conversion happens on a dataset that is larger than the source data. To get a reasonable query plan for q72 we need to have spark.sql.cbo.enabled
and spark.sql.cbo.joinReorder.enabled
set to true. This also requires stats.
Also perhaps we can try with a larger broadcast threshold (spark.sql.autoBroadcastJoinThreshold
)?
Irrespective of the plan though, given the same number of input rows are the Comet operators also slower than the corresponding Spark operators?
Spark produces the worst possible query plan for q72
Yes, it does. I am comparing like-for-like plans between Spark and Comet without any join reordering enabled.
Irrespective of the plan though, given the same number of input rows are the Comet operators also slower than the corresponding Spark operators?
In both cases, Spark is executing the SortMergeJoin and the join takes longer when the inputs are from CometScan/CometFilter/CometExchange than if they are from the Spark equivalents (with same number of rows in both cases).
Things I have learned since filing this issue:
More learnings:
What is the problem the feature request solves?
I ran our benchmark derived from TPC-DS @ sf=100 locally and saw that q72 shows the largest regression (measured in seconds rather than percentage) and was 754 seconds (12.5 minutes) slower with Comet enabled. Spark took 1.1 hours, and Comet took 1.3 hours.
This was based on a single run of all 99 queries in Spark and then again with Comet enabled.
Comet does not currently support the many sort-merge joins in the query, so Comet is only performing the initial file scans, filters, and exchanges (and sometimes sorts) before transitioning back to Spark for the joins.
This issue is for discussing possible solutions to avoid this regression.
Describe the potential solution
No response
Additional context
No response