apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

With DataFusion Comet, no performance improvement #1084

Open shaileneF opened 1 week ago

shaileneF commented 1 week ago

env:

host:
  os: CentOS Linux release 8.2.2004 (Core)
  kernel: 4.18.0-193.el8.x86_64
  memory: 1 TB
  jdk: 1.8
  maven: 3.9.6
  spark: 3.4
  scala: 2.12

container:
  os: CentOS Linux release 7.4.1708 (Core)
  kernel: 4.18.0-193.el8.x86_64
  cpu cores: 128
  memory: 1 TB
  jdk: 11
  maven: 3.9.6
  spark: 3.4.3
  scala: 2.12

data: TPC-DS 100 GB / 1 TB

With DataFusion Comet, spark-submit command:

export COMET_JAR=/export/datafusion-test/comet-spark-spark3.4_2.12-0.3.0.jar

$SPARK_HOME/bin/spark-submit \
    --master local \
    --name comet-tpcbench \
    --driver-memory 20G \
    --conf spark.driver.memory=20G \
    --conf spark.executor.instances=16 \
    --conf spark.executor.memory=40G \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=128 \
    --conf spark.task.cpus=1 \
    --conf spark.executor.memoryOverhead=3G \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=40G \
    --jars $COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.enabled=true \
    --conf spark.comet.exec.all.enabled=true \
    --conf spark.comet.cast.allowIncompatible=true \
    --conf spark.comet.exec.shuffle.enabled=true \
    --conf spark.comet.exec.shuffle.mode=auto \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    /export/datafusion-test/datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
    --benchmark tpcds \
    --data /export/cy/test-data-100G-1024/ \
    --queries /export/datafusion-test/datafusion-benchmarks/tpcds/queries-spark \
    --output /export/datafusion-test/output \
    --iterations 2

Without DataFusion Comet, spark-submit command:

$SPARK_HOME/bin/spark-submit \
    --master local \
    --name comet-tpcbench \
    --driver-memory 20G \
    --conf spark.driver.memory=20G \
    --conf spark.executor.instances=16 \
    --conf spark.executor.memory=80G \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=128 \
    --conf spark.task.cpus=1 \
    --conf spark.executor.memoryOverhead=3G \
    --conf spark.memory.offHeap.enabled=false \
    /export/datafusion-test/datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
    --benchmark tpcds \
    --data /export/cy/test-data-100G-1024/ \
    --queries /export/datafusion-test/datafusion-benchmarks/tpcds/queries-spark \
    --output /export/datafusion-test/output \
    --iterations 2

Description:

Hello, I run Spark + DataFusion Comet + TPC-DS in local mode. Whether master is set to local or local[*], DataFusion Comet does not significantly improve performance, and many queries even show negative gains. Could you please help check whether my configuration is incorrect? I tested with the 100 GB and 1 TB TPC-DS datasets, and the improvement with DataFusion Comet is very low: the total query duration improves by only about 6%. My container has 128 cores and 1 TB of memory. 🙏🙏🙏
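
As a sanity check before comparing runtimes, the physical plan can be inspected to confirm that Comet operators are actually replacing the Spark ones. Below is a minimal PySpark sketch, not taken from the original report: it assumes it runs in a session launched with the same Comet flags as the benchmark, and store_sales / ss_item_sk / ss_net_paid are standard TPC-DS names used only as placeholders.

# Minimal sketch: check whether Comet operators show up in the physical plan.
# Assumptions: the data path matches the benchmark data, and the session was
# started with the same Comet configuration as the spark-submit above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.read.parquet("/export/cy/test-data-100G-1024/store_sales") \
     .createOrReplaceTempView("store_sales")

plan = spark.sql(
    "EXPLAIN FORMATTED "
    "SELECT ss_item_sk, SUM(ss_net_paid) FROM store_sales GROUP BY ss_item_sk"
).collect()[0][0]

print(plan)
# If Comet kicked in, operators such as CometScan / CometHashAggregate should
# appear in place of the vanilla Spark Scan / HashAggregate nodes.
print("Comet operators present:", "Comet" in plan)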

andygrove commented 1 week ago

Hi @shaileneF Are you testing with the 0.3.0 release or the latest from the main branch? I am going to be running benchmarks today and tomorrow in preparation for the 0.4.0 release so will share my results with you.

shaileneF commented 1 week ago

> Hi @shaileneF Are you testing with the 0.3.0 release or the latest from the main branch? I am going to be running benchmarks today and tomorrow in preparation for the 0.4.0 release so will share my results with you.

Yes, 0.3.0. I downloaded the release jar from https://datafusion.apache.org/comet/user-guide/installation.html. Thank you for running the benchmark. I would like to know whether my spark-submit config is right or not.
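
One way to cross-check the submit-time flags is to print the Comet-related settings the running session actually sees. A minimal PySpark sketch (assumption: it runs inside a session launched with the same configuration as the benchmark):

# Minimal sketch: print the effective Comet / shuffle / off-heap settings so
# the spark-submit flags can be compared with what the session reports.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for key in [
    "spark.plugins",
    "spark.shuffle.manager",
    "spark.comet.enabled",
    "spark.comet.exec.enabled",
    "spark.comet.exec.shuffle.enabled",
    "spark.comet.exec.shuffle.mode",
    "spark.memory.offHeap.enabled",
    "spark.memory.offHeap.size",
]:
    print(key, "=", spark.conf.get(key, "<not set>"))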

andygrove commented 1 week ago

One more question @shaileneF ... is your data set partitioned by date?

shaileneF commented 5 days ago

> One more question @shaileneF ... is your data set partitioned by date?

The dataset was partitioned during generation, but it was not partitioned by date.

shaileneF commented 5 days ago

> One more question @shaileneF ... is your data set partitioned by date?

Here is the dataset generation script: https://github.com/apache/incubator-gluten/tree/main/tools/workload/tpcds/gen_data/parquet_dataset
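
For reference, if the comparison calls for a date-partitioned layout, a fact table could be rewritten partitioned on its date surrogate key. A minimal PySpark sketch (the input/output paths are placeholders; ss_sold_date_sk is the standard TPC-DS date key for store_sales, and the same pattern would apply to the other fact tables):

# Minimal sketch: rewrite one TPC-DS fact table partitioned by its date key.
# Assumptions: paths are placeholders; repeat for the other fact tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(
    spark.read.parquet("/export/cy/test-data-100G-1024/store_sales")
         .write
         .mode("overwrite")
         .partitionBy("ss_sold_date_sk")
         .parquet("/export/cy/test-data-100G-1024-by-date/store_sales")
)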