apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

With DataFusion Comet, no performance improvement #1084

Open shaileneF opened 1 week ago

shaileneF commented 1 week ago

env:

host:
  os: CentOS Linux release 8.2.2004 (Core)
  kernel: 4.18.0-193.el8.x86_64
  memory: 1 TB
  jdk: 1.8
  maven: 3.9.6
  spark: 3.4
  scala: 2.12

container:
  os: CentOS Linux release 7.4.1708 (Core)
  kernel: 4.18.0-193.el8.x86_64
  cpu cores: 128
  memory: 1 TB
  jdk: 11
  maven: 3.9.6
  spark: 3.4.3
  scala: 2.12

data: TPC-DS 100 GB / 1 TB

With DataFusion Comet, spark-submit command:

export COMET_JAR=/export/datafusion-test/comet-spark-spark3.4_2.12-0.3.0.jar

$SPARK_HOME/bin/spark-submit \
    --master local \
    --name comet-tpcbench \
    --driver-memory 20G \
    --conf spark.driver.memory=20G \
    --conf spark.executor.instances=16 \
    --conf spark.executor.memory=40G \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=128 \
    --conf spark.task.cpus=1 \
    --conf spark.executor.memoryOverhead=3G \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=40G \
    --jars $COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.enabled=true \
    --conf spark.comet.exec.all.enabled=true \
    --conf spark.comet.cast.allowIncompatible=true \
    --conf spark.comet.exec.shuffle.enabled=true \
    --conf spark.comet.exec.shuffle.mode=auto \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    /export/datafusion-test/datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
    --benchmark tpcds \
    --data /export/cy/test-data-100G-1024/ \
    --queries /export/datafusion-test/datafusion-benchmarks/tpcds/queries-spark \
    --output /export/datafusion-test/output \
    --iterations 2

Without DataFusion Comet, spark-submit command:

$SPARK_HOME/bin/spark-submit \
    --master local \
    --name comet-tpcbench \
    --driver-memory 20G \
    --conf spark.driver.memory=20G \
    --conf spark.executor.instances=16 \
    --conf spark.executor.memory=80G \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=128 \
    --conf spark.task.cpus=1 \
    --conf spark.executor.memoryOverhead=3G \
    --conf spark.memory.offHeap.enabled=false \
    /export/datafusion-test/datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
    --benchmark tpcds \
    --data /export/cy/test-data-100G-1024/ \
    --queries /export/datafusion-test/datafusion-benchmarks/tpcds/queries-spark \
    --output /export/datafusion-test/output \
    --iterations 2

Description:

Hello, I run Spark + DataFusion Comet + TPC-DS in local mode. Whether master is set to local or local[*], DataFusion Comet does not significantly improve performance, and many queries even show negative gains. Could you please help check whether my configuration is incorrect? I tested with the 100 GB and 1 TB TPC-DS datasets, and the improvement with DataFusion Comet is very low: the total query duration improves by only about 6%. My container has 128 cores and 1 TB of memory. 🙏🙏🙏
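
As a sanity check before comparing runtimes, the physical plan can be inspected to confirm that Comet operators are actually replacing the Spark ones. Below is a minimal PySpark sketch, not taken from the original report: it assumes it runs in a session launched with the same Comet flags as the benchmark, and store_sales / ss_item_sk / ss_net_paid are standard TPC-DS names used only as placeholders.

# Minimal sketch: check whether Comet operators show up in the physical plan.
# Assumptions: the data path matches the benchmark data, and the session was
# started with the same Comet configuration as the spark-submit above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.read.parquet("/export/cy/test-data-100G-1024/store_sales") \
     .createOrReplaceTempView("store_sales")

plan = spark.sql(
    "EXPLAIN FORMATTED "
    "SELECT ss_item_sk, SUM(ss_net_paid) FROM store_sales GROUP BY ss_item_sk"
).collect()[0][0]

print(plan)
# If Comet kicked in, operators such as CometScan / CometHashAggregate should
# appear in place of the vanilla Spark Scan / HashAggregate nodes.
print("Comet operators present:", "Comet" in plan)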

andygrove commented 1 week ago

Hi @shaileneF Are you testing with the 0.3.0 release or the latest from the main branch? I am going to be running benchmarks today and tomorrow in preparation for the 0.4.0 release so will share my results with you.

shaileneF commented 1 week ago

> Hi @shaileneF Are you testing with the 0.3.0 release or the latest from the main branch? I am going to be running benchmarks today and tomorrow in preparation for the 0.4.0 release so will share my results with you.

Yes, 0.3.0. I downloaded the release jar from https://datafusion.apache.org/comet/user-guide/installation.html. Thank you for running the benchmark. I would like to know whether my spark-submit config is right or not.
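
One way to cross-check the submit-time flags is to print the Comet-related settings the running session actually sees. A minimal PySpark sketch (assumption: it runs inside a session launched with the same configuration as the benchmark):

# Minimal sketch: print the effective Comet / shuffle / off-heap settings so
# the spark-submit flags can be compared with what the session reports.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for key in [
    "spark.plugins",
    "spark.shuffle.manager",
    "spark.comet.enabled",
    "spark.comet.exec.enabled",
    "spark.comet.exec.shuffle.enabled",
    "spark.comet.exec.shuffle.mode",
    "spark.memory.offHeap.enabled",
    "spark.memory.offHeap.size",
]:
    print(key, "=", spark.conf.get(key, "<not set>"))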

andygrove commented 1 week ago

One more question @shaileneF ... is your data set partitioned by date?

shaileneF commented 5 days ago

> One more question @shaileneF ... is your data set partitioned by date?

The dataset was partitioned during generation, but it was not partitioned by date.

shaileneF commented 5 days ago

> One more question @shaileneF ... is your data set partitioned by date?

Here is the dataset generation script: https://github.com/apache/incubator-gluten/tree/main/tools/workload/tpcds/gen_data/parquet_dataset
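
For reference, if the comparison calls for a date-partitioned layout, a fact table could be rewritten partitioned on its date surrogate key. A minimal PySpark sketch (the input/output paths are placeholders; ss_sold_date_sk is the standard TPC-DS date key for store_sales, and the same pattern would apply to the other fact tables):

# Minimal sketch: rewrite one TPC-DS fact table partitioned by its date key.
# Assumptions: paths are placeholders; repeat for the other fact tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(
    spark.read.parquet("/export/cy/test-data-100G-1024/store_sales")
         .write
         .mode("overwrite")
         .partitionBy("ss_sold_date_sk")
         .parquet("/export/cy/test-data-100G-1024-by-date/store_sales")
)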