apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0
1.22k stars 439 forks source link

TPCDS queries on Gluten+Velox in EMR is considerably slower than OSS Spark #4940

Open sagarlakshmipathy opened 8 months ago

sagarlakshmipathy commented 8 months ago

Backend

VL (Velox)

Bug description

[Expected behavior] Faster query runs compared to OSS Spark [actual behavior] OSS Spark runs in half the time taken by Gluten+Velox Spark.

Spark version

None

Spark configurations

Gluten+Velox+Spark

./spark-3.4.1-bin-hadoop3/bin/spark-shell --master yarn --deploy-mode client --driver-memory 19g --executor-memory 19g --executor-cores 5 --num-executors 32 --jars /home/hadoop/hudi-spark3.4-bundle_2.12-0.14.1.jar,/home/hadoop/hudi-benchmarks-0.1-SNAPSHOT.jar --packages org.apache.hadoop:hadoop-aws:3.2.4 --conf spark.plugins=io.glutenproject.GlutenPlugin --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=30g --conf spark.shuffler=org.apache.spark.shuffle.sort.ColumnarShuffleManager --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain --conf spark.sql.catalogImplementation=in-memory --conf spark.ui.proxyBase="" --conf 'spark.eventLog.enabled=true' --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps'

OSS Spark

./spark-3.4.1-bin-hadoop3/bin/spark-shell --master yarn --deploy-mode client --driver-memory 19g --executor-memory 19g --executor-cornum-executors 32 --jars /home/hadoop/hudi-spark3.4-bundle_2.12-0.14.1.jar,/home/hadoop/hudi-benchmarks-0.1-SNAPSHOT.jar --packages org.apache.hadoop:hadoop-aws:3.2.4 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain --conf spark.sql.catalogImplementation=in-memory --conf spark.ui.proxyBase="" --conf 'spark.eventLog.enabled=true' --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps'

System information

Environment: Amazon EMR - 10 workers, 1 driver all m5.4xlarge OS: Amazon Linux 2

Relevant logs

Wondering what you need me to capture that'll help you
zhouyuan commented 8 months ago

Hi @sagarlakshmipathy Can you please also share the performance number per query? on TPCDS the Q72 is still a trouble for gluten and needs some special config. Here's some discussions: https://github.com/apache/incubator-gluten/issues/1775

Are you testing with HUDI tables by any chance? --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog For now the HUDI support is not ready in Gluten. It will actually run with vanilla Spark code, and with a RowtoColumn(memcpy) connect to Gluten native operators. So this will actually bring lots of overhead.

thanks, -yuan

sagarlakshmipathy commented 8 months ago
Query ID Gluten Velox Spark Hudi (ms) OSS Spark Hudi
1 22040 16699
2 60531 33095
3 61031 25965
4 360561 172286
5 140865 72149
6 48038 22890
7 106637 44359
8 45072 19636

I didn't bother running the rest of them. I am testing Hudi tables with Gluten. Is there a gh issue/discussion I can +1 to?

zhouyuan commented 8 months ago

It is quite likely due to the fallback of scanning HUDI tables. Here's the issue tracker for unified data lake design, ICEBERG and DELTA LAKE are now both supported(not 100%) now. https://github.com/apache/incubator-gluten/issues/3378

Thanks, -yuan

my7ym commented 4 months ago

@sagarlakshmipathy Hey, may I know your setups & configurations for running Gluten on EMR? Thanks!