TPCDS queries on Gluten+Velox in EMR are considerably slower than OSS Spark

sagarlakshmipathy commented 6 months ago

Backend

VL (Velox)

Bug description

[Expected behavior] Queries run faster on Gluten+Velox than on OSS Spark. [Actual behavior] OSS Spark runs in half the time taken by Gluten+Velox Spark.

Spark version

None

Spark configurations

Gluten+Velox+Spark

./spark-3.4.1-bin-hadoop3/bin/spark-shell --master yarn --deploy-mode client --driver-memory 19g --executor-memory 19g --executor-cores 5 --num-executors 32 --jars /home/hadoop/hudi-spark3.4-bundle_2.12-0.14.1.jar,/home/hadoop/hudi-benchmarks-0.1-SNAPSHOT.jar --packages org.apache.hadoop:hadoop-aws:3.2.4 --conf spark.plugins=io.glutenproject.GlutenPlugin --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=30g --conf spark.shuffler=org.apache.spark.shuffle.sort.ColumnarShuffleManager --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain --conf spark.sql.catalogImplementation=in-memory --conf spark.ui.proxyBase="" --conf 'spark.eventLog.enabled=true' --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps'

OSS Spark

./spark-3.4.1-bin-hadoop3/bin/spark-shell --master yarn --deploy-mode client --driver-memory 19g --executor-memory 19g --executor-cores 5 --num-executors 32 --jars /home/hadoop/hudi-spark3.4-bundle_2.12-0.14.1.jar,/home/hadoop/hudi-benchmarks-0.1-SNAPSHOT.jar --packages org.apache.hadoop:hadoop-aws:3.2.4 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain --conf spark.sql.catalogImplementation=in-memory --conf spark.ui.proxyBase="" --conf 'spark.eventLog.enabled=true' --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps'
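When comparing two launch commands this long, it can help to diff only the `--conf` settings. The following is a minimal sketch, not part of the original report: the helper names `parse_confs` and `conf_diff` are hypothetical, and the two command strings are abbreviated stand-ins that reuse a few settings from the commands above.

```python
import shlex


def parse_confs(command: str) -> dict:
    """Collect every `--conf key=value` pair from a spark-shell command line."""
    tokens = shlex.split(command)  # handles quoted values such as 'a=b'
    confs = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "--conf" and "=" in value:
            key, _, val = value.partition("=")
            confs[key] = val
    return confs


def conf_diff(left: str, right: str) -> dict:
    """Report conf keys present on one side only, or set to different values."""
    a, b = parse_confs(left), parse_confs(right)
    return {
        "only_left": sorted(set(a) - set(b)),
        "only_right": sorted(set(b) - set(a)),
        "changed": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }


# Abbreviated stand-ins for the two commands (not the full command lines).
gluten_cmd = (
    "spark-shell --conf spark.plugins=io.glutenproject.GlutenPlugin "
    "--conf spark.memory.offHeap.enabled=true "
    "--conf spark.memory.offHeap.size=30g "
    "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer "
    "--conf 'spark.eventLog.enabled=true'"
)
oss_cmd = (
    "spark-shell "
    "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer "
    "--conf 'spark.eventLog.enabled=true'"
)

diff = conf_diff(gluten_cmd, oss_cmd)
print(diff["only_left"])
```

Pasting the full commands into the two strings would list every setting that exists only in the Gluten run, which is a quick way to confirm that the two runs differ only in the Gluten-specific options.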

System information

Environment: Amazon EMR, 10 workers and 1 driver, all m5.4xlarge
OS: Amazon Linux 2

Relevant logs

Wondering what you need me to capture that'll help you

zhouyuan commented 6 months ago

Hi @sagarlakshmipathy, can you please also share the performance numbers per query? On TPCDS, Q72 is still troublesome for Gluten and needs some special config. Here is some discussion: https://github.com/apache/incubator-gluten/issues/1775

Are you testing with HUDI tables by any chance? (--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog) For now, HUDI support is not ready in Gluten. The scan will actually run in vanilla Spark code, with a RowtoColumn (memcpy) conversion connecting it to the Gluten native operators, so this brings a lot of overhead.
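One way to check for this kind of fallback is to look for row-to-columnar transitions in the physical plan text (for example, the output of `df.explain()`). The sketch below is only illustrative: the marker strings are assumptions about how a Gluten build names its transition nodes, the helper `find_marked_lines` is hypothetical, and the sample plan is hand-written rather than taken from a real run.

```python
# Assumed name fragments for row/columnar transition nodes. Adjust these to
# whatever your own plan output actually shows.
TRANSITION_MARKERS = ("RowToVeloxColumnar", "VeloxColumnarToRow")


def find_marked_lines(plan_text: str, markers=TRANSITION_MARKERS) -> list:
    """Return the plan lines that mention any of the given marker strings."""
    return [
        line.strip()
        for line in plan_text.splitlines()
        if any(marker in line for marker in markers)
    ]


# Hand-written, hypothetical plan text (NOT output from a real query).
sample_plan = """\
HashAggregateTransformer
+- ProjectExecTransformer
   +- RowToVeloxColumnar
      +- FileScan parquet
"""

transitions = find_marked_lines(sample_plan)
print(len(transitions), "transition node(s) found")
for line in transitions:
    print(line)
```

If a transition sits directly above the table scan, the scan itself ran outside the native engine and every batch paid the conversion cost described above.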

thanks, -yuan

sagarlakshmipathy commented 6 months ago

Query ID	Gluten Velox Spark Hudi (ms)	OSS Spark Hudi (ms)
1	22040	16699
2	60531	33095
3	61031	25965
4	360561	172286
5	140865	72149
6	48038	22890
7	106637	44359
8	45072	19636

I didn't bother running the rest of them. I am testing Hudi tables with Gluten. Is there a gh issue/discussion I can +1 to?
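To put the table above into ratios, here is a small sketch that computes the per-query and overall slowdown. It assumes both columns are wall-clock times in milliseconds (the header only states the unit for the first column); the variable and function names are illustrative.

```python
# (query id, Gluten+Velox time, OSS Spark time), copied from the table above.
# Assumption: both columns are in milliseconds.
RESULTS = [
    (1, 22040, 16699),
    (2, 60531, 33095),
    (3, 61031, 25965),
    (4, 360561, 172286),
    (5, 140865, 72149),
    (6, 48038, 22890),
    (7, 106637, 44359),
    (8, 45072, 19636),
]


def slowdowns(results):
    """Map query id -> Gluten time divided by OSS Spark time."""
    return {query: gluten / oss for query, gluten, oss in results}


ratios = slowdowns(RESULTS)
total_gluten = sum(gluten for _, gluten, _ in RESULTS)
total_oss = sum(oss for _, _, oss in RESULTS)
overall = total_gluten / total_oss

for query, ratio in ratios.items():
    print(f"q{query}: {ratio:.2f}x")
print(f"overall: {overall:.2f}x")
```

On these eight queries the Gluten run is roughly 2x slower overall, ranging from about 1.3x on query 1 to about 2.4x on query 7.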

zhouyuan commented 6 months ago

It is quite likely due to the fallback when scanning HUDI tables. Here's the issue tracker for the unified data lake design; ICEBERG and DELTA LAKE are both supported now (not 100%). https://github.com/apache/incubator-gluten/issues/3378

Thanks, -yuan

my7ym commented 2 months ago

@sagarlakshmipathy Hey, may I know your setups & configurations for running Gluten on EMR? Thanks!
