apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL] Performance degraded when running with CentOS 9 build #6943

Open shivangi24 opened 3 weeks ago

shivangi24 commented 3 weeks ago

Backend

VL (Velox)

Bug description

We are currently integrating Gluten into our WatsonX.Data Spark environment. However, after enabling Gluten and running the TPC-H benchmark at the 100G scale, we are not seeing the performance improvements claimed in the Gluten repository: we observe only a 10-12% improvement, whereas a roughly 2x improvement is expected.

Here are the details of our environment:

  1. Gluten was built on CentOS 9.
  2. The built jar and shared libraries are being utilized on Docker images based on UBI-9.
  3. Our Spark application is running with 2 executors, each configured with 6 cores and 24GB of memory.
  4. We are processing TPC-H data at the 100G scale, with the data stored in Iceberg format.
  5. We are using Java 17.

We have experimented with various configurations, but the performance gain has not exceeded 10-12% across all 22 queries. We have attached a graph showing the performance comparison between runs with and without Gluten.

[Graph: TPC-H 100G query times with and without Gluten]

Also attaching the Spark event log for a single query (Q6): f2b74f64-bdfe-42ba-a6f7-ad81028cb2d7_events.zip. cc: @deepashreeraghu @majetideepak
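
For context, a minimal sketch of how that single query (TPC-H Q6) is run; the schema name `tpch100` is only an assumption for illustration, while the `lakehouse` catalog comes from the configuration below:

```scala
// Minimal sketch of TPC-H Q6 as submitted through spark.sql in spark-shell.
// The schema name "tpch100" is assumed for illustration; adjust to the
// actual Iceberg namespace.
val q6 =
  """SELECT sum(l_extendedprice * l_discount) AS revenue
    |FROM lakehouse.tpch100.lineitem
    |WHERE l_shipdate >= DATE '1994-01-01'
    |  AND l_shipdate <  DATE '1995-01-01'
    |  AND l_discount BETWEEN 0.05 AND 0.07
    |  AND l_quantity < 24""".stripMargin

val start = System.nanoTime()
spark.sql(q6).collect()
println(s"Q6 wall time: ${(System.nanoTime() - start) / 1e9} s")
```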

Spark version

Spark-3.4.x

Spark configurations

### Spark configs for driver and executor
"spark.executor.cores": "6",
"spark.executor.memory": "24G",
"spark.driver.cores": "6",
"spark.driver.memory": "24G",
"spark.driver.extraClassPath": "/opt/ibm/spark/external-jars/gluten-velox-bundle-spark3.4_2.12-centos_9_x86_64-1.2.0-SNAPSHOT.jar",
"spark.executor.extraClassPath": "/opt/ibm/spark/external-jars/gluten-velox-bundle-spark3.4_2.12-centos_9_x86_64-1.2.0-SNAPSHOT.jar",
"spark.hive.metastore.uris": "thrift://<hive-metastore-URL>",
"spark.sql.defaultCatalog": "lakehouse",
"spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
"spark.sql.catalog.lakehouse": "org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.lakehouse.type": "hive",
"spark.sql.iceberg.vectorization.enabled": "false",
"spark.hive.metastore.client.auth.mode": "PLAIN",
"spark.hive.metastore.client.plain.username": "<metastore username>",
"spark.hive.metastore.client.plain.password": "<metastore password>",
"spark.hive.metastore.use.SSL": "true",
"spark.hive.metastore.truststore.type": "JKS",
"spark.hive.metastore.truststore.path": "file:///opt/ibm/jdk/lib/security/cacerts",
"spark.hive.metastore.truststore.password": "changeit",

### Main Gluten configs
"spark.gluten.enabled": "true",
"spark.plugins": "org.apache.gluten.GlutenPlugin",
"spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager",
"spark.gluten.loadLibFromJar": "true",
"spark.gluten.sql.columnar.forceShuffledHashJoin": "true",
"spark.gluten.sql.columnar.backend.lib": "velox",

### Java-related updates
"spark.driver.extraJavaOptions": "-Dio.netty.tryReflectionSetAccessible=true -XX:MaxDirectMemorySize=1G -Djdk.nio.maxCachedBufferSize=262144",
"spark.executor.extraJavaOptions": "-Dio.netty.tryReflectionSetAccessible=true -XX:MaxDirectMemorySize=1G -Djdk.nio.maxCachedBufferSize=262144",

### Fallback-related
"spark.gluten.sql.columnar.joinOptimizationLevel": "18",
"spark.gluten.sql.columnar.physicalJoinOptimizeEnable": "true",
"spark.gluten.sql.columnar.physicalJoinOptimizationLevel": "18",
"spark.gluten.sql.columnar.logicalJoinOptimizeEnable": "true",
"spark.gluten.sql.columnar.logicalJoinOptimizationLevel": "19",
"spark.gluten.sql.columnar.fallback.expressions.threshold": "2",

### Memory-related
"spark.memory.offHeap.enabled": "true",
"spark.executor.memoryOverheadFactor": "75",
"spark.memory.offHeap.size": "18g",

### AQE-related
"spark.sql.adaptive.enabled": "true",
"spark.gluten.sql.columnar.shuffle.writeEOS": "false",
"spark.gluten.sql.columnar.backend.ch.shuffle.hash.algorithm": "sparkMurmurHash3_32",

### Shuffle and Compression-related
"spark.shuffle.compress": "true",
"spark.gluten.sql.columnar.shuffle.compressionMode": "buffer",
"spark.sql.optimizer.runtime.bloomFilter.enabled": "true",
"spark.sql.optimizer.runtime.bloomFilter.applicationSideScanSizeThreshold": "1KB",
"spark.gluten.sql.columnar.force.hashagg": "false",

Ran with 2 executors, each with 6 cores and 24G of memory.
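
For reference only, a minimal sketch of how the main Gluten and off-heap settings above would look when set programmatically on the session builder (the bundle jar still has to be on the driver and executor classpath; the application name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

// Sketch mirroring the "Main Gluten" and memory-related entries listed above;
// not an additional tuning recommendation.
val spark = SparkSession.builder()
  .appName("tpch-100g-gluten") // arbitrary name, for illustration
  .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
  .config("spark.shuffle.manager",
    "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
  .config("spark.gluten.enabled", "true")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "18g")
  .getOrCreate()
```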

System information

No response

Relevant logs

We observed that one stage took significantly longer to complete than the others. Could you please help investigate the cause of the delay?

[Screenshots: Spark UI stages view showing the long-running stage]

zhztheplayer commented 3 weeks ago

Could you locate the stage on the SQL UI? Then we can see some basic metrics.

It's likely related to the scan's fallback, possibly caused by either a slow scan or a slow R2C (row-to-columnar conversion). Checking the metrics should help identify which.
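
For example, a rough (version-dependent, not an official Gluten API) way to see where the executed plan transitions between row and columnar execution, reusing the Q6 text from the sketch earlier in the thread:

```scala
// Rough heuristic: run the query once so AQE finalizes the plan, then look
// for row<->columnar transitions in the executed plan string; these usually
// mark the boundaries of operators that fell back to vanilla Spark.
// Exact node names vary across Gluten/Spark versions.
val df = spark.sql(q6) // q6 = the query text from the sketch above
df.collect()

val planText = df.queryExecution.executedPlan.toString
val transitions = planText.split("\n").count(line =>
  line.contains("ColumnarToRow") || line.contains("RowToColumnar"))

println(s"row/columnar transitions in the executed plan: $transitions")
println(planText)
```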

pratham76 commented 3 weeks ago

Attaching screenshots of the stage that took longer, along with its metrics.

[Screenshots (2024-08-21): the long-running stage and its task metrics]

zhztheplayer commented 2 weeks ago

Hi @pratham76, thanks.

Also, could we view the query on the SQL UI? There is a SQL / DataFrame tab on the right-hand side of the tab bar.

[Screenshot: location of the SQL / DataFrame tab in the Spark UI]