Open sagarlakshmipathy opened 8 months ago
Hi @sagarlakshmipathy, can you please also share the performance numbers per query? On TPC-DS, Q72 is still troublesome for Gluten and needs some special configuration. Here is some discussion: https://github.com/apache/incubator-gluten/issues/1775
Are you testing with HUDI tables by any chance?
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
For now, Hudi support is not ready in Gluten. The Hudi scan will actually run with vanilla Spark code, with a RowToColumnar (memcpy) transition connecting it to Gluten's native operators, so this brings a lot of overhead.
thanks, -yuan
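A quick way to tell whether a job is in this situation is to look for the `--conf` line quoted above in the submit command. The snippet below is only a minimal sketch: it parses `--conf key=value` pairs from a command-line string and checks whether the session catalog is set to `HoodieCatalog`. The helper names and the sample commands are hypothetical and are not part of Gluten or Hudi.

```python
import shlex

# Catalog class taken from the --conf line quoted in the comment above.
HUDI_CATALOG = "org.apache.spark.sql.hudi.catalog.HoodieCatalog"


def parse_spark_confs(command: str) -> dict:
    """Collect key=value pairs passed via --conf in a spark-submit style command."""
    tokens = shlex.split(command)
    confs = {}
    for i, tok in enumerate(tokens):
        if tok == "--conf" and i + 1 < len(tokens):
            key, sep, value = tokens[i + 1].partition("=")
            if sep:  # skip malformed entries that have no '='
                confs[key] = value
    return confs


def uses_hudi_catalog(command: str) -> bool:
    """True if the session catalog is configured as Hudi's HoodieCatalog."""
    confs = parse_spark_confs(command)
    return confs.get("spark.sql.catalog.spark_catalog") == HUDI_CATALOG


# Hypothetical commands, for illustration only.
hudi_cmd = (
    "spark-submit "
    "--conf spark.sql.catalog.spark_catalog=" + HUDI_CATALOG + " app.py"
)
plain_cmd = "spark-submit --conf spark.executor.memory=4g app.py"

print(uses_hudi_catalog(hudi_cmd))
print(uses_hudi_catalog(plain_cmd))
```

If the check is true, the scans are likely to take the vanilla Spark path described above, so a slowdown relative to OSS Spark would not be surprising.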
Query ID | Gluten Velox Spark Hudi (ms) | OSS Spark Hudi (ms) |
---|---|---|
1 | 22040 | 16699 |
2 | 60531 | 33095 |
3 | 61031 | 25965 |
4 | 360561 | 172286 |
5 | 140865 | 72149 |
6 | 48038 | 22890 |
7 | 106637 | 44359 |
8 | 45072 | 19636 |
I didn't bother running the rest of them. I am testing Hudi tables with Gluten. Is there a gh issue/discussion I can +1 to?
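To put a number on the gap, the timings from the table above can be turned into per-query and overall slowdown ratios. This is just a small sketch using the eight reported queries, assuming both columns are in milliseconds; the variable and function names are illustrative.

```python
# (Gluten+Velox ms, OSS Spark ms) per query, copied from the table above.
timings_ms = {
    1: (22040, 16699),
    2: (60531, 33095),
    3: (61031, 25965),
    4: (360561, 172286),
    5: (140865, 72149),
    6: (48038, 22890),
    7: (106637, 44359),
    8: (45072, 19636),
}


def summarize(timings):
    """Return per-query slowdown ratios and the ratio of total runtimes."""
    per_query = {q: gluten / oss for q, (gluten, oss) in timings.items()}
    total_gluten = sum(gluten for gluten, _ in timings.values())
    total_oss = sum(oss for _, oss in timings.values())
    return per_query, total_gluten / total_oss


per_query, overall = summarize(timings_ms)
for q, ratio in per_query.items():
    print(f"Q{q}: Gluten+Velox took {ratio:.2f}x the OSS Spark time")
print(f"Overall (Q1-Q8): {overall:.2f}x")
```

On these eight queries the ratio ranges from about 1.32x (Q1) to about 2.40x (Q7), and about 2.08x overall, which is consistent with the "half the time" description in the report.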
It is quite likely due to the fallback when scanning Hudi tables. Here is the issue tracker for the unified data lake design; Iceberg and Delta Lake are both supported now (though not 100%). https://github.com/apache/incubator-gluten/issues/3378
Thanks, -yuan
@sagarlakshmipathy Hey, may I know your setups & configurations for running Gluten on EMR? Thanks!
Backend
VL (Velox)
Bug description
[Expected behavior] Faster query runs compared to OSS Spark.
[Actual behavior] OSS Spark runs in half the time taken by Gluten+Velox Spark.
Spark version
None
Spark configurations
Gluten+Velox+Spark
OSS Spark
System information
Environment: Amazon EMR - 10 workers, 1 driver, all m5.4xlarge
OS: Amazon Linux 2
Relevant logs