apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[GLUTEN-3839][CH] Extend nested column pruning in vanilla spark #7992

Open taiyang-li opened 2 days ago

taiyang-li commented 2 days ago

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #3839)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

github-actions[bot] commented 2 days ago

https://github.com/apache/incubator-gluten/issues/3839

github-actions[bot] commented 2 days ago

Run Gluten Clickhouse CI on x86

taiyang-li commented 2 days ago

Performance comparison:

CREATE TEMPORARY VIEW test_table
USING org.apache.spark.sql.parquet
OPTIONS (
  path "/data1/liyang/cppproject/spark/spark-3.3.2-bin-hadoop3/bigo_live_user_event"
);

SELECT
  CASE WHEN event.log_extra['tab_type'] IN (5) THEN '1' ELSE '0' END AS entrance
FROM test_table
LATERAL VIEW explode(events) AS event
WHERE event.log_extra['action'] IN (13);

set spark.gluten.sql.extendedGeneratorNestedColumnAliasing = true;
No rows selected (0.546 seconds)

set spark.gluten.sql.extendedGeneratorNestedColumnAliasing = false;
No rows selected (9.326 seconds)
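
For a quick read of the two timings above, here is a small sketch that works out the ratio. The variable names are made up for illustration, and each number is a single run copied from this comment, so treat the result as a rough indication rather than a benchmark.

```python
# Wall-clock times reported above, in seconds (one run each).
pruning_on_s = 0.546   # extendedGeneratorNestedColumnAliasing = true
pruning_off_s = 9.326  # extendedGeneratorNestedColumnAliasing = false

# Ratio and absolute difference between the two runs.
speedup = pruning_off_s / pruning_on_s
saved_s = pruning_off_s - pruning_on_s

print(f"speedup: {speedup:.1f}x, time saved: {saved_s:.2f} s")
```

Both runs return no rows, so the gap most likely comes from how much of the `events` column has to be read and decoded, not from result size.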

taiyang-li commented 1 day ago

Another performance comparison, this time on production. The difference is not obvious because the pruned columns are small. Note the output bytes of the scan operator (8.6 TB vs 8.7 TB).

Query: d_12768_1.sql

Run query with set spark.gluten.sql.extendedGeneratorNestedColumnAliasing = true;

SubstraitFileSourceStep (read local files)
  Header: uid Nullable(Int64)
          country Nullable(String)
          events Nullable(Array(Nullable(Tuple(event_id Nullable(String), log_extra Nullable(Map(String, Nullable(String))), event_info Nullable(Map(String, Nullable(String)))))))
          day Nullable(String)

Run query with set spark.gluten.sql.extendedGeneratorNestedColumnAliasing = false;

SubstraitFileSourceStep (read local files)
  Header: uid Nullable(Int64)
          country Nullable(String)
          events Nullable(Array(Nullable(Tuple(time Nullable(Int64), lng Nullable(Int64), lat Nullable(Int64), net Nullable(String), event_id Nullable(String), log_extra Nullable(Map(String, Nullable(String))), event_info Nullable(Map(String, Nullable(String)))))))
          day Nullable(String)
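
To make the difference between the two scan headers easier to see, here is a toy model of the pruning step. It is only a sketch, not the Gluten or Spark implementation: `prune_struct_fields` is a hypothetical helper, and the list of referenced fields is an assumption inferred from the pruned header above.

```python
# Sketch only: a toy model of nested column pruning, not Gluten's or Spark's code.

# Field names of the struct inside `events`, as listed in the unpruned scan header.
FULL_EVENT_FIELDS = ["time", "lng", "lat", "net", "event_id", "log_extra", "event_info"]


def prune_struct_fields(all_fields, referenced):
    """Keep only the struct fields a query references, preserving schema order."""
    wanted = set(referenced)
    unknown = wanted - set(all_fields)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return [name for name in all_fields if name in wanted]


# Assumption: the production query touches exactly these three fields,
# which is what the pruned scan header suggests.
referenced_fields = ["log_extra", "event_id", "event_info"]

pruned = prune_struct_fields(FULL_EVENT_FIELDS, referenced_fields)
dropped = [name for name in FULL_EVENT_FIELDS if name not in pruned]

print("kept:   ", pruned)
print("dropped:", dropped)
```

In this model the dropped fields are `time`, `lng`, `lat` and `net`, three Int64 values and one String per event, which fits the small change in scan output bytes reported above.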