[GLUTEN-3839][CH] Extend nested column pruning in vanilla spark

taiyang-li commented 2 days ago

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #3839)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

github-actions[bot] commented 2 days ago

https://github.com/apache/incubator-gluten/issues/3839

github-actions[bot] commented 2 days ago

Run Gluten Clickhouse CI on x86

taiyang-li commented 2 days ago

Performance comparison:

CREATE TEMPORARY VIEW test_table
USING org.apache.spark.sql.parquet
OPTIONS (
  path "/data1/liyang/cppproject/spark/spark-3.3.2-bin-hadoop3/bigo_live_user_event"
);

select
  case when event.log_extra['tab_type'] in (5) then '1' else '0' end as entrance
from test_table
lateral view explode(events) as event
where event.log_extra['action'] in (13);

set spark.gluten.sql.extendedGeneratorNestedColumnAliasing = true;
No rows selected (0.546 seconds)

set spark.gluten.sql.extendedGeneratorNestedColumnAliasing = false;
No rows selected (9.326 seconds)
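
For reference, the ratio between the two timings above can be worked out with a short Python sketch. The variable names are illustrative only, and a single run per setting is a rough indication rather than a benchmark.

```python
# Wall-clock timings copied from the two runs above (seconds).
enabled_s = 0.546   # spark.gluten.sql.extendedGeneratorNestedColumnAliasing = true
disabled_s = 9.326  # spark.gluten.sql.extendedGeneratorNestedColumnAliasing = false

# Ratio of the slower run to the faster run.
speedup = disabled_s / enabled_s
print(f"speedup: {speedup:.1f}x")  # prints "speedup: 17.1x"
```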

taiyang-li commented 1 day ago

Another performance comparison, this time on production. The improvement is small because the pruned columns are small. Note the output bytes of the scan operator (8.6 TB vs 8.7 TB).

Query: d_12768_1.sql

Run query with set spark.gluten.sql.extendedGeneratorNestedColumnAliasing = true;

SubstraitFileSourceStep (read local files)
  Header: uid Nullable(Int64)
          country Nullable(String)
          events Nullable(Array(Nullable(Tuple(event_id Nullable(String), log_extra Nullable(Map(String, Nullable(String))), event_info Nullable(Map(String, Nullable(String)))))))
          day Nullable(String)

Run query with set spark.gluten.sql.extendedGeneratorNestedColumnAliasing = false

SubstraitFileSourceStep (read local files)
  Header: uid Nullable(Int64)
          country Nullable(String)
          events Nullable(Array(Nullable(Tuple(time Nullable(Int64), lng Nullable(Int64), lat Nullable(Int64), net Nullable(String), event_id Nullable(String), log_extra Nullable(Map(String, Nullable(String))), event_info Nullable(Map(String, Nullable(String)))))))
          day Nullable(String)
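
To make the difference between the two scan headers explicit, here is a minimal Python sketch that diffs the field lists of the `events` element struct. The field names are copied from the headers above; the helper `pruned_fields` is hypothetical and only illustrates the comparison, it is not part of Gluten or Spark.

```python
# Fields of the `events` element struct, copied from the two scan headers above.
fields_disabled = ["time", "lng", "lat", "net", "event_id", "log_extra", "event_info"]
fields_enabled = ["event_id", "log_extra", "event_info"]


def pruned_fields(before, after):
    """Return the fields present in `before` but absent from `after`, in original order."""
    kept = set(after)
    return [name for name in before if name not in kept]


pruned = pruned_fields(fields_disabled, fields_enabled)
print(pruned)  # ['time', 'lng', 'lat', 'net']
```

The four fields dropped from the scan are three Int64 values and one String, which is consistent with the small change in scan output bytes noted above.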

[GLUTEN-3839][CH] Extend nested column pruning in vanilla spark #7992