0: jdbc:hive2://localhost:10000/> set spark.sql.planChangeLog.level = error;
+--------------------------------+--------+
|              key               | value  |
+--------------------------------+--------+
| spark.sql.planChangeLog.level  | error  |
+--------------------------------+--------+
1 row selected (0.884 seconds)
0: jdbc:hive2://localhost:10000/> set spark.gluten.enabled = true;
+-----------------------+--------+
|          key          | value  |
+-----------------------+--------+
| spark.gluten.enabled  | true   |
+-----------------------+--------+
1 row selected (0.091 seconds)
0: jdbc:hive2://localhost:10000/>
0: jdbc:hive2://localhost:10000/> CREATE TABLE aj (
. . . . . . . . . . . . . . . . > country STRING,
. . . . . . . . . . . . . . . . > event STRUCT<time:BIGINT, lng:BIGINT, lat:BIGINT, net:STRING,
. . . . . . . . . . . . . . . . > log_extra:MAP<STRING, STRING>, event_id:STRING, event_info:MAP<STRING, STRING>>
. . . . . . . . . . . . . . . . > )
. . . . . . . . . . . . . . . . > USING orc;
+---------+
| Result |
+---------+
+---------+
No rows selected (0.981 seconds)
0: jdbc:hive2://localhost:10000/>
0: jdbc:hive2://localhost:10000/> INSERT INTO aj VALUES
. . . . . . . . . . . . . . . . > ('USA', named_struct('time', 1622547800, 'lng', -122, 'lat', 37, 'net',
. . . . . . . . . . . . . . . . > 'wifi', 'log_extra', map('key1', 'value1'), 'event_id', 'event1',
. . . . . . . . . . . . . . . . > 'event_info', map('tab_type', '5', 'action', '13'))),
. . . . . . . . . . . . . . . . > ('Canada', named_struct('time', 1622547801, 'lng', -79, 'lat', 43, 'net',
. . . . . . . . . . . . . . . . > '4g', 'log_extra', map('key2', 'value2'), 'event_id', 'event2',
. . . . . . . . . . . . . . . . > 'event_info', map('tab_type', '4', 'action', '12')));
+---------+
| Result |
+---------+
+---------+
No rows selected (2.959 seconds)
0: jdbc:hive2://localhost:10000/> ;
0: jdbc:hive2://localhost:10000/>
0: jdbc:hive2://localhost:10000/> explain extended
. . . . . . . . . . . . . . . . > SELECT * FROM (
. . . . . . . . . . . . . . . . > SELECT
. . . . . . . . . . . . . . . . > game_name,
. . . . . . . . . . . . . . . . > CASE WHEN
. . . . . . . . . . . . . . . . > event.event_info['tab_type'] IN (5) THEN '1' ELSE '0' END AS entrance
. . . . . . . . . . . . . . . . > FROM aj
. . . . . . . . . . . . . . . . > LATERAL VIEW explode(split(country, ', ')) game_name AS game_name
. . . . . . . . . . . . . . . . > WHERE event.event_info['action'] IN (13)
. . . . . . . . . . . . . . . . > ) WHERE game_name = 'xxx';
+----------------------------------------------------+
|                        plan                        |
+----------------------------------------------------+
| == Parsed Logical Plan ==
'Project [*]
+- 'Filter ('game_name = xxx)
+- 'SubqueryAlias __auto_generated_subquery_name
+- 'Project ['game_name, CASE WHEN 'event.event_info[tab_type] IN (5) THEN 1 ELSE 0 END AS entrance#34]
+- 'Filter 'event.event_info[action] IN (13)
+- 'Generate 'explode('split('country, , )), false, game_name, ['game_name]
+- 'UnresolvedRelation [aj], [], false
== Analyzed Logical Plan ==
game_name: string, entrance: string
Project [game_name#42, entrance#34]
+- Filter (game_name#42 = xxx)
+- SubqueryAlias __auto_generated_subquery_name
+- Project [game_name#42, CASE WHEN cast(event#41.event_info[tab_type] as string) IN (cast(5 as string)) THEN 1 ELSE 0 END AS entrance#34]
+- Filter cast(event#41.event_info[action] as string) IN (cast(13 as string))
+- Generate explode(split(country#40, , , -1)), false, game_name, [game_name#42]
+- SubqueryAlias spark_catalog.default.aj
+- Relation default.aj[country#40,event#41] orc
== Optimized Logical Plan ==
Project [game_name#42, CASE WHEN (_extract_event_info#46[tab_type] = 5) THEN 1 ELSE 0 END AS entrance#34]
+- Filter (game_name#42 = xxx)
+- Generate explode(split(country#40, , , -1)), [0], false, game_name, [game_name#42]
+- Project [country#40, event#41.event_info AS _extract_event_info#46]
+- Filter (isnotnull(event#41.event_info) AND (event#41.event_info[action] = 13))
+- Relation default.aj[country#40,event#41] orc
== Physical Plan ==
CHNativeColumnarToRow
+- ^(1) ProjectExecTransformer [game_name#42, CASE WHEN (_extract_event_info#46[tab_type] = 5) THEN 1 ELSE 0 END AS entrance#34]
+- ^(1) FilterExecTransformer (game_name#42 = xxx)
+- ^(1) CHGenerateExecTransformer explode(split(country#40, , , -1)), [_extract_event_info#46], false, [game_name#42]
+- ^(1) ProjectExecTransformer [country#40, event#41.event_info AS _extract_event_info#46]
+- ^(1) FilterExecTransformer (isnotnull(event#41.event_info) AND (event#41.event_info[action] = 13))
+- ^(1) NativeFileScan orc default.aj[country#40,event#41] Batched: true, DataFilters: [isnotnull(event#41.event_info), (event#41.event_info[action] = 13)], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/data1/liyang/cppproject/spark/spark-3.3.2-bin-hadoop3/spark-ware..., PartitionFilters: [], PushedFilters: [IsNotNull(event.event_info)], ReadSchema: struct<country:string,event:struct<event_info:map<string,string>>>
|
+----------------------------------------------------+
1 row selected (1.059 seconds)
0: jdbc:hive2://localhost:10000/> ;
0: jdbc:hive2://localhost:10000/> desc aj;
+-----------+----------------------------------------------------+----------+
| col_name  |                     data_type                      | comment  |
+-----------+----------------------------------------------------+----------+
| country   | string                                             | NULL     |
| event | struct<time:bigint,lng:bigint,lat:bigint,net:string,log_extra:map<string,string>,event_id:string,event_info:map<string,string>> | NULL |
+-----------+----------------------------------------------------+----------+
2 rows selected (0.253 seconds)
0: jdbc:hive2://localhost:10000/>
0: jdbc:hive2://localhost:10000/> explain formatted
. . . . . . . . . . . . . . . . > SELECT * FROM (
. . . . . . . . . . . . . . . . > SELECT
. . . . . . . . . . . . . . . . > game_name,
. . . . . . . . . . . . . . . . > CASE WHEN
. . . . . . . . . . . . . . . . > event.event_info['tab_type'] IN (5) THEN '1' ELSE '0' END AS entrance
. . . . . . . . . . . . . . . . > FROM aj
. . . . . . . . . . . . . . . . > LATERAL VIEW explode(split(nvl(event.event_info['game_name'],'0'),',')) game_name as game_name
. . . . . . . . . . . . . . . . > WHERE event.event_info['action'] IN (13)
. . . . . . . . . . . . . . . . > ) WHERE game_name = 'xxx';
+----------------------------------------------------+
|                        plan                        |
+----------------------------------------------------+
| == Physical Plan ==
CHNativeColumnarToRow (8)
+- ^ ProjectExecTransformer (6)
+- ^ FilterExecTransformer (5)
+- ^ CHGenerateExecTransformer (4)
+- ^ ProjectExecTransformer (3)
+- ^ FilterExecTransformer (2)
+- ^ Scan orc default.aj (1)
(1) Scan orc default.aj
Output [1]: [event#41]
Batched: true
Location: InMemoryFileIndex [file:/data1/liyang/cppproject/spark/spark-3.3.2-bin-hadoop3/spark-warehouse/aj]
PushedFilters: [IsNotNull(event.event_info)]
ReadSchema: struct<event:struct<event_info:map<string,string>>>
(2) FilterExecTransformer
Input [1]: [event#41]
Arguments: (isnotnull(event#41.event_info) AND (event#41.event_info[action] = 13))
(3) ProjectExecTransformer
Output [1]: [event#41.event_info AS _extract_event_info#73]
Input [1]: [event#41]
(4) CHGenerateExecTransformer
Input [1]: [_extract_event_info#73]
Arguments: explode(split(coalesce(_extract_event_info#73[game_name], 0), ,, -1)), [_extract_event_info#73], false, [game_name#70]
(5) FilterExecTransformer
Input [2]: [_extract_event_info#73, game_name#70]
Arguments: (game_name#70 = xxx)
(6) ProjectExecTransformer
Output [2]: [game_name#70, CASE WHEN (_extract_event_info#73[tab_type] = 5) THEN 1 ELSE 0 END AS entrance#63]
Input [2]: [_extract_event_info#73, game_name#70]
(7) WholeStageCodegenTransformer (2)
Input [2]: [game_name#70, entrance#63]
Arguments: false
(8) CHNativeColumnarToRow
Input [2]: [game_name#70, entrance#63]
|
+----------------------------------------------------+
1 row selected (0.295 seconds)
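A quick way to confirm the pruning in the plans above is the scan's ReadSchema: it lists only event_info instead of the full seven-field event struct. A minimal sketch of such a check (assuming a running SparkSession `spark` and the `aj` table created above):

```scala
// Run the test query and inspect the executed plan's ReadSchema.
val df = spark.sql(
  """SELECT * FROM (
    |  SELECT game_name,
    |         CASE WHEN event.event_info['tab_type'] IN (5) THEN '1' ELSE '0' END AS entrance
    |  FROM aj
    |  LATERAL VIEW explode(split(country, ', ')) game_name AS game_name
    |  WHERE event.event_info['action'] IN (13)
    |) WHERE game_name = 'xxx'""".stripMargin)

val planText = df.queryExecution.executedPlan.toString
// With pruning, only the accessed nested field is read ...
assert(planText.contains("event_info:map<string,string>"))
// ... and unused struct fields such as log_extra never reach the scan.
assert(!planText.contains("log_extra"))
```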
@PHILO-HE I found that vanilla Spark's nested column pruning doesn't work for Project(Filter(Generate)) (refer to https://github.com/apache/incubator-gluten/pull/7869#issuecomment-2464408071), so I added a rule in the CH backend to support it. I'm curious whether Velox needs it as well?
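For context, such a rule is wired in through Spark's extensions API. Below is a minimal, hypothetical sketch of the injection mechanism only; the rule body is a placeholder for the Project(Filter(Generate)) shape this PR targets, not the actual CH implementation:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.{Filter, Generate, LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder rule: matches Project(Filter(Generate)). A real implementation
// would alias nested field accesses (e.g. event.event_info) into a Project
// pushed below the Generate, so the scan reads only the accessed sub-fields.
object NestedPruningThroughGenerate extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    case p @ Project(_, Filter(_, _: Generate)) =>
      p // the real rule rewrites p here; left unchanged in this sketch
  }
}

val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(_.injectOptimizerRule(_ => NestedPruningThroughGenerate))
  .getOrCreate()
```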
@taiyang-li, thanks so much for letting me know about this work! I think it should be applicable to the Velox backend. Maybe you can first get this PR merged, then try moving the proposed code into the common module in another PR and see whether any issue is reported for the Velox backend.
BTW, I noticed you introduced a dedicated config for the proposed optimization rule. It may be better to have a generic config in Gluten that allows excluding any optimization rule, like Spark's spark.sql.optimizer.excludedRules. If that makes sense to you, it's fine to do this small improvement in a separate PR. Thanks!
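For reference, Spark's own escape hatch looks like this; a Gluten-level analog could mirror the same shape (the excluded rule name below is purely illustrative, and Spark keeps any rule it considers necessary for correctness):

```scala
// Exclude optimizer rules by fully-qualified class name, comma-separated.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ConstantFolding")
```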
@PHILO-HE I'm really glad that Velox could also use it! I'll open another PR after this one is merged.
What changes were proposed in this pull request?
This PR adds an optimizer rule to the ClickHouse backend so that nested column pruning also applies to the Project(Filter(Generate)) plan shape, which vanilla Spark's pruning does not handle. With the rule in place, the ORC scan's ReadSchema shrinks to just the accessed nested field (event.event_info), as shown in the explain output above.

Fixes: #7868
How was this patch tested?
Tested manually through Beeline (see the session logs above) and by the Gluten ClickHouse CI runs on x86.