Kyligence / ClickHouse

ClickHouse® is a free analytics DBMS for big data
https://clickhouse.com
Apache License 2.0
14 stars 17 forks source link

Allow nullable complex types when writing into parquet/orc #452

Closed taiyang-li closed 1 year ago

taiyang-li commented 1 year ago

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Allow nullable complex types when writing into parquet/orc. close https://github.com/oap-project/gluten/issues/2466

kyligence-git commented 1 year ago

Can one of the admins verify this patch?

taiyang-li commented 1 year ago
0: jdbc:hive2://localhost:10000/> set spark.gluten.sql.native.parquet.writer.enabled = true; 
+-------------------------------------------------+--------+
|                       key                       | value  |
+-------------------------------------------------+--------+
| spark.gluten.sql.native.parquet.writer.enabled  | true   |
+-------------------------------------------------+--------+
1 row selected (1.22 seconds)
0: jdbc:hive2://localhost:10000/> CREATE TABLE t(
. . . . . . . . . . . . . . . . >   id INT,
. . . . . . . . . . . . . . . . >   info STRUCT<name:STRING, age:INT>,
. . . . . . . . . . . . . . . . >   data MAP<STRING, INT>,
. . . . . . . . . . . . . . . . >   values ARRAY<INT>
. . . . . . . . . . . . . . . . > ) stored as parquet;
+---------+
| Result  |
+---------+
+---------+
No rows selected (3.062 seconds)
0: jdbc:hive2://localhost:10000/> INSERT overwrite t VALUES
. . . . . . . . . . . . . . . . >   (1, struct('John', 25), map('A', 10, 'B', 20), array(1.0, 2.0, 3.0)),
. . . . . . . . . . . . . . . . >   (2, struct('Alice', 30), map('C', 15, 'D', 25), array(4.0, 5.0, 6.0)),
. . . . . . . . . . . . . . . . >   (3, struct('Bob', 35), map('E', 12, 'F', 18), array(7.0, 8.0, 9.0)),
. . . . . . . . . . . . . . . . >   (4, struct('Jane', 40), map('G', 22, 'H', 30), array(10.0, 11.0, 12.0)),
. . . . . . . . . . . . . . . . >   (5, struct('Kate', 45), map('I', 17, 'J', 28), array(13.0, 14.0, 15.0)),
. . . . . . . . . . . . . . . . >   (6, null, null, null),
. . . . . . . . . . . . . . . . >   (7, struct('Tank', 20), map('X', null, 'Y', null), array(1.0, 2.0, 3.0));
+---------+
| Result  |
+---------+
+---------+
No rows selected (3.917 seconds)
0: jdbc:hive2://localhost:10000/> select * from t; 
+-----+---------------------------+----------------------+-------------+
| id  |           info            |         data         |   values    |
+-----+---------------------------+----------------------+-------------+
| 1   | {"name":null,"age":null}  | {"A":10,"B":20}      | [1,2,3]     |
| 2   | {"name":null,"age":null}  | {"C":15,"D":25}      | [4,5,6]     |
| 5   | {"name":null,"age":null}  | {"I":17,"J":28}      | [13,14,15]  |
| 6   | NULL                      | NULL                 | NULL        |
| 7   | {"name":null,"age":null}  | {"X":null,"Y":null}  | [1,2,3]     |
| 3   | {"name":null,"age":null}  | {"E":12,"F":18}      | [7,8,9]     |
| 4   | {"name":null,"age":null}  | {"G":22,"H":30}      | [10,11,12]  |
+-----+---------------------------+----------------------+-------------+
7 rows selected (1.32 seconds)
$ cd spark-warehouse/t 
$ parquet-tools csv ./part-000*                                                           
id,info,data,values
1,"{'1': 'John', '2': 25}","[('A', 10), ('B', 20)]",[1 2 3]
2,"{'1': 'Alice', '2': 30}","[('C', 15), ('D', 25)]",[4 5 6]
3,"{'1': 'Bob', '2': 35}","[('E', 12), ('F', 18)]",[7 8 9]
4,"{'1': 'Jane', '2': 40}","[('G', 22), ('H', 30)]",[10 11 12]
5,"{'1': 'Kate', '2': 45.0}","[('I', 17.0), ('J', 28.0)]",[13 14 15]
6,,,
7,"{'1': 'Tank', '2': 20.0}","[('X', None), ('Y', None)]",[1 2 3]
taiyang-li commented 1 year ago

@binmahone @liuneng1994 @zzcclp Can you review this pr

lgbo-ustc commented 1 year ago

LGTM