chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/docs/en/chdb
Apache License 2.0
2.01k stars 72 forks source link

Support complex type in Python Table Engine like Array and Map, Object #251

Open auxten opened 1 month ago

auxten commented 1 month ago

Code snippet to reproduce:

df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_community_reports.parquet',
                     columns=["id","community","level","title","summary", "findings","rank","rank_explanation"])

batched_import(statement, df)
df.head(2)

df.dtypes

import chdb

df=df.astype({'id': 'string'})
df2=df[["id","level","rank"]]
chdb.query("FROM Python(df2) SELECT max(id),sum(level),avg(rank)  group by all").show()
df.dtypes

parquet-test-data.zip

auxten commented 1 month ago

Data seems contain dict type, debuging

                                      id community  level  ...                                           findings rank                                   rank_explanation
0   e7822326-4da8-4954-afa9-be7f4f5791a5        42      2  ...  [{'explanation': 'Marley's Ghost plays a cruci...  8.0  The impact severity rating is high due to the ...
1   8a5afac1-99ef-4f01-a1b1-f044ce392ff9        43      2  ...  [{'explanation': 'The Ghost, identified at tim...  8.5  The impact severity rating is high due to the ...
2   83fc3bf3-ce93-4feb-8bb1-b278b062c672        44      2  ...  [{'explanation': 'Mr. Scrooge's character unde...  8.5  The high impact severity rating reflects the p...
3   86d8af0d-738c-45b4-a36c-ec9cd25b56dd        45      2  ...  [{'explanation': 'Bob Cratchit is depicted as ...  7.5  The impact severity rating is relatively high ...
4   77882794-545d-4aba-98c7-2f1294a9e8df        46      2  ...  [{'explanation': 'Martha is a central figure i...  7.5  The impact severity rating is moderately high ...
5   829803bc-0928-4825-9c5a-d07284f8309d        41      2  ...  [{'explanation': 'Ebenezer Scrooge's story beg...  8.5  The high impact severity rating reflects the p...
6   834f9502-3bfd-4500-91c5-0cb14a7ae254        20      1  ...  [{'explanation': 'Jacob Marley's appearance to...  7.5  The impact severity rating is high due to the ...
7   1df0b304-686f-4ac0-92ce-5a9244ad36cf        21      1  ...  [{'explanation': 'The Ghost, often interpreted...  8.5  The impact severity rating is high due to The ...
8   24802e92-18e3-4638-90d0-3872159110dd        22      1  ...  [{'explanation': 'The firm 'Scrooge and Marley...  4.0  The impact severity rating is moderate due to ...
9   bd71ffb7-2324-4987-9947-dc3c49809e18        23      1  ...  [{'explanation': 'Scrooge wields considerable ...  4.0  The impact severity rating is moderate, reflec...
auxten commented 1 month ago

Currently, ClickHouse will read the column "findings" as Array(Tuple(explanation Nullable(String), summary Nullable(String))) from parquet. If the Semistructured Columns is done. Maybe chDB can use JSON type for this kind of column which is more generic and easy to handle in Python.

So, I will wait for https://github.com/ClickHouse/ClickHouse/pull/66444