chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/chdb
Apache License 2.0

Add support for arrow stream #265

Open djouallah opened 2 months ago

djouallah commented 2 months ago

First, congratulations on the progress you've made; chDB is substantially better than it was just 6 months ago. I am trying to read a folder of CSV files and export it to Delta. Currently I am using `df = sess.sql(sql, "ArrowTable")` to transfer the data to the deltalake Python library, but I am getting OOM errors. It would be nice if you could add support for Arrow RecordBatch so the transfer is done in smaller batches.

thanks
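
For context, here is a minimal sketch of the flow described above. The session setup, source path, and table location are placeholder assumptions; the `ArrowTable` call is the one from this report:

```python
import chdb.session as chs
from deltalake import write_deltalake

sess = chs.Session()
# Placeholder: read a folder of CSV files via ClickHouse's file() table function.
sql = "SELECT * FROM file('/data/csv/*.csv', 'CSVWithNames')"

# Current approach: the entire result set is materialized as one
# pyarrow.Table in memory before being handed to write_deltalake,
# which is what triggers the OOM on large inputs.
df = sess.sql(sql, "ArrowTable")
write_deltalake("/lakehouse/default/Tables/demo", df, mode="append")
```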

djouallah commented 2 months ago

@auxten how do you get a schema when using this?

```python
from deltalake import write_deltalake

df = sess.sql(sql, "ArrowStream")
write_deltalake(
    f"/lakehouse/default/Tables/T{total_files}/chdb",
    df,
    mode="append",
    partition_by=["year"],
    storage_options=storage_options,
)
```
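
For what it's worth, if `"ArrowStream"` handed back a `pyarrow.RecordBatchReader` (which is effectively what this issue is requesting), the schema would travel with the reader, and `write_deltalake` can consume such a reader batch by batch. A hypothetical sketch, reusing `sess` and `sql` from the comments above:

```python
from deltalake import write_deltalake

# Hypothetical: assumes "ArrowStream" returns a pyarrow.RecordBatchReader.
reader = sess.sql(sql, "ArrowStream")
print(reader.schema)  # the reader exposes the result schema up front

# write_deltalake accepts a RecordBatchReader directly, consuming it
# batch by batch rather than as one giant in-memory table.
write_deltalake("/lakehouse/default/Tables/demo", reader, mode="append")
```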
auxten commented 2 months ago

I understand that what you’re trying to do is retrieve the output schema and then stream the data into Delta Lake.

  1. Regarding retrieving the schema, I believe it can be obtained by setting the output format to JSON, ArrowTable, DataFrame, etc. However, for large data volumes a `LIMIT` should be applied so the probe query stays cheap (see the sketch after this list).
  2. Currently, chDB's implementation loads the entire dataset into memory before any further processing, which can lead to OOM (out of memory) errors on large data volumes. This is a point that needs improvement, and I will schedule it for future development.
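
A hedged sketch of the probe described in point 1, assuming the `sess` and `sql` from the earlier comments; a small `LIMIT` keeps the query cheap while still returning the column metadata:

```python
# Fetch a single row as a pyarrow.Table just to read the schema.
probe = sess.sql(f"SELECT * FROM ({sql}) LIMIT 1", "ArrowTable")
schema = probe.schema  # pyarrow.Schema of the full result
print(schema)
```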
djouallah commented 2 months ago

I added chDB to my ETL benchmarks; feel free to have a look and tell me if I am doing something terribly wrong: https://github.com/djouallah/Fabric_Notebooks_Demo/blob/main/ETL/Light_ETL_Python_Notebook.ipynb