chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://doc.chdb.io
Apache License 2.0

Distributed query processing #86

Open danthegoodman1 opened 11 months ago

danthegoodman1 commented 11 months ago

Like distributed queries in ClickHouse, it would be great to have a (semi-)native way to process distributed queries.

For example, it would be useful to divide a list of Parquet files across multiple hosts running chDB in-process, and then have their results reduced down to the initial node that ran the query. This is somewhat possible manually (in theory).
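As a rough illustration of the file-splitting step, here is a minimal sketch in plain Python. The function name `assign_files`, the host names, and the file names are all hypothetical, and round-robin is only one possible assignment policy; nothing here is part of chDB.

```python
from typing import Dict, List


def assign_files(files: List[str], hosts: List[str]) -> Dict[str, List[str]]:
    """Assign files to hosts round-robin (hypothetical helper, not a chDB API)."""
    if not hosts:
        raise ValueError("need at least one host")
    plan: Dict[str, List[str]] = {host: [] for host in hosts}
    # Sort first so the plan is deterministic regardless of listing order.
    for i, path in enumerate(sorted(files)):
        plan[hosts[i % len(hosts)]].append(path)
    return plan


# Hypothetical file and host names, for illustration only.
files = [f"events_{i}.parquet" for i in range(5)]
plan = assign_files(files, ["host-a", "host-b"])
```

Each host would then run its part of the query over only the files in its entry of `plan`.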

Decisions such as which files go to which hosts could be left to the developer, but the map-reduce across hosts is the functionality that would be ideal to have natively. Even if the data were passed through binary RPC calls that the developer has to implement, telling chDB that a query is partial, and that it must deliver results that can be aggregated down on a single final worker, is something that would need to be in chDB itself.
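To make the "partial query" idea concrete, here is a minimal sketch in plain Python of why workers need to return mergeable intermediate state rather than final values. It assumes a grouped average: each worker returns a (sum, count) pair per key, and the final worker merges those pairs. The function names and sample data are hypothetical, and a thread pool stands in for remote hosts. ClickHouse's `-State` and `-Merge` aggregate function combinators play a similar role inside ClickHouse; whether they cover this use case across chDB processes is not something this sketch establishes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, float]                  # (group key, value)
Partial = Dict[str, Tuple[float, int]]   # key -> (running sum, row count)


def map_partial(rows: List[Row]) -> Partial:
    """Worker side: produce mergeable state, not a final average."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for key, value in rows:
        sums[key] += value
        counts[key] += 1
    return {key: (sums[key], counts[key]) for key in sums}


def reduce_partials(partials: Iterable[Partial]) -> Dict[str, float]:
    """Final worker: merge the partial states, then finalize the average."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for partial in partials:
        for key, (s, n) in partial.items():
            sums[key] += s
            counts[key] += n
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical shards; each list stands in for the data held by one host.
shards = [
    [("a", 1.0), ("a", 2.0), ("b", 10.0)],
    [("a", 6.0)],
    [("b", 20.0)],
]
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_partial, shards))
result = reduce_partials(partials)
```

If each worker returned a finished average instead, the final worker could not combine them correctly: averaging the per-shard averages for key `a` (1.5 and 6.0) gives 3.75, while the true average over all three rows is 3.0.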

blackrez commented 9 months ago

Hello, I'm experimenting with chDB and Celery for distributed and concurrent queries in a web application, and it works great. I think that, like Dremel, maybe you should add a query planner to distribute data across different nodes.