chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/chdb
Apache License 2.0
2.15k stars 75 forks source link

Questions regarding zero-copy #187

Open rschu1ze opened 10 months ago

rschu1ze commented 10 months ago

Hello,

I browsed the source code of chDB today (having read DuckDBs papers first) and had two questions:

As of today, chDB takes over the final query result (which is presumably refcounted) for further processing (see LocalServer.cpp).

  1. Is it planned (or even possible) to have a more fine-granular mechanism for handing over results based on the data chunks used internally for query processing? In DuckDB, the application can fetch individual result chunks by triggering pull() on the execution plan.

  2. Likewise, I did not find a zero-copy mechanism for source data, meaning that right now the host process must first write the data to process to a local file and then let the embedded ClickHouse load it via the file engine. Did I miss something?

Thanks!

auxten commented 10 months ago

Thanks for your great questions!

  1. The current implementation can only read all the returned data once. However, it does not yet support a smooth scanning of results using a cursor. I will attempt to address this in the next release.
  2. Another insightful point is that zero-copy only works in Python when handling returned data with memoryview.
    • When querying Parquet, CSV, or any other files using something like SELECT * FROM file('path', Parquet), there is no unnecessary data copy.
    • However, when querying "memory" objects such as DataFrame, ArrowTable, or data returned by chDB, the current implementation is just "okay to work." It writes the data into a temporary file or memfd and then queries it as a file. This approach is also unsightly and I will attempt to improve it.

Regarding the last question, I would like to share some thoughts as well. The main idea is to use Arrow as the in-memory data type. Currently, I am researching two ways to achieve this:

  1. Develop new file-related APIs that enable ClickHouse engine to read memory like a file.
  2. Create a new storage type that allows ClickHouse to read Arrow buffer from memory.

I haven't decide which way is better. Welcome to discuss with us.

rschu1ze commented 10 months ago

Thanks, that is very helpful. Great to hear that the points are known and being improved.

I don't have much insights in the internals of chDB. About importing data efficiently: ClickHouse supports lots of input/output format, e.g. cat filename.orc | clickhouse-client --query="INSERT INTO some_table FORMAT ORC". Perhaps it is an option to add another special memory-view-like I/O format which is passed a pointer + size and which internally assumes the data is encoded as Arrow format.

SChakravorti21 commented 8 months ago

Just throwing in my $0.02 in case it's of any use. Regarding this part of the discussion:

Currently, I am researching two ways to achieve this:

  • Develop new file-related APIs that enable ClickHouse engine to read memory like a file.
  • Create a new storage type that allows ClickHouse to read Arrow buffer from memory.

Perhaps one option is to use Arrow's C data interface or C stream interface. These allow Arrow buffers to be shared across language boundaries in a zero-copy manner within a single process. If I understand correctly, this is how some engines like Polars and DuckDB already handle querying in-memory Arrow tables today.

I don't know anything about the internals of ClickHouse, but maybe this approach could make it easier/cleaner to implement a custom "storage type" as you mention. And vice versa, the C stream interface might help with exposing results as a stream of Arrow record batches.

auxten commented 8 months ago

Just throwing in my $0.02 in case it's of any use. Regarding this part of the discussion:

Currently, I am researching two ways to achieve this:

  • Develop new file-related APIs that enable ClickHouse engine to read memory like a file.

  • Create a new storage type that allows ClickHouse to read Arrow buffer from memory.

Perhaps one option is to use Arrow's C data interface or C stream interface. These allow Arrow buffers to be shared across language boundaries in a zero-copy manner within a single process. If I understand correctly, this is how some engines like Polars and DuckDB already handle querying in-memory Arrow tables today.

I don't know anything about the internals of ClickHouse, but maybe this approach could make it easier/cleaner to implement a custom "storage type" as you mention. And vice versa, the C stream interface might help with exposing results as a stream of Arrow record batches.

Thanks for your great advice. I'm researching on it.