Open rschu1ze opened 10 months ago
Thanks for your great questions!
SELECT * FROM file('path', Parquet)
, there is no unnecessary data copy.Regarding the last question, I would like to share some thoughts as well. The main idea is to use Arrow as the in-memory data type. Currently, I am researching two ways to achieve this:
I haven't decide which way is better. Welcome to discuss with us.
Thanks, that is very helpful. Great to hear that the points are known and being improved.
I don't have much insights in the internals of chDB. About importing data efficiently: ClickHouse supports lots of input/output format, e.g. cat filename.orc | clickhouse-client --query="INSERT INTO some_table FORMAT ORC"
. Perhaps it is an option to add another special memory-view-like I/O format which is passed a pointer + size and which internally assumes the data is encoded as Arrow format.
Just throwing in my $0.02 in case it's of any use. Regarding this part of the discussion:
Currently, I am researching two ways to achieve this:
- Develop new file-related APIs that enable ClickHouse engine to read memory like a file.
- Create a new storage type that allows ClickHouse to read Arrow buffer from memory.
Perhaps one option is to use Arrow's C data interface or C stream interface. These allow Arrow buffers to be shared across language boundaries in a zero-copy manner within a single process. If I understand correctly, this is how some engines like Polars and DuckDB already handle querying in-memory Arrow tables today.
I don't know anything about the internals of ClickHouse, but maybe this approach could make it easier/cleaner to implement a custom "storage type" as you mention. And vice versa, the C stream interface might help with exposing results as a stream of Arrow record batches.
Just throwing in my $0.02 in case it's of any use. Regarding this part of the discussion:
Currently, I am researching two ways to achieve this:
Develop new file-related APIs that enable ClickHouse engine to read memory like a file.
Create a new storage type that allows ClickHouse to read Arrow buffer from memory.
Perhaps one option is to use Arrow's C data interface or C stream interface. These allow Arrow buffers to be shared across language boundaries in a zero-copy manner within a single process. If I understand correctly, this is how some engines like Polars and DuckDB already handle querying in-memory Arrow tables today.
I don't know anything about the internals of ClickHouse, but maybe this approach could make it easier/cleaner to implement a custom "storage type" as you mention. And vice versa, the C stream interface might help with exposing results as a stream of Arrow record batches.
Thanks for your great advice. I'm researching on it.
Hello,
I browsed the source code of chDB today (having read DuckDBs papers first) and had two questions:
As of today, chDB takes over the final query result (which is presumably refcounted) for further processing (see LocalServer.cpp).
Is it planned (or even possible) to have a more fine-granular mechanism for handing over results based on the data chunks used internally for query processing? In DuckDB, the application can fetch individual result chunks by triggering
pull()
on the execution plan.Likewise, I did not find a zero-copy mechanism for source data, meaning that right now the host process must first write the data to process to a local file and then let the embedded ClickHouse load it via the file engine. Did I miss something?
Thanks!