chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/docs/en/chdb
Apache License 2.0

CHDB is significantly slower on Arrow tables (in-memory) than with CSV / Parquet #195

Closed by ilyanoskov 1 month ago

ilyanoskov commented 7 months ago

I recently had a case where I had to process a Pandas DataFrame with 70M rows and 5 simple columns, using window functions and GROUP BY operations.

After saving this data to CSV / Parquet and processing the file, chDB computed the results in 4-5 seconds; when operating over the in-memory Arrow table, it took close to 30 seconds.

Steps to reproduce are simple: create a dataframe of random data with 5 columns (id, time, val1, val2, val3) and 70M rows, then perform complex GROUP BY / window operations on it. Save the dataframe to a file and run the same queries over the file; you will see that querying the file is significantly faster. A rough sketch of such a reproduction is below.
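A minimal reproduction sketch, assuming the `Python()` table function from chdb v2.x (mentioned later in this thread) for the in-memory path and ClickHouse's `file()` table function for the Parquet path; the random values and the aggregate query are illustrative, not the exact workload from the report.

```python
import time

import numpy as np
import pandas as pd
import chdb

N = 70_000_000  # 70M rows, as in the report

# Random data over the 5 columns described above
df = pd.DataFrame({
    "id": np.random.randint(0, 1000, N),
    "time": np.random.randint(0, 1_000_000, N),
    "val1": np.random.rand(N),
    "val2": np.random.rand(N),
    "val3": np.random.rand(N),
})

agg = "SELECT id, sum(val1), avg(val2), max(val3) FROM {src} GROUP BY id"

# In-memory path: chDB reads the DataFrame via the Python() table function
start = time.time()
chdb.query(agg.format(src="Python(df)"), "CSV")
print("in-memory:", round(time.time() - start, 2), "s")

# File path: the same query over a Parquet copy of the data
df.to_parquet("data.parquet")
start = time.time()
chdb.query(agg.format(src="file('data.parquet', Parquet)"), "CSV")
print("parquet:  ", round(time.time() - start, 2), "s")
```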

I would have imagined that working with in-memory Arrow tables would be faster, since accessing memory is faster than accessing disk?

auxten commented 7 months ago

This is discussed in #187. I'm working on it.

auxten commented 2 months ago

A faster query path for Arrow tables is implemented in v2.0.0b1. Example: https://github.com/chdb-io/chdb/blob/main/tests/test_query_py.py#L94
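For reference, a small sketch of what that fast path looks like from Python, loosely modeled on the linked test; the `Python()` table function and its resolution of the variable name from the calling scope are assumptions based on the chdb v2.x examples.

```python
import pyarrow as pa
import chdb

# A tiny Arrow table standing in for the 70M-row dataset
arrow_table = pa.table({
    "id":   [1, 2, 2, 3],
    "val1": [0.1, 0.2, 0.3, 0.4],
})

# chDB looks up the variable named inside Python() in the caller's scope
res = chdb.query(
    "SELECT id, sum(val1) AS total FROM Python(arrow_table) GROUP BY id ORDER BY id",
    "CSV",
)
print(res)
```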