ClickHouse / clickhouse-connect

Python driver/sqlalchemy/superset connectors
Apache License 2.0

Improve Insert Performance #106

Open · genzgd opened 1 year ago

genzgd commented 1 year ago

Insert data transformation is currently all in Python. It should be moved to C/Cython where it is expensive. It should also be possible to optimize Numpy/Pandas inserts by reading the Numpy buffer directly for int/float datatypes (see the sketch after the task list below).

### Tasks
- [x] https://github.com/ClickHouse/clickhouse-connect/issues/201
- [ ] Direct copy of numpy arrays to output buffer
- [ ] Buffered bytearray destination for all writes to reduce allocations
- [ ] Datetime optimizations in C
- [ ] Decimal optimizations in C
- [x] Dynamically calculate block size
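
A minimal sketch of the "direct copy of numpy arrays" idea above, assuming a contiguous, little-endian, null-free Int64 column; the function name and buffer handling are hypothetical, not the library's actual internals:

```python
import numpy as np

def write_int64_column(arr: np.ndarray, dest: bytearray) -> None:
    """Hypothetical fast path: copy the raw numpy buffer straight into the
    output buffer instead of converting every value to a Python int.

    Assumes a contiguous little-endian int64 array with no nulls, matching
    the binary layout of an Int64 column in ClickHouse native format.
    """
    if arr.dtype != np.dtype('<i8'):
        arr = arr.astype('<i8')                   # single vectorized cast as a fallback
    dest += np.ascontiguousarray(arr).tobytes()   # one memcpy, no per-row Python work

# usage sketch
buf = bytearray()
write_int64_column(np.arange(1_000_000, dtype=np.int64), buf)
```
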
carterjfulcher commented 1 year ago

I second this. When inserting large data (1 GB+), it takes way longer than it should.

genzgd commented 1 year ago

@carterjfulcher Just curious, how are you inserting the data, and how long does it take? Are you using Pandas or Numpy? Do you have a lot of date objects? (Native Python dates are quite inefficient imo).

Python is, you know, a slow interpreted language (and the bulk of the clickhouse-connect insert code is still pure Python), so ClickHouse as a company doesn't generally recommend Python clients for high-volume inserts. In my experience, high-volume workloads tend to be done in custom applications using Java, Scala, Rust, or Go.

That's a roundabout way of saying that while I am always interested in improving performance, if that's your primary goal I would look at your tech stack first, and not necessarily wait for additional Python optimizations.

chriscomeau79 commented 10 months ago

I've generally been doing all my high-throughput data engineering work in Python with Arrow, using libraries like duckdb and arrow-odbc, which use C and Rust under the hood.

The example here gets about 1M rows/sec (the choice between Arrow and ArrowStream comes down to convenience and available memory): https://github.com/ClickHouse/clickhouse-connect/issues/267
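
For reference, a rough sketch of an Arrow-based insert with clickhouse-connect, assuming the client's insert_arrow method; the host, table name, and columns are hypothetical and the target table is assumed to already exist with a matching schema:

```python
import pyarrow as pa
import clickhouse_connect

# Hypothetical connection and table name.
client = clickhouse_connect.get_client(host='localhost')

n = 1_000_000
arrow_table = pa.table({
    'id': pa.array(range(n), type=pa.int64()),
    'value': pa.array([i * 0.5 for i in range(n)], type=pa.float64()),
})

# insert_arrow ships the Arrow columns to ClickHouse without first expanding
# them into per-row Python objects.
client.insert_arrow('bench_arrow', arrow_table)
```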

0liu commented 3 months ago

Just did a Pandas dataframe insertion performance comparison between clickhouse-connect and clickhouse-driver. The table/dataframe has 71.9M rows and 17 columns (3 String, 1 Date, 1 DateTime64, plus UInt, Int, and Float64 columns). For querying the whole table, clickhouse-connect is about 14x faster: 31 secs vs. 476 secs for clickhouse-driver. For inserting the whole dataframe, clickhouse-driver is 3.2x faster: 65 secs vs. 210 secs for clickhouse-connect. clickhouse-driver has some limits, such as not supporting Array in numpy mode and not supporting nanoseconds outside numpy mode. Really hope clickhouse-connect can improve insertion performance.
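
For context, a minimal sketch of how such a side-by-side dataframe insert benchmark might look, assuming a hypothetical table named bench_table on a local server; this is not the commenter's actual script:

```python
import time
import pandas as pd
import clickhouse_connect
from clickhouse_driver import Client as NativeClient

# Hypothetical table and column names; assumes both clients point at the same
# ClickHouse server and that the dataframe matches the target table's schema.
df = pd.DataFrame({'id': range(1_000_000), 'value': [0.0] * 1_000_000})

cc_client = clickhouse_connect.get_client(host='localhost')
start = time.monotonic()
cc_client.insert_df('bench_table', df)                      # clickhouse-connect dataframe insert
print(f'clickhouse-connect: {time.monotonic() - start:.1f}s')

native = NativeClient('localhost', settings={'use_numpy': True})  # numpy mode needed for dataframe inserts
start = time.monotonic()
native.insert_dataframe('INSERT INTO bench_table VALUES', df)
print(f'clickhouse-driver:  {time.monotonic() - start:.1f}s')
```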