Altinity / clickhouse-sink-connector

Replicate data from MySQL, Postgres and MongoDB to ClickHouse
https://www.altinity.com
Apache License 2.0
200 stars 48 forks

Improve bulk operations performance #574

Open ZlobnyiSerg opened 2 months ago

ZlobnyiSerg commented 2 months ago

I've tested an initial load on large datasets and got around 500 rows per second on a powerful PC. This is very slow compared to ClickHouse's built-in bulk operations with binary transfer support. Could you consider using the official ClickHouse connector for the data transfer? https://github.com/ClickHouse/clickhouse-kafka-connect/blob/main/src/main/java/com/clickhouse/kafka/connect/sink/db/ClickHouseWriter.java That way you could focus on the target database structure and reuse the official implementation for the transfer itself.

subkanthi commented 2 months ago

Thanks @ZlobnyiSerg, I will take a look. Did you find the throughput was better with the official ClickHouse connector (binary transfer)? Was the setup still using the Debezium source connector?

aadant commented 2 months ago

@ZlobnyiSerg if the source is MySQL you can do the initial bulk load as a consistent snapshot. See the python directory for utilities. I transferred 28 TB in a couple of hours, resulting in a CH database of 3 TB. mysql_dumper.py uses mysqlsh under the hood.
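For context, mysqlsh's parallel dump utility can be driven from the command line roughly like this. Host names, paths, and option spellings here are illustrative, not taken from the repo; check the MySQL Shell documentation and the python directory's mysql_dumper.py for the exact invocation this project uses:

```
# Dump one schema with MySQL Shell's parallel dump utility
# (host, path, and flag names are placeholders, not the repo's actual values).
mysqlsh root@source-host -- util dump-schemas mydb --outputUrl=/backups/mydb --threads=8
```

The resulting dump is then loaded into ClickHouse by the repo's python utilities rather than by mysqlsh itself.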

aadant commented 2 months ago

@subkanthi there are parameters to tune the insert throughput (https://github.com/Altinity/clickhouse-sink-connector/issues/576)

ZlobnyiSerg commented 2 months ago

> @ZlobnyiSerg if the source is MySQL you can do the initial bulk load as consistent snapshot. See the python directory for utilities. I transferred 28TB in a couple of hours, with a CH database of 3TB. mysql_dumper.py uses mysqlsh under the hood.

No, our source is Postgres, and manually copying initial snapshots is not suitable for us - we have a lot of tables and a lot of on-premise warehouses.

I haven't tested the ClickHouse connector, but I have tested bulk operations with my own C# "data pumper" - bulk performance with ClickHouse was very fast, though unfortunately I didn't record any figures.

I will try to experiment with the official connector later.

aadant commented 2 months ago

Insert performance is pretty good; I got several MB/s with:

```yaml
# Max number of records for the flush buffer.
buffer.max.records: "1000000"
# Maximum number of threads in the thread pool for processing CDC records.
thread.pool.size: 10
max.batch.size: 20000
max.queue.size: 100000
```
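To illustrate how settings like these typically interact, here is a minimal, hypothetical sketch of a buffer that accumulates records and hands fixed-size batches to a worker pool. The class, names, and flush logic are invented for illustration and are not the connector's actual code; in the real connector each flushed batch would become one bulk INSERT into ClickHouse:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: records accumulate until max.batch.size is reached,
// then the batch is submitted to a pool of thread.pool.size workers.
public class BatchBuffer {
    private final int maxBatchSize;
    private final ExecutorService pool;
    private final List<String> buffer = new ArrayList<>();
    private final AtomicInteger flushedBatches = new AtomicInteger();

    public BatchBuffer(int maxBatchSize, int threadPoolSize) {
        this.maxBatchSize = maxBatchSize;
        this.pool = Executors.newFixedThreadPool(threadPoolSize);
    }

    public synchronized void add(String record) {
        buffer.add(record);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    public synchronized void flush() {
        if (buffer.isEmpty()) return;
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        // In the real connector this is where one bulk INSERT would run.
        pool.submit(() -> flushedBatches.incrementAndGet());
    }

    public int shutdownAndCount() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return flushedBatches.get();
    }

    public static void main(String[] args) throws Exception {
        // 100000 records with max.batch.size = 20000 yields 5 batches.
        BatchBuffer b = new BatchBuffer(20000, 10);
        for (int i = 0; i < 100000; i++) b.add("row-" + i);
        b.flush(); // flush any remainder (none in this case)
        System.out.println("batches flushed: " + b.shutdownAndCount());
    }
}
```

The design point the parameters encode: larger batches amortize per-insert overhead (which suits ClickHouse's preference for few large inserts), while the thread pool keeps multiple batches in flight.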