ClickHouse / spark-clickhouse-connector

Spark ClickHouse Connector built on the DataSourceV2 API
https://clickhouse.com/docs/en/integrations/apache-spark
Apache License 2.0

Any benchmarks? #298

Open dolfinus opened 5 months ago

dolfinus commented 5 months ago

Hi.

Do you have any benchmarks for reading & writing data using the Spark Housepower connector vs. others, like the official JDBC driver?

Spark ClickHouse Connector is described as a high-performance connector, but for me it is actually slower than JDBC. For example, writing 32 GB of data (3 columns, 2 billion rows):

| Connector  | Partitions | Batch size | Time    |
|------------|-----------:|-----------:|--------:|
| JDBC       | 1500       | 2_000_000  | 6.7 min |
| JDBC       | 40         | 5_000_000  | 1.8 min |
| Housepower | 1500       | 2_000_000  | 11 min  |
| Housepower | 40         | 5_000_000  | 6.8 min |

Packages I've used:

maven_packages = [
    "com.github.housepower:clickhouse-spark-runtime-3.4_2.12:0.7.3",
    "com.clickhouse:clickhouse-jdbc:0.6.0",
    "com.clickhouse:clickhouse-client:0.6.0",
    "com.clickhouse:clickhouse-http-client:0.6.0",
    "org.apache.httpcomponents.client5:httpclient5:5.3.1",
]
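For reference, a minimal sketch of how a package list like this is typically wired into a session via `spark.jars.packages` (the app name and local master here are assumptions, not from the benchmark setup):

```python
from pyspark.sql import SparkSession

# Same coordinates as the list above; Spark resolves them from Maven Central
maven_packages = [
    "com.github.housepower:clickhouse-spark-runtime-3.4_2.12:0.7.3",
    "com.clickhouse:clickhouse-jdbc:0.6.0",
    "com.clickhouse:clickhouse-client:0.6.0",
    "com.clickhouse:clickhouse-http-client:0.6.0",
    "org.apache.httpcomponents.client5:httpclient5:5.3.1",
]

# App name and master are placeholders for illustration only
spark = (
    SparkSession.builder
    .appName("clickhouse-write-benchmark")
    .master("local[*]")
    .config("spark.jars.packages", ",".join(maven_packages))
    .getOrCreate()
)
```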

Config:

spark.conf.set("spark.sql.catalog.clickhouse", "xenon.clickhouse.ClickHouseCatalog")
spark.conf.set("spark.sql.catalog.clickhouse.host", "my.clickhouse.domain")
spark.conf.set("spark.sql.catalog.clickhouse.protocol", "http")
spark.conf.set("spark.sql.catalog.clickhouse.http_port", "40101")
spark.conf.set("spark.sql.catalog.clickhouse.user", "default")
spark.conf.set("spark.sql.catalog.clickhouse.password", "")
spark.conf.set("spark.sql.catalog.clickhouse.database", "default")
spark.conf.set("spark.sql.catalog.clickhouse.option.ssl", "false")
spark.conf.set("spark.sql.catalog.clickhouse.option.async", "false")
spark.conf.set("spark.sql.catalog.clickhouse.option.client_name", "onetl")
spark.conf.set("spark.sql.catalog.clickhouse.option.socket_keepalive", "true")
spark.conf.set("spark.clickhouse.ignoreUnsupportedTransform", "false")
spark.conf.set("spark.clickhouse.read.distributed.useClusterNodes", "false")
spark.conf.set("spark.clickhouse.read.distributed.convertLocal", "false")
spark.conf.set("spark.clickhouse.write.batchSize", 5_000_000)
spark.conf.set("spark.clickhouse.write.repartitionStrictly", "false")
spark.conf.set("spark.clickhouse.write.repartitionByPartition", "false")
spark.conf.set("spark.clickhouse.write.localSortByPartition", "false")
spark.conf.set("spark.clickhouse.write.localSortByKey", "false")
spark.conf.set("spark.clickhouse.write.distributed.useClusterNodes", "true")
spark.conf.set("spark.clickhouse.write.distributed.convertLocal", "false")
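For context, the two write paths being compared presumably looked roughly like this; `df`, the table name `benchmark`, and the JDBC URL (derived from the host/port in the config above) are assumptions:

```python
# `df` stands in for the 2-billion-row, 3-column DataFrame being written.

# Housepower connector path: write through the configured catalog
# (table name `default.benchmark` is an assumption)
df.writeTo("clickhouse.default.benchmark").append()

# JDBC path: same DataFrame, batch size matching the table above
(
    df.write
    .format("jdbc")
    .option("url", "jdbc:clickhouse://my.clickhouse.domain:40101/default")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .option("dbtable", "benchmark")
    .option("batchsize", 5_000_000)
    .mode("append")
    .save()
)
```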

pan3793 commented 5 months ago

No specific benchmarks, as Spark and ClickHouse usually run in large clusters.

There are some generic performance tuning tips mentioned in https://github.com/housepower/spark-clickhouse-connector/issues/265#issuecomment-1929474900

dolfinus commented 5 months ago

I've already set repartitionByPartition=false to avoid repartitioning on the connector side. In the Spark UI, all executors (40 in my case) got the same number of rows, so there was no data skew. Both the JDBC and Housepower connectors received the same DataFrame with the same distribution and number of partitions.