ClickHouse / spark-clickhouse-connector

Spark ClickHouse Connector built on the DataSourceV2 API
https://clickhouse.com/docs/en/integrations/apache-spark
Apache License 2.0

Any benchmarks? #298

Open dolfinus opened 5 months ago

dolfinus commented 5 months ago

Hi.

Do you have any benchmarks for reading & writing data using the Spark Housepower connector vs. others, like the official JDBC driver?

Spark ClickHouse Connector is described as a high-performance connector, but for me it is actually slower than JDBC. For example, writing 32 GB of data (3 columns, 2 billion rows):

| Connector  | Partitions | Batch size | Time    |
|------------|-----------:|-----------:|--------:|
| JDBC       | 1500       | 2_000_000  | 6.7 min |
| JDBC       | 40         | 5_000_000  | 1.8 min |
| Housepower | 1500       | 2_000_000  | 11 min  |
| Housepower | 40         | 5_000_000  | 6.8 min |

Packages I've used:

maven_packages = [
    "com.github.housepower:clickhouse-spark-runtime-3.4_2.12:0.7.3",
    "com.clickhouse:clickhouse-jdbc:0.6.0",
    "com.clickhouse:clickhouse-client:0.6.0",
    "com.clickhouse:clickhouse-http-client:0.6.0",
    "org.apache.httpcomponents.client5:httpclient5:5.3.1",
]
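For reference, a minimal sketch of how a package list like this is typically wired into a session via `spark.jars.packages` (the app name and local master here are assumptions, not from the benchmark setup):

```python
from pyspark.sql import SparkSession

# Same coordinates as the list above; Spark resolves them from Maven Central
maven_packages = [
    "com.github.housepower:clickhouse-spark-runtime-3.4_2.12:0.7.3",
    "com.clickhouse:clickhouse-jdbc:0.6.0",
    "com.clickhouse:clickhouse-client:0.6.0",
    "com.clickhouse:clickhouse-http-client:0.6.0",
    "org.apache.httpcomponents.client5:httpclient5:5.3.1",
]

# App name and master are placeholders for illustration only
spark = (
    SparkSession.builder
    .appName("clickhouse-write-benchmark")
    .master("local[*]")
    .config("spark.jars.packages", ",".join(maven_packages))
    .getOrCreate()
)
```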

Config:

spark.conf.set("spark.sql.catalog.clickhouse", "xenon.clickhouse.ClickHouseCatalog")
spark.conf.set("spark.sql.catalog.clickhouse.host", "my.clickhouse.domain")
spark.conf.set("spark.sql.catalog.clickhouse.protocol", "http")
spark.conf.set("spark.sql.catalog.clickhouse.http_port", "40101")
spark.conf.set("spark.sql.catalog.clickhouse.user", "default")
spark.conf.set("spark.sql.catalog.clickhouse.password", "")
spark.conf.set("spark.sql.catalog.clickhouse.database", "default")
spark.conf.set("spark.sql.catalog.clickhouse.option.ssl", "false")
spark.conf.set("spark.sql.catalog.clickhouse.option.async", "false")
spark.conf.set("spark.sql.catalog.clickhouse.option.client_name", "onetl")
spark.conf.set("spark.sql.catalog.clickhouse.option.socket_keepalive", "true")
spark.conf.set("spark.clickhouse.ignoreUnsupportedTransform", "false")
spark.conf.set("spark.clickhouse.read.distributed.useClusterNodes", "false")
spark.conf.set("spark.clickhouse.read.distributed.convertLocal", "false")
spark.conf.set("spark.clickhouse.write.batchSize", 5_000_000)
spark.conf.set("spark.clickhouse.write.repartitionStrictly", "false")
spark.conf.set("spark.clickhouse.write.repartitionByPartition", "false")
spark.conf.set("spark.clickhouse.write.localSortByPartition", "false")
spark.conf.set("spark.clickhouse.write.localSortByKey", "false")
spark.conf.set("spark.clickhouse.write.distributed.useClusterNodes", "true")
spark.conf.set("spark.clickhouse.write.distributed.convertLocal", "false")
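For context, the two write paths being compared presumably looked roughly like this; `df`, the table name `benchmark`, and the JDBC URL (derived from the host/port in the config above) are assumptions:

```python
# `df` stands in for the 2-billion-row, 3-column DataFrame being written.

# Housepower connector path: write through the configured catalog
# (table name `default.benchmark` is an assumption)
df.writeTo("clickhouse.default.benchmark").append()

# JDBC path: same DataFrame, batch size matching the table above
(
    df.write
    .format("jdbc")
    .option("url", "jdbc:clickhouse://my.clickhouse.domain:40101/default")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .option("dbtable", "benchmark")
    .option("batchsize", 5_000_000)
    .mode("append")
    .save()
)
```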

pan3793 commented 5 months ago

No specific benchmarks, as Spark and ClickHouse usually run in large clusters.

There are some generic performance tuning tips mentioned in https://github.com/housepower/spark-clickhouse-connector/issues/265#issuecomment-1929474900

dolfinus commented 5 months ago

I've already set repartitionByPartition=false to avoid repartitioning on the connector side. In the Spark UI, all executors (40 in my case) got the same number of rows, so there was no data skew. Both the JDBC and Housepower connectors received the same DataFrame with the same distribution and number of partitions.