Spark: Support read with settings

ClickHouse / spark-clickhouse-connector

Spark ClickHouse Connector built on the DataSourceV2 API

https://clickhouse.com/docs/en/integrations/apache-spark

Apache License 2.0


Spark: Support read with settings #367

Open harryshi10 opened 2 weeks ago

harryshi10 commented 2 weeks ago

Summary

Allow reads with ClickHouse settings.

Closes #272

CLAassistant commented 2 weeks ago

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

mzitnik commented 1 week ago

@pan3793, do you think we also need to provide write.settings?

pan3793 commented 1 week ago

Code change LGTM. It would be great if you could provide a test case.

pan3793 commented 1 week ago

> @pan3793, do you think we also need to provide write.settings?

Yes, it could be implemented in another PR.

Additionally, SPARK-36680 (Spark 4.0) provides a more intuitive SQL syntax for this case:

SELECT * FROM $t1 WITH (split-size = 5)

harryshi10 commented 1 week ago

Sorry, I'm still a rookie at Scala, but I will try to write a unit test for this new feature.

mzitnik commented 1 week ago

@harryshi10 could you please sign the CLA?

harryshi10 commented 6 days ago

> @harryshi10 could you please sign the CLA?

Done.

harryshi10 commented 4 hours ago

> Code change LGTM. It would be great if you could provide a test case.

Sorry, I can’t provide a unit test, but here’s a test case I ran locally with PySpark.

Environment: ClickHouse 24.10.2.80, Spark 3.5.0

A SummingMergeTree with two records sharing the same key shows duplicates when queried without FINAL, but returns aggregated results when queried with FINAL.
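To make the expected behavior concrete, here is a minimal pure-Python model of the two query modes. It is only an illustration, not ClickHouse code: the rows, the key `k1`, and both function names are hypothetical, and it assumes the two rows have not yet been merged in the background.

```python
from collections import defaultdict

# Two hypothetical rows that share the same sorting key (not the real test data).
rows = [("k1", 10), ("k1", 5)]


def select_without_final(stored_rows):
    # Without FINAL, rows from unmerged parts come back as stored,
    # so both rows for the same key are visible.
    return list(stored_rows)


def select_with_final(stored_rows):
    # With FINAL, rows are merged at query time; for SummingMergeTree
    # the numeric column is summed per key.
    totals = defaultdict(int)
    for key, value in stored_rows:
        totals[key] += value
    return sorted(totals.items())


print(select_without_final(rows))
print(select_with_final(rows))
```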

In Spark, setting final=0 or final=1 in spark.clickhouse.read.settings controls this behavior: final=0 returns the non-aggregated rows, and final=1 returns the aggregated results.
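A rough sketch of how the setting might be applied from PySpark. The helper names `clickhouse_read_conf` and `read_table` are hypothetical, and the sketch assumes an existing SparkSession with a ClickHouse catalog already configured, and that the option can be set on the session at runtime.

```python
def clickhouse_read_conf(final):
    """Build the Spark conf entry that turns FINAL on or off for reads."""
    # The key is the one named in this PR; the value format follows the
    # final=0 / final=1 values used in the local test.
    return {"spark.clickhouse.read.settings": "final=1" if final else "final=0"}


def read_table(spark, table, final):
    """Read `table` through an already configured ClickHouse catalog.

    `spark` is assumed to be an existing SparkSession; setting the option
    on the session at runtime is an assumption of this sketch.
    """
    for key, value in clickhouse_read_conf(final).items():
        spark.conf.set(key, value)
    return spark.table(table)
```

With a session in hand, something like `read_table(spark, "clickhouse.db.tbl", final=True).show()` would be expected to return the aggregated rows; the table name here is a placeholder.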

I also tested that adding final=0 or 1 in spark.clickhouse.read.settings has no side effect on other engines, such as MergeTree.