ClickHouse / clickhouse-java

ClickHouse Java Clients & JDBC Driver
https://clickhouse.com
Apache License 2.0

Pass ClickHouse Skip Index parameters while creating Table in ClickHouse from Spark #1093

Open swarupsarangi113 opened 2 years ago

swarupsarangi113 commented 2 years ago

What am I doing?

I am currently writing a dataframe into ClickHouse from PySpark using the ClickHouse JDBC driver. The table does not exist beforehand; it is created dynamically while pushing the dataframe to ClickHouse.

Issue

While writing the dataframe into ClickHouse, I want to pass the skip-index definition below:

INDEX modifiedon_9854 modifiedon TYPE minmax GRANULARITY 4

The following code snippet does not work:

df.write \
            .format("jdbc") \
            .mode("overwrite") \
            .option("driver", "com.github.housepower.jdbc.ClickHouseDriver") \
            .option("url", self.parameters["connection_properties"]["url"]) \
            .option("createTableOptions", "index modifiedon_9854 modifiedon type minmax granularity 4"
                                          "engine=MergeTree()"
                                          "primary key (prospectid)"
                                          " order by (prospectid,toYYYYMM(createdon), toYYYYMM(modifiedon))"
                                          " partition by (prospectstage)") \
            .option("user", "...") \
            .option("password", "...") \
            .option("dbtable", "...") \
            .option("batchsize", "20000") \
            .option("truncate", "true") \
            .save()
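As a side note, Python's implicit string-literal concatenation joins the adjacent literals in the `createTableOptions` value with no separator, which is why the error below shows the clauses mashed together. A minimal sketch:

```python
# Adjacent string literals are concatenated verbatim, so missing leading
# spaces produce "granularity 4engine=MergeTree()primary key ..." --
# exactly the run-together text visible in the exception message.
opts = ("index modifiedon_9854 modifiedon type minmax granularity 4"
        "engine=MergeTree()"
        "primary key (prospectid)")
print(opts)
```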

Error Thrown

Caused by: com.github.housepower.exception.ClickHouseSQLException: DB::ExceptionDB::Exception: Syntax error: failed at position 237 ('index'): index modifiedon_9854 modifiedon type minmax granularity 4engine=MergeTree()primary key (prospectid) order by (prospectid,toYYYYMM(createdon), toYYYYMM(modified. Expected one of: storage definition, ENGINE, AS. Stack trace:
...
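The root cause is where Spark puts `createTableOptions`: the JDBC writer builds roughly `CREATE TABLE <table> (<columns>) <createTableOptions>`, so everything in that option lands *after* the closing parenthesis of the column list. ClickHouse only accepts a storage definition (ENGINE, ORDER BY, ...) in that position, while an `INDEX` clause must appear *inside* the column list. A sketch of the generated DDL (the column list is hypothetical; Spark derives it from the dataframe schema):

```python
# Rough shape of the statement Spark's JDBC writer emits.
columns = "prospectid Int64, modifiedon DateTime"  # illustrative schema
create_table_options = ("index modifiedon_9854 modifiedon type minmax "
                        "granularity 4 engine=MergeTree()")
ddl = f"CREATE TABLE my_table ({columns}) {create_table_options}"
print(ddl)
# The INDEX clause ends up after ")", where ClickHouse expects a storage
# definition -- hence "Expected one of: storage definition, ENGINE, AS".
```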

Working code snippet so far

df.write \
            .format("jdbc") \
            .mode("overwrite") \
            .option("driver", "com.github.housepower.jdbc.ClickHouseDriver") \
            .option("url", self.parameters["connection_properties"]["url"]) \
            .option("createTableOptions", "engine=MergeTree()"
                                          "primary key (prospectid)"
                                          " order by (prospectid,toYYYYMM(createdon), toYYYYMM(modifiedon))"
                                          " partition by (prospectstage)") \
            .option("user", "...") \
            .option("password", "...") \
            .option("dbtable", "...") \
            .option("batchsize", "20000") \
            .option("truncate", "true") \
            .save()

Does anyone know how we can achieve this from Spark?

swarupsarangi113 commented 2 years ago

A workaround I used is the clickhouse-driver Python library: create the table with the skip index up front, then load the data into it from Spark.
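The workaround above can be sketched as follows. The database, table, and column names are assumptions for illustration; the point is that the `INDEX` clause sits inside the column list of the hand-written DDL, after which Spark writes with `mode("append")` so it does not recreate the table:

```python
# Hand-written DDL with the skip index inside the column list
# (table/column names are hypothetical).
ddl = """
CREATE TABLE IF NOT EXISTS my_db.my_table
(
    prospectid    Int64,
    prospectstage String,
    createdon     DateTime,
    modifiedon    DateTime,
    INDEX modifiedon_9854 modifiedon TYPE minmax GRANULARITY 4
)
ENGINE = MergeTree()
PRIMARY KEY (prospectid)
ORDER BY (prospectid, toYYYYMM(createdon), toYYYYMM(modifiedon))
PARTITION BY (prospectstage)
"""
print(ddl)

# Execute it with clickhouse-driver (pip install clickhouse-driver):
# from clickhouse_driver import Client
# Client(host="...").execute(ddl)
#
# Then write from Spark into the existing table, e.g.:
# df.write.format("jdbc").mode("append")...option("dbtable", "my_table").save()
```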