ClickHouse / spark-clickhouse-connector

Spark ClickHouse Connector build on DataSourceV2 API
https://clickhouse.com/docs/en/integrations/apache-spark
Apache License 2.0
188 stars 66 forks source link

Fixed ReplacingMergeTree EngineSpec parsing: is_deleted column presence caused error #357

Closed AlexTheKing closed 3 months ago

AlexTheKing commented 3 months ago

Summary

Column for marking rows as deleted (is_deleted column) of ReplacingMergeTree was not anyhow handled during EngineSpec parsing resulting in runtime exception like below:

java.lang.IllegalArgumentException: Expect list size 0 or 1, but got 2
at xenon.clickhouse.parse.ParseUtils$.seqToOption(ParseUtils.scala:32)
at xenon.clickhouse.parse.AstVisitor.visitEngineClause(AstVisitor.scala:78)
at xenon.clickhouse.parse.SQLParser.$anonfun$parseEngineClause$1(SQLParser.scala:30)
at xenon.clickhouse.parse.SQLParser.parse(SQLParser.scala:49)
at xenon.clickhouse.parse.SQLParser.parseEngineClause(SQLParser.scala:30)
at xenon.clickhouse.spec.TableEngineUtils$.resolveTableEngine(TableEngineUtils.scala:24)
at xenon.clickhouse.ClickHouseCatalog.loadTable(ClickHouseCatalog.scala:133)
at xenon.clickhouse.ClickHouseCatalog.loadTable(ClickHouseCatalog.scala:36)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:583)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:566)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:748)

Checklist

Delete items not relevant to your PR:

CLAassistant commented 3 months ago

CLA assistant check
All committers have signed the CLA.

AlexTheKing commented 3 months ago

I'm not sure whether I have something to do with the ClickHouse Cloud tests :)

pan3793 commented 3 months ago

@AlexTheKing the cloud test failure is irrelevant.

instead of dropping the is_deleted column info directly, I'm wondering if we can reserve such info in ReplacingMergeTreeEngineSpec, then we can use such info to filter/process the deleted rows on the spark side later if wanted.

AlexTheKing commented 3 months ago

I can add changes to preserve it in ReplacingMergeTreeEngineSpec if needed

AlexTheKing commented 3 months ago

Added one more commit: now preserving is_deleted column in spec, also added the same for ReplicatedReplacingMergeTree (it had the same issue)

AlexTheKing commented 3 months ago

I fixed test & style, should like OK now

mshustov commented 3 months ago

@mzitnik PTAL

mzitnik commented 3 months ago

Checking Cloud strange behavior, currently will merge PR