ClickHouse / clickhouse-java

Java client and JDBC driver for ClickHouse
https://clickhouse.com
Apache License 2.0

Error writing request body to server, server ClickHouseNode #1410

Open ekatabavkar opened 11 months ago

ekatabavkar commented 11 months ago

We are trying to load data into a ReplicatedMergeTree table in ClickHouse via Spark from a Google Dataproc cluster. However, we are getting an intermittent error while inserting data using the JDBC driver 0.4.6 (shaded).

ClickHouse version: 23.4.6.25, Spark version: 2.4.8
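For reference, the write path looks roughly like this. This is a minimal sketch with placeholder names (bucket, table, credentials are not from the actual job); the `async_insert`/`wait_for_async_insert` settings in the URL are an assumption based on the node options visible in the error message below:

```java
// Sketch of a Spark JDBC write to ClickHouse (placeholder names throughout,
// not the actual job). URL settings mirror the node options shown in the
// error: async_insert=1, wait_for_async_insert=1.
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class ClickHouseSparkInsert {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("clickhouse-insert")
                .getOrCreate();

        // Placeholder input; the real source data is not shown here.
        Dataset<Row> df = spark.read().parquet("gs://my-bucket/input");

        Properties props = new Properties();
        props.setProperty("driver", "com.clickhouse.jdbc.ClickHouseDriver");
        props.setProperty("user", "default");
        props.setProperty("password", "");

        df.write()
          .mode(SaveMode.Append)
          .option("batchsize", "10000") // rows per executeBatch() call
          .jdbc("jdbc:clickhouse://dummyclickhouseserver:8123/default"
                        + "?async_insert=1&wait_for_async_insert=1",
                "target_table", props);
    }
}
```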


The error is shown below (sensitive data is hidden):

[Stage 4:=========>                                             (24 + 15) / 136]23/08/02 13:08:51 WARN 
org.apache.spark.scheduler.TaskSetManager: Lost task 24.0 in stage 4.0 (TID 5320, dataproc_cluster, executor 15): java.sql.BatchUpdateException: Error writing request body to server, server ClickHouseNode [uri=http://dummyclickhouseserver:8123/default, options={async_insert=1,wait_for_async_insert=1}]@668945829
        at com.clickhouse.jdbc.SqlExceptionUtils.batchUpdateError(SqlExceptionUtils.java:107)
        at com.clickhouse.jdbc.internal.InputBasedPreparedStatement.executeAny(InputBasedPreparedStatement.java:154)
        at com.clickhouse.jdbc.internal.AbstractPreparedStatement.executeLargeBatch(AbstractPreparedStatement.java:85)
        at com.clickhouse.jdbc.internal.ClickHouseStatementImpl.executeBatch(ClickHouseStatementImpl.java:754)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:671)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:840)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:838)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:980)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:980)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2116)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:414)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:417)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

There are no errors in the ClickHouse server logs. When this error occurs, Spark retries the failed task, which eventually succeeds, resulting in duplicate data. When does this error occur, and how can we prevent the duplicate data?

zhicwu commented 11 months ago

Hi @ekatabavkar, sorry for the late reply.

> There are no errors in the ClickHouse server logs.

If we can confirm that the query itself has no issue, then it's probably just network-related. You may want to check ClickHouse/clickhouse-docs#1178 and see if it helps.
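One common mitigation for transient network errors (an assumption on my part, not necessarily what that issue recommends) is to raise the client-side timeouts, so short stalls between Dataproc and ClickHouse don't surface as write failures. Illustrative values only:

```java
// Hedged sketch: widen client-side timeouts to ride out transient network
// stalls. Host name is the placeholder from the error message above.
import java.sql.Connection;
import java.sql.DriverManager;

public class TimeoutExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:clickhouse://dummyclickhouseserver:8123/default"
                + "?connect_timeout=10000"   // ms to establish the connection
                + "&socket_timeout=300000";  // ms to wait on reads/writes
        try (Connection conn = DriverManager.getConnection(url, "default", "")) {
            // ... run the inserts as usual
        }
    }
}
```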

> When does this error occur, and how can we prevent the duplicate data?

See https://kb.altinity.com/altinity-kb-schema-design/insert_deduplication/.
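For context, insert deduplication on Replicated*MergeTree tables works at the block level: if a client retry re-sends an identical insert block, the server silently drops it. A minimal sketch, assuming a placeholder `target_table` and passing the setting via the URL:

```java
// Sketch of block-level insert deduplication on a Replicated*MergeTree
// table: re-sending the exact same insert block is silently dropped, so a
// client-side retry does not duplicate rows. Names are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DedupExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:clickhouse://dummyclickhouseserver:8123/default"
                + "?insert_deduplicate=1"; // default for replicated tables
        try (Connection conn = DriverManager.getConnection(url, "default", "");
             Statement stmt = conn.createStatement()) {
            String insert = "INSERT INTO target_table VALUES (1, 'a'), (2, 'b')";
            stmt.execute(insert);
            stmt.execute(insert); // identical block: deduplicated, no new rows
        }
    }
}
```

Note that this only protects retries that resend identical blocks; a retried Spark task benefits only if it reproduces the same batches in the same order.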

mshustov commented 11 months ago

> ...resulting in duplicate data. When does this error occur, and how can we prevent the duplicate data?

@ekatabavkar you should disable async_insert to avoid problems with deduplication. See https://clickhouse.com/docs/en/optimize/asynchronous-inserts:

> **Automatic deduplication is disabled by default when using asynchronous inserts.** Manual batching (see the section above) has the advantage that it supports the built-in automatic deduplication of table data if (exactly) the same insert statement is sent multiple times to ClickHouse Cloud, for example, because of an automatic retry in client software because of some temporary network connection issues.
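A minimal sketch of the synchronous, manually batched alternative the quoted docs describe; table and column names are hypothetical:

```java
// Sketch: drop async_insert so that replicated-table insert deduplication
// applies again, and batch manually on the client instead.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class SyncBatchInsert {
    public static void main(String[] args) throws Exception {
        // No async_insert in the URL: inserts are synchronous, and each
        // batch forms a block that the server can deduplicate on retry.
        String url = "jdbc:clickhouse://dummyclickhouseserver:8123/default";
        try (Connection conn = DriverManager.getConnection(url, "default", "");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO target_table (id, value) VALUES (?, ?)")) {
            for (int i = 0; i < 10_000; i++) { // manual batching
                ps.setInt(1, i);
                ps.setString(2, "row-" + i);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```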