apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0

Export hive data to clickhouse cluster by seatunnel, and the data is always imported to only one clickhouse node. #5435

Closed Toroidals closed 11 months ago

Toroidals commented 1 year ago

Search before asking

What happened

Export hive data to clickhouse cluster by seatunnel, and the data is always imported to only one clickhouse node.

SeaTunnel Version

seatunnel: apache-seatunnel-2.3.2
spark: 3.3.1
clickhouse: 22.8.16.32

SeaTunnel Config

env {
  execution.parallelism = 3
  job.mode = "BATCH"
  spark.sql.catalogImplementation = "hive"
  spark.app.name = "seatunnel-hive-to-ck_xxx"
  spark.yarn.queue = "default"
  spark.executor.instances = 16
  spark.executor.cores = 2
  spark.driver.memory = "3g"
  spark.executor.memory = "14g"
  spark.hadoop.hive.exec.dynamic.partition = "true"
  spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict"
  spark.sql.sources.partitionOverwriteMode = "dynamic"
  spark.executor.extraJavaOptions = "-Dfile.encoding=UTF-8"
  spark.driver.extraJavaOptions = "-Dfile.encoding=UTF-8"
}

source {

  Hive {
    metastore_uri = "thrift://xx01:9083,thrift://xx02:9083,thrift://xx03:9083"
    table_name = "dm.xxx"
    result_table_name = "source_table"
    parallelism = 16
  }

}

transform {
  Sql {
    source_table_name = "source_table"
    result_table_name = "sink_table"
    query = "select * from source_table"
  }
}

sink {
  Clickhouse {
    host = "xx01:8123,xx02:8123,xx03:8123,xx04:8123,xx05:8123"
    database = "dm"
    table = "xxx"
    username = "xx"
    password = "xxxx"
    parallelism = 16
    clickhouse.config = {
      max_rows_to_read = "100"
      read_overflow_mode = "throw"
      bulk_size = 100000
      retry = 3
    }
  }
}

Running Command

/usr/local/apache-seatunnel-2.3.2/bin/start-seatunnel-spark-3-connector-v2.sh --master yarn --deploy-mode client --config  /usr/local/apache-seatunnel-2.3.2/config/xxx.conf

Error Exception

I have tried importing data into ClickHouse multiple times, but each time the data is written only to the first node in the configured ClickHouse host list.

Zeta or Flink or Spark Version

No response

Java or Scala Version

No response

Screenshots

No response

Are you willing to submit PR?

Code of Conduct

Toroidals commented 1 year ago

In `org.apache.seatunnel.connectors.seatunnel.clickhouse.sink.client.ClickhouseSink`, the node selection is hard-coded to the first entry in the node list via `nodes.get(0)`.

Toroidals commented 1 year ago

I added the following code: `// randomly generate an index` `Random random = new Random(); ClickHouseNode clickHouseNode = nodes.get(random.nextInt(nodes.size()));` and replaced every occurrence of `nodes.get(0)` with it.
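The workaround described above can be sketched as a small self-contained example. The `RandomNodePicker` class and plain host strings here are illustrative only; the actual connector works with `ClickHouseNode` objects inside `ClickhouseSink`:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Sketch of the proposed fix: instead of always taking nodes.get(0),
// pick a random node from the configured list for each write.
public class RandomNodePicker {
    private final List<String> nodes;
    private final Random random = new Random();

    public RandomNodePicker(List<String> nodes) {
        this.nodes = nodes;
    }

    public String pick() {
        // A random index spreads writes across all configured nodes,
        // rather than hard-coding index 0.
        return nodes.get(random.nextInt(nodes.size()));
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("xx01:8123", "xx02:8123", "xx03:8123");
        RandomNodePicker picker = new RandomNodePicker(nodes);
        for (int i = 0; i < 10; i++) {
            String node = picker.pick();
            if (!nodes.contains(node)) {
                throw new AssertionError("picked unknown node: " + node);
            }
        }
        System.out.println("ok");
    }
}
```

Note that random selection only balances which node receives the insert; it does not guarantee even data distribution across shards the way a proper sharding scheme does.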

Carl-Zhou-CN commented 1 year ago

@Toroidals hi, the ClickHouse sink only supports sharding when the target table is a Distributed table; see the `split_mode` parameter.
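A minimal sink configuration with `split_mode` enabled might look like the sketch below. It assumes the target table `dm.xxx` uses the Distributed engine; the `sharding_key` column `id` is a hypothetical placeholder, and hosts follow the issue's placeholders:

```
sink {
  Clickhouse {
    host = "xx01:8123,xx02:8123,xx03:8123"
    database = "dm"
    table = "xxx"             # must be a Distributed-engine table when split_mode is on
    username = "xx"
    password = "xxxx"
    split_mode = true         # split data and write to the shard-local tables
    sharding_key = "id"       # hypothetical column; sharding is random if omitted
  }
}
```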

Toroidals commented 1 year ago

> The ClickHouse sink only supports sharding when the target table is a Distributed table; see the `split_mode` parameter.

Writing to ClickHouse through a Distributed table requires ZooKeeper, and in earlier versions large data volumes could crash ZooKeeper. Can the same issue still occur in the current version?

Carl-Zhou-CN commented 1 year ago

> Writing to ClickHouse through a Distributed table requires ZooKeeper, and in earlier versions large data volumes could crash ZooKeeper. Can the same issue still occur in the current version?

I don't think that problem will occur, because the ClickHouse sink essentially writes to each shard's local table rather than going through the Distributed table.

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] commented 11 months ago

This issue has been closed because it has not received a response for a long time. You could reopen it if you encounter similar problems in the future.