apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0

Export hive data to clickhouse cluster by seatunnel, and the data is always imported to only one clickhouse node. #5435

Closed Toroidals closed 11 months ago

Toroidals commented 1 year ago

Search before asking

What happened

Export hive data to clickhouse cluster by seatunnel, and the data is always imported to only one clickhouse node.

SeaTunnel Version

seatunnel: apache-seatunnel-2.3.2
spark: 3.3.1
clickhouse: 22.8.16.32

SeaTunnel Config

env {
  execution.parallelism = 3
  job.mode = "BATCH"
  spark.sql.catalogImplementation = "hive"
  spark.app.name = "seatunnel-hive-to-ck_xxx"
  spark.yarn.queue = "default"
  spark.executor.instances = 16
  spark.executor.cores = 2
  spark.driver.memory = "3g"
  spark.executor.memory = "14g"
  spark.hadoop.hive.exec.dynamic.partition = "true"
  spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict"
  spark.sql.sources.partitionOverwriteMode = "dynamic"
  spark.executor.extraJavaOptions = "-Dfile.encoding=UTF-8"
  spark.driver.extraJavaOptions = "-Dfile.encoding=UTF-8"
}

source {

  Hive {
    metastore_uri = "thrift://xx01:9083,thrift://xx02:9083,thrift://xx03:9083"
    table_name = "dm.xxx"
    result_table_name = "source_table"
    parallelism = 16
  }

}

transform {
  Sql {
    source_table_name = "source_table"
    result_table_name = "sink_table"
    query = "select * from source_table"
  }
}

sink {
  Clickhouse {
    host = "xx01:8123,xx02:8123,xx03:8123,xx04:8123,xx05:8123"
    database = "dm"
    table = "xxx"
    username = "xx"
    password = "xxxx"
    parallelism = 16
    clickhouse.config = {
      max_rows_to_read = "100"
      read_overflow_mode = "throw"
      bulk_size = 100000
      retry = 3
    }
  }
}

Running Command

/usr/local/apache-seatunnel-2.3.2/bin/start-seatunnel-spark-3-connector-v2.sh --master yarn --deploy-mode client --config  /usr/local/apache-seatunnel-2.3.2/config/xxx.conf

Error Exception

I have tried importing data into ClickHouse multiple times, but each time the data is written only to the first node in the configured ClickHouse host list.

Zeta or Flink or Spark Version

No response

Java or Scala Version

No response

Screenshots

No response

Are you willing to submit PR?

Code of Conduct

Toroidals commented 1 year ago

In `org.apache.seatunnel.connectors.seatunnel.clickhouse.sink.client.ClickhouseSink`, the node selection is hard-coded to the first entry in the node list via `nodes.get(0)`.

Toroidals commented 1 year ago

I added the following code: `// randomly generate an index` `Random random = new Random(); ClickHouseNode clickHouseNode = nodes.get(random.nextInt(nodes.size()));` and replaced every occurrence of `nodes.get(0)` with it.
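The workaround described above can be sketched as a small self-contained example. The `RandomNodePicker` class and plain host strings here are illustrative only; the actual connector works with `ClickHouseNode` objects inside `ClickhouseSink`:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Sketch of the proposed fix: instead of always taking nodes.get(0),
// pick a random node from the configured list for each write.
public class RandomNodePicker {
    private final List<String> nodes;
    private final Random random = new Random();

    public RandomNodePicker(List<String> nodes) {
        this.nodes = nodes;
    }

    public String pick() {
        // A random index spreads writes across all configured nodes,
        // rather than hard-coding index 0.
        return nodes.get(random.nextInt(nodes.size()));
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("xx01:8123", "xx02:8123", "xx03:8123");
        RandomNodePicker picker = new RandomNodePicker(nodes);
        for (int i = 0; i < 10; i++) {
            String node = picker.pick();
            if (!nodes.contains(node)) {
                throw new AssertionError("picked unknown node: " + node);
            }
        }
        System.out.println("ok");
    }
}
```

Note that random selection only balances which node receives the insert; it does not guarantee even data distribution across shards the way a proper sharding scheme does.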

Carl-Zhou-CN commented 1 year ago

@Toroidals hi, the ClickHouse sink only supports sharding when the target table is a Distributed table; see the `split_mode` parameter.
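A minimal sink configuration with `split_mode` enabled might look like the sketch below. It assumes the target table `dm.xxx` uses the Distributed engine; the `sharding_key` column `id` is a hypothetical placeholder, and hosts follow the issue's placeholders:

```
sink {
  Clickhouse {
    host = "xx01:8123,xx02:8123,xx03:8123"
    database = "dm"
    table = "xxx"             # must be a Distributed-engine table when split_mode is on
    username = "xx"
    password = "xxxx"
    split_mode = true         # split data and write to the shard-local tables
    sharding_key = "id"       # hypothetical column; sharding is random if omitted
  }
}
```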

Toroidals commented 1 year ago

> The ClickHouse sink only supports sharding when the target table is a Distributed table; see the `split_mode` parameter.

Writing to ClickHouse through a Distributed table requires ZooKeeper, and in earlier versions large data volumes could crash ZooKeeper. Can the same issue still occur in the current version?

Carl-Zhou-CN commented 1 year ago

> Writing to ClickHouse through a Distributed table requires ZooKeeper, and in earlier versions large data volumes could crash ZooKeeper. Can the same issue still occur in the current version?

I don't think that problem will occur, because the ClickHouse sink essentially writes to each shard's local table rather than going through the Distributed table.

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] commented 11 months ago

This issue has been closed because it has not received a response for a long time. You could reopen it if you encounter similar problems in the future.