apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0

[Bug] [Module Name] Hive2CK using CK File generate data_local.log slow #5819

Closed viverlxl closed 10 months ago

viverlxl commented 10 months ago

What happened

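I am syncing Hive data to ClickHouse (CK) with the ClickhouseFile sink. The table holds about 23 million rows (2.3kw). The same job takes 40 minutes with the ClickHouse JDBC sink but 8 hours with ClickhouseFile.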

According to the Spark log, generating data_local.log took 7h, and turning data_local.log into the CK table files took 5h.

SeaTunnel Version

2.3.3

SeaTunnel Config

env {
  spark.app.name = "hive_to_ck_file"
  spark.executor.instances = 4
  spark.executor.cores = 1
  spark.executor.memory = "3g"
  // This configuration is required
  spark.sql.catalogImplementation = "hive"
  spark.executor.extraJavaOptions = "-Dfile.encoding=UTF-8"
  spark.driver.extraJavaOptions = "-Dfile.encoding=UTF-8"
  spark.hadoop.hive.exec.dynamic.partition = "true"
  spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict"
  spark.debug.maxToStringFields = 100000
}
source {
    hive {
        table_name = "dms_ddcx.xx_table"
        metastore_uri = "thrift://xxxxx:9083"
        result_table_name = "table"
        parallelism = 4
        read_partitions = ["dt=2023-10-30"]
    }
}
transform {}
sink {
    ClickhouseFile {
        host = "xxx:8123"
        server_time_zone = "Asia/Shanghai"
        database = "dms_ddcx"
        parallelism = 4
        table = "xxxx"
        source_table_name = "xxxx"
        sharding_key = "diversion_id"
        username = "default"
        password = ""
        node_free_password = true
        clickhouse_local_path = "/opt/software/clickhouse local"
        node_pass = []
    }
}

Running Command

./bin/start-seatunnel-spark-3-connector-v2.sh --master yarn --deploy-mode cluster --config config/hive_to_ck_test.config

Error Exception

no error

Zeta or Flink or Spark Version

spark 3.2.4

Java or Scala Version

java 1.8

Screenshots

image

viverlxl commented 10 months ago

Using an mmap buffer solves this issue.

image

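Writing 23 million rows (2.3kw) to data_local.log went from 5 hours to 4 minutes. On the surface, the performance problem comes from initializing a new mmap for every single record.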
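
For readers who cannot see the screenshots, here is a minimal sketch of the two patterns described above. It is not the SeaTunnel code or the actual patch: the class name `MmapWriteSketch`, both method names, and the `regionSize` parameter are made up for illustration. The first method maps a fresh region for every record; the second maps one larger region and remaps only when it is full.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names here do not come from SeaTunnel.
class MmapWriteSketch {

    private static FileChannel open(Path file) throws IOException {
        return FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    // Slow pattern: one map() call (and one mapping setup) per record.
    static void writePerRecordMapping(Path file, List<String> rows) throws IOException {
        try (FileChannel ch = open(file)) {
            long pos = 0;
            for (String row : rows) {
                byte[] bytes = (row + "\n").getBytes(StandardCharsets.UTF_8);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, pos, bytes.length);
                buf.put(bytes);
                pos += bytes.length;
            }
        }
    }

    // Faster pattern: map a large region once and remap only when it fills up.
    static void writeWithReusedMapping(Path file, List<String> rows, int regionSize)
            throws IOException {
        try (FileChannel ch = open(file)) {
            long regionStart = 0;
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, regionStart, regionSize);
            for (String row : rows) {
                byte[] bytes = (row + "\n").getBytes(StandardCharsets.UTF_8);
                if (bytes.length > regionSize) {
                    throw new IllegalArgumentException("record larger than mapped region");
                }
                if (buf.remaining() < bytes.length) {
                    // Start the next region right after the last byte written.
                    regionStart += buf.position();
                    buf = ch.map(FileChannel.MapMode.READ_WRITE, regionStart, regionSize);
                }
                buf.put(bytes);
            }
            // Mapping extended the file past the data; trim the unused tail.
            // Note: truncating a file that is still mapped may fail on Windows.
            ch.truncate(regionStart + buf.position());
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            rows.add(i + "\tvalue_" + i);
        }
        Path slow = Files.createTempFile("per_record", ".log");
        Path fast = Files.createTempFile("reused", ".log");
        writePerRecordMapping(slow, rows);
        writeWithReusedMapping(fast, rows, 1 << 20);
        System.out.println("per-record file size: " + Files.size(slow));
        System.out.println("reused-map file size: " + Files.size(fast));
    }
}
```

Both methods produce the same file contents; they differ only in how many times `FileChannel.map` is called. With millions of rows, the per-record variant pays the mapping cost millions of times, which matches the symptom reported here.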

viverlxl commented 10 months ago
image

Comparison of the original code and the modified code.

viverlxl commented 10 months ago

@Hisoka-X Can I fix this issue?

Hisoka-X commented 10 months ago

Sure! Looking forward to your PR! @viverlxl

viverlxl commented 10 months ago

After fixing this issue:

image

Generating the data_local.log file for 23 million (2.3kw) records now takes 4 minutes.