apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0

When extracting Hive data to MySQL, the parallelism parameter setting does not take effect. #6830

Open luckyliush opened 1 month ago

luckyliush commented 1 month ago

Search before asking

What happened

The dataset is 15 million rows. Extracting Hive data to MySQL reaches only about 13,000 records per second, so the extraction takes nearly 30 minutes to complete. I set the SeaTunnel parallelism parameter to 10, but after the extraction job started, running show processlist on the MySQL client showed only one insert thread. I then tested extracting the same Hive data to the Console sink, and the same volume of data took less than 1 minute.

SeaTunnel Version

2.3.1 and 2.3.5

SeaTunnel Config

env {
  execution.parallelism = 10
  job.mode = "BATCH"
}

source {
  Hive {
    table_name = ""
    metastore_uri = "${metastore_uri}"
    result_table_name = "Table_test"
    hdfs_site_path = ""
    hive_site_path = ""
  }
}

transform {
  sql {
    source_table_name = "Table_test"
    query = "select xxx from Table_test"
    result_table_name = "Table_test2"
  }
}

sink {
  Jdbc {
    url = "${url}"
    driver = "${driver}"
    user = "${user}"
    password = "${password}"
    database = "${mysql_db}"
    batch_size = 20000
    table = ""
    query = "insert into table_name (xxx) values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
  }
}

Running Command

I run it in cluster mode via the SeaTunnel task on DolphinScheduler.
I also submitted it from the command line with: ./bin/seatunnel.sh --config ./config/test.config

Error Exception

No exception occurred; the job just did not perform as expected.

Zeta or Flink or Spark Version

Zeta

Java or Scala Version

1.8

Screenshots

No response

Are you willing to submit PR?

Code of Conduct

liunaijie commented 1 month ago

Try adding the rewriteBatchedStatements=true parameter to your JDBC URL.
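
For illustration, a minimal sketch of what that change looks like (host, port, and database name below are placeholders, not taken from the report):

```
# hypothetical endpoint; append the parameter to the existing JDBC URL
url = "jdbc:mysql://mysql-host:3306/mydb?rewriteBatchedStatements=true"
```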

luckyliush commented 1 month ago

> Try adding the rewriteBatchedStatements=true parameter to your JDBC URL.

Thank you, but this parameter was already added; it did not meet expectations.

liunaijie commented 1 month ago

> > Try adding the rewriteBatchedStatements=true parameter to your JDBC URL.
>
> Thank you, but this parameter was already added; it did not meet expectations.

plugin_name = jdbc
user = xxxxx
url = "jdbc:mysql://xxxxxx/xxxxxxx?allowMultiQueries=true&useUnicode=true&characterEncoding=UTF-8&serverTimezone=Asia/Shanghai&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true&useSSL=false"
enable_upsert = true
generate_sink_sql = true
database = db_name
table = table_name
primary_keys = [xxx,xxx]

Try this config; it will auto-generate the insert SQL. I use this config and the write speed is good.
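
For reference, a rough sketch of how those options could slot into the sink block of the original config (placeholders carried over from above; with generate_sink_sql = true the hand-written query option is replaced by database/table):

```
sink {
  Jdbc {
    url = "${url}"            # with rewriteBatchedStatements=true appended
    driver = "${driver}"
    user = "${user}"
    password = "${password}"
    database = "${mysql_db}"
    table = "table_name"      # target table; the connector generates the statement
    generate_sink_sql = true
    enable_upsert = true
    primary_keys = [xxx, xxx]
    batch_size = 20000
  }
}
```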

luckyliush commented 1 month ago

> Try this config; it will auto-generate the insert SQL. I use this config and the write speed is good.

```
env {
  execution.parallelism = 10
  job.mode = "BATCH"
}

source {
  Hive {
    table_name = ""
    metastore_uri = ""
    result_table_name = "Table_test"
    hdfs_site_path = "/home/hadoop/hadoop-3.2.2/etc/hadoop/hdfs-site.xml"
    hive_site_path = "/home/hadoop/hive-2.3.9/conf/hive-site.xml"
  }
}

transform {
  sql {
    source_table_name = "Table_test"
    query = "select xxx,xxx from Table_test"
    result_table_name = "Table_test2"
  }
}

sink {
  Jdbc {
    url = "jdbc:mysql://xxx:3306/xxx?allowMultiQueries=true&useUnicode=true&characterEncoding=UTF-8&serverTimezone=Asia/Shanghai&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true&useSSL=false"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "root"
    password = "xxx"
    enable_upsert = true
    generate_sink_sql = true
    database = "xxx"
    primary_keys = [xxx,xxx,xxx]
    table = "xxx"
  }
}
```

The version I am using is 2.3.1, and my configuration is shown above, but the extraction speed is the same as before and has not improved. Is there something wrong with my configuration? Approximately 12,000 rows per second are extracted. I set the primary_keys parameter to the grain fields of the Hive table, but these fields are not defined as primary keys in the MySQL table. Does this have any impact?

luckyliush commented 1 month ago

> Try this config; it will auto-generate the insert SQL. I use this config and the write speed is good.

Hello, in seatunnel-2.3.5, with the same configuration, the parallelism parameter does not take effect. But after adding the parameter read_limit.rows_per_second=10000 on seatunnel-2.3.5, the parallelism parameter takes effect and the extraction speed improves significantly. Do you know the reason?
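
For reference, a minimal sketch of the env block with that option added, based on the config earlier in this thread (the limit value is the one quoted in this comment):

```
env {
  execution.parallelism = 10
  job.mode = "BATCH"
  # rate-limit option; per this comment, adding it on 2.3.5 made the parallelism setting take effect
  read_limit.rows_per_second = 10000
}
```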

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

Dsrong commented 3 weeks ago

I had the same issue on version 2.3.5: the parallelism parameter does not work, and only one parallel task is running.