apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0

[Bug] [Checkpoint] Data duplication when sink with xa transaction restore from checkpoint #5641

Open junmingliu opened 11 months ago

junmingliu commented 11 months ago

What happened

Data is duplicated when a JDBC sink that uses XA transactions is restored from a checkpoint. The server log is attached: server.log

SeaTunnel Version

dev branch. The commit ID is shown in the attached image: image

SeaTunnel Config

env {
  # You can set flink configuration here
  job.mode = "BATCH"
  checkpoint.interval = "10000"
  checkpoint.timeout = 9000000
}
source {
    Jdbc {
        url = "jdbc:mysql://XXX:3306/XXX?serverTimezone=GMT%2b8&useCompression=true&useSSL=false&useCursorFetch=true&allowPublicKeyRetrieval=true"
        driver = "com.mysql.cj.jdbc.Driver"
        connection_check_timeout_sec = 100
        user = "XXX"
        password = "XXX"
        partition_column = "id"
        partition_num = 20
        fetch_size = 5000
        query = "select * from indicator_bigdata limit 8000000"
        parallelism = 2
    }
}

transform {
    # If you would like to get more information about how to configure seatunnel and see full list of transform plugins,
    # please go to https://seatunnel.apache.org/docs/transform/sql
}

sink {
    jdbc {
        url = "jdbc:postgresql://XXX:5432/postgres"
        driver = "org.postgresql.Driver"
        user = "XXX"
        password = "XXX"
        batch_size = 5000
        batch_inteval_ms = 0
        database = postgres
        table = public.indicator_bigdata
        generate_sink_sql = true
        is_exactly_once = true
        xa_data_source_class_name = "org.postgresql.xa.PGXADataSource"
        max_commit_attempts = 3
        transaction_timeout_sec = 86400
    }
  # If you would like to get more information about how to configure seatunnel and see full list of sink plugins,
  # please go to https://seatunnel.apache.org/docs/category/sink-v2
}

Running Command

First, at 2023-10-16 18:56:40, I ran: ./bin/seatunnel.sh -c ../mysql2pg.template

Second, at 2023-10-16 18:58, I ran: ./bin/seatunnel.sh -s 766253013990899713
The command completed at 2023-10-16 19:03.

Finally, at 2023-10-16 19:08, I ran: ./bin/seatunnel.sh -c ../mysql2pg.template -r 766253013990899713

Error Exception

Data is duplicated.
Detail: Key "(id)=(2000026)" already exists.  Call getNextException to see other errors in the batch.
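
For readers unfamiliar with this error, here is a minimal Python sketch of one way it can arise. It is not SeaTunnel's code: it models a batch that was already committed before the job stopped being written again after the restore, at which point the primary key rejects it. An in-memory SQLite table stands in for PostgreSQL, the `value` column and the `write_batch` helper are hypothetical, and whether this replay is the actual cause in this report is an assumption.

```python
import sqlite3


def write_batch(conn, rows):
    """Insert one batch in a single transaction (hypothetical helper)."""
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.executemany(
            "INSERT INTO indicator_bigdata (id, value) VALUES (?, ?)", rows
        )


# SQLite stands in for the PostgreSQL sink table; the schema is assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE indicator_bigdata (id INTEGER PRIMARY KEY, value TEXT)")

batch = [(2000026, "a"), (2000027, "b")]

# First run: the batch is committed before the job is stopped.
write_batch(conn, batch)

# After restore: the same batch is replayed because the restored state
# (by assumption) does not record that it was already committed.
try:
    write_batch(conn, batch)
    replay_error = None
except sqlite3.IntegrityError as exc:
    replay_error = str(exc)

row_count = conn.execute("SELECT COUNT(*) FROM indicator_bigdata").fetchone()[0]
print("replay error:", replay_error)
print("rows in table:", row_count)
```

In this toy model the replayed batch fails as a whole and the table keeps only the rows from the first commit, which matches the shape of the key-already-exists error above.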

Zeta or Flink or Spark Version

No response

Java or Scala Version

No response

Screenshots

No response

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.