apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0
[Bug] [hive-sink] 2PC Task failure processing logic, class HiveSinkAggregatedCommitter method abort #4920

Closed WilliamTan778 closed 1 year ago

WilliamTan778 commented 1 year ago

What happened

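When a task fails, the 2PC abort method of HiveSinkAggregatedCommitter calls dropPartitions. Although the underlying data is not deleted, a partition that existed before the job may no longer be registered afterwards, so the business can no longer use it. Should the partition be left in place instead of being dropped?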
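
The concern can be illustrated with a small toy model. This is not SeaTunnel code: `ToyMetastore` and `abort` are hypothetical names, and the model assumes, as described above, that the abort path drops every partition named in the failed commit while leaving the files in place.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMetastore:
    """Hypothetical stand-in for a Hive metastore: partition metadata plus files."""
    partitions: set = field(default_factory=set)  # registered partition names
    files: dict = field(default_factory=dict)     # partition name -> data files

    def add_partition(self, name):
        self.partitions.add(name)
        self.files.setdefault(name, [])

    def drop_partition(self, name):
        # Assumption: only the metadata entry is removed; files are untouched.
        self.partitions.discard(name)


def abort(metastore, touched_partitions):
    """Model of the reported behaviour: drop every partition the failed job touched."""
    for name in touched_partitions:
        metastore.drop_partition(name)


ms = ToyMetastore()
ms.add_partition("dt=2023-06-01")                 # existed before the job started
ms.files["dt=2023-06-01"].append("old-data.txt")  # written by an earlier job

# The failed job appended to the existing partition, then aborted.
abort(ms, ["dt=2023-06-01"])

partition_visible = "dt=2023-06-01" in ms.partitions
files_remain = ms.files["dt=2023-06-01"]
print(partition_visible, files_remain)
```

In this model the old file is still on disk, but queries that rely on partition metadata would no longer see it.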

SeaTunnel Version

2.3.1

SeaTunnel Config

env {
  execution.parallelism = 3
  job.name="test_hive_source_to_hive"
}

source {
  Hive {
    result_table_name = "fake"
    table_name = "default.test_hive_source"
    metastore_uri =  "thrift://172.30.1.185:9083"
  }
}

transform {
  Sql {
    source_table_name = "fake"
    result_table_name = "fake1"
    query = "select test_tinyint,test_smallint,test_int,test_bigint, test_boolean,test_float,test_double,test_string, test_binary,test_timestamp,test_decimal,test_char,test_varchar,test_date,    test_par1 as test_par2,test_par2 as test_par1 from fake"
  }
}

sink {
  Hive {
    source_table_name = "fake1"
    table_name = "default.test_hive_sink_text_simple1"
    metastore_uri =  "thrift://172.30.1.185:9083"
  }
}

Running Command

./bin/start-seatunnel-spark-3-connector-v2.sh \
--master "local[*]" \
--deploy-mode client \
--config test/hive.conf

Error Exception

No error is thrown; this is a problem in the failure-handling logic.

Flink or Spark Version

spark-3.3.0-bin-hadoop2

Java or Scala Version

Corretto-11.0.19.7.1

zhilinli123 commented 1 year ago

PTAL: @TyrantLucifer

lightzhao commented 1 year ago

+1. Deleting a partition puts all the files under that partition into the recycle bin. This is a heavy operation, and it carries a risk of data loss.

zhilinli123 commented 1 year ago

Currently, the underlying data is not deleted; only the partition's metadata is deleted. But I do not understand why only the metadata is deleted. What is the purpose of that? PTAL: @TyrantLucifer

lightzhao commented 1 year ago

The dropPartition method will move data to the trash, not just metadata.

EricJoy2048 commented 1 year ago

Yeah, dropping the partition is not a good approach. Sometimes people only append data to existing partitions. If we drop the partition, we may delete data files that were not written by this job.

Can you update the code and fix it?
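
One possible direction, sketched only to make the idea concrete: on abort, drop just the partitions the job itself created and leave pre-existing ones alone. The function name and the assumption that the job can record which partitions existed before it started are both hypothetical; this is not the committer's actual logic.

```python
def partitions_to_drop_on_abort(existing_before_job, touched_by_job):
    """Return only the partitions that the failed job itself created.

    existing_before_job: partition names registered before the job started
    touched_by_job: partition names the failed commit wrote to
    """
    existing = set(existing_before_job)
    # Preserve the input order so the rollback is deterministic.
    return [p for p in touched_by_job if p not in existing]


existing = ["dt=2023-06-01", "dt=2023-06-02"]
touched = ["dt=2023-06-02", "dt=2023-06-03"]

# Only dt=2023-06-03 is new, so it is the only rollback candidate.
print(partitions_to_drop_on_abort(existing, touched))
```

This only protects partition metadata. Files the failed job appended to an existing partition would still need their own cleanup.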

lightzhao commented 1 year ago

OK, I'll try to fix it.

TyrantLucifer commented 1 year ago

IMO, this action is useful when the user writes new partitions. 2PC covers the whole lifecycle, so it is better to add a parameter that controls this behaviour than to remove it.
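
A rough sketch of what such a switch could look like, as a toy model only. The option name `drop_partitions_on_abort` is hypothetical and is not an existing SeaTunnel setting. The default chosen here is arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortOptions:
    # Hypothetical option; the real parameter name and default would be
    # decided by the connector maintainers.
    drop_partitions_on_abort: bool = False


def partitions_after_abort(options, registered, touched):
    """Return the set of partitions still registered after an abort."""
    if not options.drop_partitions_on_abort:
        # Switch off: partition metadata is left untouched.
        return set(registered)
    # Switch on: every partition the failed job touched is dropped.
    return set(registered) - set(touched)


registered = {"dt=2023-06-01", "dt=2023-06-02"}
touched = ["dt=2023-06-02"]

kept_when_off = partitions_after_abort(AbortOptions(False), registered, touched)
kept_when_on = partitions_after_abort(AbortOptions(True), registered, touched)
print(sorted(kept_when_off), sorted(kept_when_on))
```

With the switch on, the behaviour matches what the issue reports. With it off, jobs that append to existing partitions keep their partition metadata after a failure.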

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] commented 1 year ago

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.