apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0

[DISCUSS][Feature][core] Add dirty data management #988

Closed xleoken closed 2 years ago

xleoken commented 2 years ago

Description

We may encounter dirty records when transferring data, so we may need a dirty data management mechanism to handle them. This issue is open for discussion; feel free to share your opinions.

zhaomin1423 commented 2 years ago

How is this work going? I am interested in it, and I am willing to submit a PR.

xleoken commented 2 years ago

> How is this work going? I am interested in it, and I am willing to submit a PR.

Welcome @zhaomin1423.

zhaomin1423 commented 2 years ago

Dirty data management has two aspects. First, we need to be able to handle records one by one. When a batch that contains a few dirty records is written, the whole batch has to be rolled back, so the database must support transactions; after the rollback we can write the batch again record by record to catch the dirty ones. In Spark, we can add a data source strategy that transforms WriteToDataSourceV2 into an extended WriteToDataSourceV2Exec, so that records are handled one by one and dirty data can be managed. Second, we need to implement a JDBC connector based on the DataSourceV2 API.
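To make the rollback-then-retry idea concrete, here is a minimal sketch. It is not Spark or SeaTunnel code: Python's built-in `sqlite3` module stands in for a transactional JDBC sink, and the `users` table and the `write_batch` function are hypothetical names chosen for illustration. The batch is first written in one transaction; if any record fails, the transaction is rolled back and the records are written one by one so the dirty ones can be collected.

```python
import sqlite3

INSERT_SQL = "INSERT INTO users (id, name) VALUES (?, ?)"  # hypothetical table


def write_batch(conn, rows):
    """Write rows in one transaction; on failure, roll back and retry row by row.

    Returns a list of (row, error message) pairs for the dirty rows.
    """
    try:
        # "with conn" commits on success and rolls back if an exception escapes.
        with conn:
            conn.executemany(INSERT_SQL, rows)
        return []
    except sqlite3.Error:
        pass  # the whole batch was rolled back; fall through to the slow path

    dirty = []
    for row in rows:
        try:
            with conn:  # one small transaction per record
                conn.execute(INSERT_SQL, row)
        except sqlite3.Error as exc:
            dirty.append((row, str(exc)))
    return dirty


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# (2, None) violates NOT NULL and (1, "dup") violates the primary key.
batch = [(1, "alice"), (2, None), (3, "carol"), (1, "dup")]
dirty = write_batch(conn, batch)
stored = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()

print("stored:", stored)
print("dirty:", [row for row, _ in dirty])
```

The slow path only runs for batches that actually contain dirty records, so clean batches keep the throughput of a single transaction. What to do with the collected dirty rows (log them, write them to a side table, fail after a threshold) is left open here, as it is in the discussion above.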

Welcome to comment.