[DISCUSS][Feature][core] Add dirty data management

apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.

https://seatunnel.apache.org/

Apache License 2.0

[DISCUSS][Feature][core] Add dirty data management #988

Status: Closed (xleoken closed this issue 2 years ago)

xleoken commented 2 years ago

Search before asking

[X] I have searched the existing feature requests and found no similar one.

Description

We may encounter dirty records when transmitting data, so we may need a dirty data management mechanism to handle them. This issue is under discussion; feel free to share your opinions.

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

zhaomin1423 commented 2 years ago

How is this work going? I am interested in it, and I am willing to submit a PR.

xleoken commented 2 years ago

> How is this work going? I am interested in it, and I am willing to submit a PR.

Welcome @zhaomin1423.

zhaomin1423 commented 2 years ago

Dirty data management has two aspects. First, we need to be able to handle records one by one. The database must support transactions, because when a batch that contains a few dirty records is written, the database has to roll the batch back. We can then write that batch record by record to catch the dirty ones. In Spark, we can add a datasource strategy that transforms WriteToDataSourceV2 into an extended WriteToDataSourceV2Exec, so that records are handled one by one and dirty data can be managed. Second, we need to implement a JDBC connector based on the DataSourceV2 API.
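To make the batch-then-fallback idea concrete, here is a minimal sketch in Python. It is not Spark or JDBC code: it uses the standard-library `sqlite3` module as a stand-in for a transactional database, and the `users` table and the `write_with_dirty_handling` function are hypothetical names chosen for illustration only.

```python
import sqlite3


def write_with_dirty_handling(conn, rows):
    """Write rows in one transaction; on failure, retry record by record.

    Returns a list of (row, error message) pairs for the rejected records.
    The `users` table and this function are illustrative assumptions.
    """
    sql = "INSERT INTO users (id, name) VALUES (?, ?)"
    try:
        # Fast path: the whole batch goes into a single transaction.
        with conn:
            conn.executemany(sql, rows)
        return []
    except sqlite3.Error:
        # Leaving the `with conn` block on an error rolls the batch back,
        # so no partially written batch remains in the table.
        pass

    dirty = []
    for row in rows:
        try:
            # Slow path: one transaction per record isolates the bad ones.
            with conn:
                conn.execute(sql, row)
        except sqlite3.Error as exc:
            dirty.append((row, str(exc)))
    return dirty


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Two records are dirty: a NULL name and a duplicate primary key.
batch = [(1, "alice"), (2, None), (3, "carol"), (1, "duplicate id")]
dirty = write_with_dirty_handling(conn, batch)
clean_ids = [r[0] for r in conn.execute("SELECT id FROM users ORDER BY id")]
```

In this sketch the two offending records end up in `dirty`, while the records with ids 1 and 3 are committed. The per-record path is slower, which is why it is only taken after the batch write has failed.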

Welcome to comment.