apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0

[Feature][Transformer] Supporting fake data generation in transformer for sensitive data masking options #7766

YuriyGavrilov closed this issue 1 week ago

YuriyGavrilov commented 1 month ago

Description

Hi all, following the short discussion linked below, I am creating this issue.

https://github.com/apache/seatunnel/discussions/7746

The idea and goal is to read a complete Postgres (or similar) database and sink it into another Postgres instance, masking or generating fake data for sensitive attributes along the way. It is good to know that FakeSource already offers a lot of random generators, but at the moment I do not know whether they can be used inside a transform or not. Another piece of good news is that dynamic compilation is available for completely custom cases.
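
To make the request concrete, the transform I have in mind might be configured roughly like this (a hypothetical sketch only: FakeMask and all of its options are invented names used to illustrate the idea, not existing SeaTunnel plugins or parameters):

transform {
    # hypothetical masking transform, not an existing SeaTunnel plugin
    FakeMask {
        # columns whose real values should be replaced with generated fake values
        columns = ["email", "phone", "full_name"]
        # reuse the FakeSource-style random generators for the replacement values
        fake_mode = "random"
    }
}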

What do you think?

Usage Scenario

Some users may try to use a transform for masking and fake data generation. The real use case is synchronizing data from a production environment to a test environment, with masking options predefined at the user's request.
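
With today's building blocks, the closest approximation I can see is a plain JDBC-to-JDBC job that rewrites the sensitive columns in an SQL transform (a rough sketch only: hosts, credentials, tables and columns are placeholders, and the masking expressions assume the SQL transform supports string functions such as CONCAT and SUBSTRING):

env {
  job.mode = "BATCH"
}

source {
  Jdbc {
    url = "jdbc:postgresql://prod-host:5432/appdb"
    driver = "org.postgresql.Driver"
    user = "reader"
    password = "..."
    query = "select id, email, phone from customers"
    # register the rows under a name the SQL transform can reference
    result_table_name = "customers"
  }
}

transform {
  Sql {
    source_table_name = "customers"
    result_table_name = "customers_masked"
    # keep ids intact, keep only the first character of the email, blank out phones
    query = "select id, CONCAT(SUBSTRING(email, 1, 1), '***') as email, '***' as phone from customers"
  }
}

sink {
  Jdbc {
    url = "jdbc:postgresql://test-host:5432/appdb"
    driver = "org.postgresql.Driver"
    user = "writer"
    password = "..."
    query = "insert into customers (id, email, phone) values (?, ?, ?)"
  }
}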

Related issues

Supporting fake data generation in transformer

Hisoka-X commented 1 month ago

How about supporting a join with a dimension table (a fake source is one type of dimension table)? I think we can extend this requirement to any source, e.g. a join with JDBC:

transform {
    JoinWithSource {
        join_on = "source.id = type_bin.item_id"
        source = [
            Jdbc {
                url = "jdbc:mysql://localhost/test?serverTimezone=GMT%2b8"
                driver = "com.mysql.cj.jdbc.Driver"
                connection_check_timeout_sec = 100
                user = "root"
                password = "123456"
                query = "select * from type_bin"
            }
        ]
    }
}

Or join with a fake source:

transform {
    JoinWithSource {
        join_on = "source.id = fake.c_int"
        source = [
            FakeSource {
                row.num = 5
                schema {
                    fields {
                        c_string = string
                        c_tinyint = tinyint
                        c_smallint = smallint
                        c_int = int
                        c_bigint = bigint
                        c_float = float
                        c_double = double
                    }
                }
            }
        ]
    }
}

Then we can use the SQL transform to keep only the data you want.
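
For example, assuming the join result is registered under a table name the SQL transform can reference (the table and column names below are only illustrative), the follow-up step could look like:

transform {
    Sql {
        # keep the non-sensitive production columns and take the sensitive
        # ones from the fake side of the join
        query = "select id, created_at, c_string as email from joined_output"
    }
}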

Hisoka-X commented 1 month ago

Or join with the SQL transform:

env {
  parallelism = 10
  job.mode = "BATCH"
}
source {
    Jdbc {
        url = "jdbc:mysql://localhost/test?serverTimezone=GMT%2b8"
        driver = "com.mysql.cj.jdbc.Driver"
        connection_check_timeout_sec = 100
        user = "root"
        password = "123456"
        table_path = "testdb.table1"
        query = "select * from testdb.table1"
        split.size = 10000
        # register the JDBC rows under a table name the SQL transform can reference
        result_table_name = "table1"
    }
    FakeSource {
        row.num = 5
        # register the generated rows as table2 for the join below
        result_table_name = "table2"
        schema {
            fields {
                c_string = string
                c_tinyint = tinyint
                c_smallint = smallint
                c_int = int
                c_bigint = bigint
                c_float = float
                c_double = double
            }
        }
    }
}

transform {
    Sql {
        # join the real rows with the generated rows; the fake schema above has no
        # id column, so join on c_int as in the earlier example
        query = "select * from table1 join table2 on table1.id = table2.c_int"
    }
}

sink {
  Console {}
}

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] commented 1 week ago

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.