GreenmaskIO / greenmask

PostgreSQL database anonymization and synthetic data generation tool
https://greenmask.io
Apache License 2.0
1.14k stars, 21 forks

Data Generator Request: RandomCity #144

Open jensenbox opened 4 months ago

jensenbox commented 4 months ago

I know you call them transformers but for some reason in my mind they just seem closer to data generators than something that transforms :)

Anyway, I am working on a table that has the address broken up like:

    address_line_1 character varying(1024),
    address_line_2 character varying(1024),
    city character varying(255),
    postal_code character varying(20),
    region character varying(255),
    country character varying(2),

I can use RealAddress for the address_line_1 and random data for the others but it would be nice to have city be something interesting.

wwoytenko commented 4 months ago

> I know you call them transformers but for some reason in my mind, they just seem closer to data generators than something that transforms :)

Hi! Thank you for your feedback. I will consider the naming, but both names are debatable, so we need to choose the more user-friendly one. Maybe I will hold a vote in the future.

I named it transformer because some transformers change the original data rather than generate new data. For instance, in the latest beta, you can generate a random email while keeping part of the original value:

    - schema: "public"
      name: "account"
      transformers:
        - name: "RandomEmail"
          params:
            column: "email"
            engine: "hash"
            keep_original_domain: true
            local_part_template: "{{ first_name | lower }}.{{ last_name | lower }}.{{ .random_string | trunc 10 }}"
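
To illustrate the idea behind that configuration, here is a minimal Python sketch of a hash-based email transformation that keeps the original domain. It is not Greenmask's implementation: the function name `pseudonymize_email`, the `salt` parameter, and the use of SHA-256 are assumptions made only for this example.

```python
import hashlib


def pseudonymize_email(original: str, first_name: str, last_name: str,
                       salt: bytes = b"demo-salt") -> str:
    """Build a new local part from a hash of the original value, keeping the domain.

    Hypothetical helper for illustration only; not part of Greenmask.
    """
    # Split on the last "@" so the original domain can be reused unchanged.
    _local, _, domain = original.rpartition("@")
    # Hash-based "randomness": the same input always produces the same output.
    digest = hashlib.sha256(salt + original.encode("utf-8")).hexdigest()
    random_string = digest[:10]  # plays the role of `trunc 10` in the template
    new_local = f"{first_name.lower()}.{last_name.lower()}.{random_string}"
    return f"{new_local}@{domain}"


email = pseudonymize_email("Ada.L@example.com", "Ada", "Lovelace")
print(email)
```

Because the suffix comes from a hash of the original value, running the sketch twice on the same input gives the same address, which is what makes a value a transformation of the original rather than a freshly generated one.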

wwoytenko commented 4 months ago

> I can use RealAddress for the address_line_1 and random data for the others but it would be nice to have city be something interesting.

Good point, I agree. I will try to make RealAddress more useful based on your feedback.

I also have an idea: people could provide their own addresses, or any other dataset, for instance as JSON. Greenmask would then map that data to the columns. For instance:

    - schema: "public"
      name: "account_address"
      transformers:
        - name: "RandomDataFromFile"
          params:
            file: "/path/to/your/db.json"
            columns:
              - name: "address_line_1"
                value: "{{ db.address_line_1 }}"
              - name: "city"
                value: "{{ db.city }}"

And the file might look something like this:

    [
      {
        "address_line_1": "val1",
        "address_line_2": "val2",
        "city": "val3",
        "postal_code": "val4",
        "region": "val5",
        "country": "val6"
      }
    ]

Why this way? I think this could be used not only for addresses but for many purposes, allowing users to define their own functional dependencies between attributes in the data they provide.
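
A rough Python sketch of how such a file-backed transformer could behave: one whole record is chosen per row, so the replaced columns stay consistent with each other. This is only an illustration of the proposal, not existing Greenmask behavior. The helper names `pick_record` and `transform_row`, the `salt`, and the placeholder records in `DB_JSON` are all assumptions.

```python
import hashlib
import json

# Stand-in for the contents of the proposed data file; values are placeholders.
DB_JSON = """
[
  {"address_line_1": "line1-a", "city": "city-a", "country": "AA"},
  {"address_line_1": "line1-b", "city": "city-b", "country": "BB"}
]
"""


def pick_record(records, original_value, salt=b"demo-salt"):
    """Choose one whole record deterministically from the original value."""
    digest = hashlib.sha256(salt + original_value.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(records)
    return records[index]


def transform_row(row, records, columns, key_column):
    """Overwrite the listed columns with fields taken from a single record."""
    record = pick_record(records, str(row[key_column]))
    new_row = dict(row)  # leave the input row untouched
    for column in columns:
        new_row[column] = record[column]
    return new_row


records = json.loads(DB_JSON)
row = {"id": 7, "address_line_1": "original street", "city": "original city", "country": "XX"}
columns = ["address_line_1", "city", "country"]
masked = transform_row(row, records, columns, key_column="address_line_1")
print(masked)
```

Because every replaced column is read from the same record, a city never ends up paired with a country from a different entry, which is the kind of functional dependency the file is meant to preserve.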

jensenbox commented 4 months ago

I was actually thinking the same thing when I wrote it; I see both sides for sure. There are data generators and data transformers (or mutators). When I thought about how the documentation would be written, it did not make sense to split them into two sections either, so there should be one good name that covers both.

I asked the AI God what it thought: [image]

jensenbox commented 4 months ago

For ease of use, you could even replace the file with a YAML array of values. They would of course have to evaluate down to strings, but with YAML anchors you could reuse the array in other parts of the configuration file.