alan-turing-institute / sqlsynthgen

Synthetic data for SQL databases
MIT License
11 stars 0 forks source link

Add option to derive column value from src dataset (marginals) #24

Closed Iain-S closed 1 year ago

Iain-S commented 1 year ago

The user may want to use the distribution of, for example, src.patient.sex to derive the dst.patient.sex value. We could add noise to existing data (anonymisation) or sample from a distribution (synthetic data). Either way, the amount of noise should be customisable.

Longer term, it would be nice to specify an epsilon or privacy budget in some form. For now, we shall simply:

Iain-S commented 1 year ago

~Todo:~

~This is because the user might be sent a generators file by someone else and may not have access to the src db directly.~

Moved to #53

Iain-S commented 1 year ago

From the discussion this afternoon, we agreed that

mhauru commented 1 year ago

How about the following design? In config.yaml one can specify various statistics to be computed from the source DB, and how much noise needs to be added to them. The outputs of that procedure, the noised statistical descriptors, are available to be used as values for arguments into generators in the part of config.yaml that specifies custom generators. As make-generators is run, it queries the src db for those statistics, adds noise, writes the result to another YAML file somewhere called e.g. src_stats.yaml, and then the generation of the ssg.py file proceeds as usual, except the values from src_stats.yaml are plugged into the arguments of various generators as specified in config.yaml.

The reasons I would like to have the statistics derived from the source data stored in this intermediate src_stats.yaml file are

I'm imagining a config.yaml that could look like the following. This is a modification of the current config YAML in our tests, where I've marked the new parts.

src-stats:   # This block is NEW
  - name: mean_age
    query: SELECT avg(age) FROM person
    noise-level: 0.1
  - name: earliest_birthdate
    query: SELECT min(date-of-birth) FROM person
    noise-level: 0.07
custom_generators_module: custom_generators
tables:
  person:
    num_rows_per_pass: 2
    vocabulary_table: false
    custom_generators:
      - name: generic.person.full_name
        args: null
        columns_assigned: name
      - name: poisson_distribution   # This item is NEW
        args:
          mean: src_stats.mean_age
        columns_assigned: age
      - name: generic.datetime.datetime   # This item is NEW
        args:
          earliest: src_stats.earliest_birthdate
        columns_assigned: date_of_birth
      - name: generic.datetime.datetime
        args:
          start: 2022
          end: 2022
        columns_assigned: stored_from

  hospital_visit:
    num_rows_per_pass: 3
    custom_generators:
      - name: custom_generators.timespan_generator
        args:
          generic: generic
          earliest_start_year: 2021
          last_start_year: 2022
          min_dt_days: 1
          max_dt_days: 30
        columns_assigned:
          - visit_start
          - visit_end
          - visit_duration_seconds
Iain-S commented 1 year ago

It hadn't occurred to me to have the stats in their own file but it makes sense. So, if we wanted to specify in config.yml that the valid values of "smoker" are "smoker", "non-smoker" and "unknown", what do we put for

tables:
    demographics:
        xy: yz

I presume that stats.yml is written during make-generators and read during create-data? In which case, we have some kind of producer that reads from the yml file?

class MyTableGenerator():
    def __init__():
        ...
        smoker = CategoricalStatsProducer("stats.yml")
    ...
mhauru commented 1 year ago

It hadn't occurred to me to have the stats in their own file but it makes sense. So, if we wanted to specify in config.yml that the valid values of "smoker" are "smoker", "non-smoker" and "unknown", what do we put for [...]

Maybe the valid values should be specified in the src-stats block? I feel like it's more a property of the source data and the statistical descriptors computed from that data rather than a property of the table where the descriptors are used. However, I don't know exactly how to include it in a block like

src-stats:
  - name: mean_age
    query: SELECT avg(age) FROM person
    noise-level: 0.1

Might have to change the design of that block? The restriction to having to specify a SQL query may be too strict.

I presume that stats.yml is written during make-generators and read during create-data? In which case, we have some kind of producer that reads from the yml file?

Yes to the first question. I had in mind just hard-coding the same number in ssg.py and stats.yaml, but what you propose would be nicer.