Add option to derive column value from src dataset (marginals)

Iain-S commented 1 year ago

The user may want to use the distribution of, for example, src.patient.sex to derive the dst.patient.sex value. We could add noise to existing data (anonymisation) or sample from a distribution (synthetic data). Either way, the amount of noise should be customisable.

Longer term, it would be nice to specify an epsilon or privacy budget in some form. For now, we shall simply:

Add a convenient way to specify that the source data should be used to populate the target column.

Iain-S commented 1 year ago

~Todo:~

~Make the src db connection params optional at runtime~

~This is because the user might be sent a generators file by someone else and may not have access to the src db directly.~

Moved to #53

Iain-S commented 1 year ago

From the discussion this afternoon, we agreed that

We already support (entirely) random categorical variables if, for example, the user specifies random.choice(["M", "F", "F"]) as the producer for the gender column.
To support adding noise for DP, we will need to:
- [x] Add a way to specify that a given column needs to be generated differently (in progress with Markus on another branch), the noise and the domain.
- [ ] Have a way to actually add the noise specified to the data in the src database and...
- [ ] Write the results as a generator to the ssg.py output file (which may be in the final form of a call to random.choices)

mhauru commented 1 year ago

How about the following design? In config.yaml one can specify various statistics to be computed from the source DB, and how much noise needs to be added to them. The outputs of that procedure, the noised statistical descriptors, are available to be used as values for arguments into generators in the part of config.yaml that specifies custom generators. As make-generators is run, it queries the src db for those statistics, adds noise, writes the result to another YAML file somewhere called e.g. src_stats.yaml, and then the generation of the ssg.py file proceeds as usual, except the values from src_stats.yaml are plugged into the arguments of various generators as specified in config.yaml.

The reasons I would like to have the statistics derived from the source data stored in this intermediate src_stats.yaml file are

It makes the privacy analysis easier when you have a single file that holds all the data that has been derived from the original data and that might get published in the future. If we are fine publishing src_stats.yaml, we are fine publishing ssg.py.
We could also introduce a new command that only generates src_stats.yaml without running the rest of ssg.py generation, and then conversely have another command that generates ssg.py using a given src_stats.yaml without regenerating the latter. This might make life easier if accessing the source data is hard/restricted.

I'm imagining a config.yaml that could look like the following. This is a modification of the current config YAML in our tests, where I've marked the new parts.

src-stats:   # This block is NEW
  - name: mean_age
    query: SELECT avg(age) FROM person
    noise-level: 0.1
  - name: earliest_birthdate
    query: SELECT min(date-of-birth) FROM person
    noise-level: 0.07
custom_generators_module: custom_generators
tables:
  person:
    num_rows_per_pass: 2
    vocabulary_table: false
    custom_generators:
      - name: generic.person.full_name
        args: null
        columns_assigned: name
      - name: poisson_distribution   # This item is NEW
        args:
          mean: src_stats.mean_age
        columns_assigned: age
      - name: generic.datetime.datetime   # This item is NEW
        args:
          earliest: src_stats.earliest_birthdate
        columns_assigned: date_of_birth
      - name: generic.datetime.datetime
        args:
          start: 2022
          end: 2022
        columns_assigned: stored_from

  hospital_visit:
    num_rows_per_pass: 3
    custom_generators:
      - name: custom_generators.timespan_generator
        args:
          generic: generic
          earliest_start_year: 2021
          last_start_year: 2022
          min_dt_days: 1
          max_dt_days: 30
        columns_assigned:
          - visit_start
          - visit_end
          - visit_duration_seconds

Iain-S commented 1 year ago

It hadn't occurred to me to have the stats in their own file but it makes sense. So, if we wanted to specify in config.yml that the valid values of "smoker" are "smoker", "non-smoker" and "unknown", what do we put for

tables:
    demographics:
        xy: yz

I presume that stats.yml is written during make-generators and read during create-data? In which case, we have some kind of producer that reads from the yml file?

class MyTableGenerator():
    def __init__():
        ...
        smoker = CategoricalStatsProducer("stats.yml")
    ...

mhauru commented 1 year ago

It hadn't occurred to me to have the stats in their own file but it makes sense. So, if we wanted to specify in config.yml that the valid values of "smoker" are "smoker", "non-smoker" and "unknown", what do we put for [...]

Maybe the valid values should be specified in the src-stats block? I feel like it's more a property of the source data and the statistical descriptors computed from that data rather than a property of the table where the descriptors are used. However, I don't know exactly how to include it in a block like

src-stats:
  - name: mean_age
    query: SELECT avg(age) FROM person
    noise-level: 0.1

Might have to change the design of that block? The restriction to having to specify a SQL query may be too strict.

I presume that stats.yml is written during make-generators and read during create-data? In which case, we have some kind of producer that reads from the yml file?

Yes to the first question. I had in mind just hard-coding the same number in ssg.py and stats.yaml, but what you propose would be nicer.

alan-turing-institute / sqlsynthgen

Add option to derive column value from src dataset (marginals) #24