Closed Iain-S closed 1 year ago
~Todo:~
~This is because the user might be sent a generators file by someone else and may not have access to the src db directly.~
Moved to #53
From the discussion this afternoon, we agreed that
random.choice(["M", "F", "F"])
as the producer for the gender column.How about the following design? In config.yaml
one can specify various statistics to be computed from the source DB, and how much noise needs to be added to them. The outputs of that procedure, the noised statistical descriptors, are available to be used as values for arguments into generators in the part of config.yaml
that specifies custom generators. As make-generators
is run, it queries the src db for those statistics, adds noise, writes the result to another YAML file somewhere called e.g. src_stats.yaml
, and then the generation of the ssg.py
file proceeds as usual, except the values from src_stats.yaml
are plugged into the arguments of various generators as specified in config.yaml
.
The reasons I would like to have the statistics derived from the source data stored in this intermediate src_stats.yaml
file are
src_stats.yaml
, we are fine publishing ssg.py
.src_stats.yaml
without running the rest of ssg.py
generation, and then conversely have another command that generates ssg.py
using a given src_stats.yaml
without regenerating the latter. This might make life easier if accessing the source data is hard/restricted.I'm imagining a config.yaml
that could look like the following. This is a modification of the current config YAML in our tests, where I've marked the new parts.
src-stats: # This block is NEW
- name: mean_age
query: SELECT avg(age) FROM person
noise-level: 0.1
- name: earliest_birthdate
query: SELECT min(date-of-birth) FROM person
noise-level: 0.07
custom_generators_module: custom_generators
tables:
person:
num_rows_per_pass: 2
vocabulary_table: false
custom_generators:
- name: generic.person.full_name
args: null
columns_assigned: name
- name: poisson_distribution # This item is NEW
args:
mean: src_stats.mean_age
columns_assigned: age
- name: generic.datetime.datetime # This item is NEW
args:
earliest: src_stats.earliest_birthdate
columns_assigned: date_of_birth
- name: generic.datetime.datetime
args:
start: 2022
end: 2022
columns_assigned: stored_from
hospital_visit:
num_rows_per_pass: 3
custom_generators:
- name: custom_generators.timespan_generator
args:
generic: generic
earliest_start_year: 2021
last_start_year: 2022
min_dt_days: 1
max_dt_days: 30
columns_assigned:
- visit_start
- visit_end
- visit_duration_seconds
It hadn't occurred to me to have the stats in their own file but it makes sense. So, if we wanted to specify in config.yml that the valid values of "smoker" are "smoker", "non-smoker" and "unknown", what do we put for
tables:
demographics:
xy: yz
I presume that stats.yml is written during make-generators
and read during create-data
? In which case, we have some kind of producer that reads from the yml file?
class MyTableGenerator():
def __init__():
...
smoker = CategoricalStatsProducer("stats.yml")
...
It hadn't occurred to me to have the stats in their own file but it makes sense. So, if we wanted to specify in config.yml that the valid values of "smoker" are "smoker", "non-smoker" and "unknown", what do we put for [...]
Maybe the valid values should be specified in the src-stats
block? I feel like it's more a property of the source data and the statistical descriptors computed from that data rather than a property of the table where the descriptors are used. However, I don't know exactly how to include it in a block like
src-stats:
- name: mean_age
query: SELECT avg(age) FROM person
noise-level: 0.1
Might have to change the design of that block? The restriction to having to specify a SQL query may be too strict.
I presume that stats.yml is written during make-generators and read during create-data? In which case, we have some kind of producer that reads from the yml file?
Yes to the first question. I had in mind just hard-coding the same number in ssg.py
and stats.yaml
, but what you propose would be nicer.
The user may want to use the distribution of, for example,
src.patient.sex
to derive thedst.patient.sex
value. We could add noise to existing data (anonymisation) or sample from a distribution (synthetic data). Either way, the amount of noise should be customisable.Longer term, it would be nice to specify an epsilon or privacy budget in some form. For now, we shall simply: