Adapt methods to use data with the standard input format (the generated data uses this format)

crangelsmith commented 4 years ago

The aim is to adapt at least one method so that it takes each dataset in a standard format (csv + json) and emits synthetic data to a specified location.

Requirements

[ ] In synth-method, make a subdirectory for each method
[ ] Write a program that takes three command-line arguments:
- the path to a csv file, containing the data
- the path to a json file, containing any information about the data (e.g. discrete vs continuous columns)
- the path to another json file, containing the parameters required for the synthesis: e.g. how many points to sample, a unique name synth-name for the synthesis, and any method-specific parameters (such as class type for microsimulation-based methods), which other methods may ignore
[ ] Write output to synthetic-output/dataset-name/method-name:
- a csv file with the synthetic data
- a json file with the metadata
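A minimal sketch of what such a program could look like, using only the Python standard library. The function names, the num_samples parameter key, taking dataset-name from the csv file name, and the contents of the output metadata are all assumptions made for illustration; the "synthesis" step is a placeholder that copies rows and does not stand for any real method.

```python
import argparse
import csv
import json
from pathlib import Path


def parse_args(argv=None):
    """Parse the three command-line arguments listed in the requirements."""
    parser = argparse.ArgumentParser(description="Run one synthesis method.")
    parser.add_argument("data_csv", type=Path, help="csv file containing the data")
    parser.add_argument("data_json", type=Path, help="json file describing the data")
    parser.add_argument("params_json", type=Path, help="json file with synthesis parameters")
    return parser.parse_args(argv)


def run(args, method_name="example-method", root="synthetic-output"):
    """Read the inputs, 'synthesise', and write csv + json to root/dataset-name/method-name."""
    with open(args.data_csv, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        rows = list(reader)
    with open(args.data_json) as f:
        data_info = json.load(f)
    with open(args.params_json) as f:
        params = json.load(f)

    # Placeholder for a real method: keep the first num_samples rows.
    # "num_samples" is a hypothetical key name; "synth-name" follows the issue text.
    num_samples = params.get("num_samples", len(rows))
    synthetic = rows[:num_samples]
    synth_name = params.get("synth-name", "synth")

    # Assumption: dataset-name is the csv file name without its extension.
    out_dir = Path(root) / args.data_csv.stem / method_name
    out_dir.mkdir(parents=True, exist_ok=True)

    with open(out_dir / f"{synth_name}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(synthetic)

    # Hypothetical output metadata: just enough to trace how the file was made.
    metadata = {
        "synth-name": synth_name,
        "method": method_name,
        "num_rows": len(synthetic),
        "data_description": data_info,
        "parameters": params,
    }
    with open(out_dir / f"{synth_name}.json", "w") as f:
        json.dump(metadata, f, indent=2)
    return out_dir
```

As a script, the entry point would be run(parse_args()), invoked for example as python method_main.py data.csv data.json params.json (the script name is hypothetical).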

crangelsmith commented 4 years ago

De-identified, multivariate correlated dataset (original): https://github.com/crangelsmith/synthetic-data-tutorial/blob/master/data/hospital_ae_data_deidentify.csv

Json:

attribute_to_datatype = {
    'Time in A&E (mins)': 'Integer',
    'Treatment': 'String',
    'Gender': 'String',
    'Index of Multiple Deprivation Decile': 'Integer',
    'Hospital ID': 'String',
    'Arrival Date': 'String',
    'Arrival hour range': 'String',
    'Age bracket': 'String'
}

attribute_is_categorical = {
    'Hospital ID': True,
    'Time in A&E (mins)': False,
    'Treatment': True,
    'Gender': True,
    'Index of Multiple Deprivation Decile': False,
    'Arrival Date': True,
    'Arrival hour range': True,
    'Age bracket': True
}
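The two dictionaries above are Python literals (single quotes, True/False), so they are not valid json as written. A possible way to turn them into a json document is sketched below; nesting both dictionaries under their own names in a single file is an assumption, not an agreed format.

```python
import json

attribute_to_datatype = {
    'Time in A&E (mins)': 'Integer',
    'Treatment': 'String',
    'Gender': 'String',
    'Index of Multiple Deprivation Decile': 'Integer',
    'Hospital ID': 'String',
    'Arrival Date': 'String',
    'Arrival hour range': 'String',
    'Age bracket': 'String',
}

attribute_is_categorical = {
    'Hospital ID': True,
    'Time in A&E (mins)': False,
    'Treatment': True,
    'Gender': True,
    'Index of Multiple Deprivation Decile': False,
    'Arrival Date': True,
    'Arrival hour range': True,
    'Age bracket': True,
}


def describe_dataset(datatypes, categorical):
    """Serialise both dictionaries to one json string (hypothetical layout)."""
    # Both dictionaries should describe exactly the same columns.
    if set(datatypes) != set(categorical):
        raise ValueError("the two dictionaries describe different columns")
    description = {
        "attribute_to_datatype": datatypes,
        "attribute_is_categorical": categorical,
    }
    # json.dumps converts True/False to true/false and uses double quotes.
    return json.dumps(description, indent=2)


description_json = describe_dataset(attribute_to_datatype, attribute_is_categorical)
```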

gmingas commented 4 years ago

Kasra and I refactored the synthesis code pipeline under CTGAN (previously the main functions were contained in utils.py). We also added extra functionality, checks for correct user input, and verbose output. The implementation is now in ctgan_main.py. The proposed class structure for the different synthesis methods is:

A Base class which contains:
- A read_data function which reads the input data and the .json metadata (both could come from the generation script or from some other source) and returns them. It takes the filepaths for the data and metadata and the synth_name as arguments. Examples of the two input files can be found under CTGAN/tests/data
- The empty functions fit_synthesizer and synthesize.
Child classes inheriting from Base. Each one implements a different synthesis method. Each class needs to contain:
- A read_data function which can call read_data from Base and then can apply extra method-specific data pre-processing if needed.
- An implementation for fit_synthesizer: This calls _read_data and then fits the model to the data using the chosen method (in our case CTGAN). It uses a .json file containing method-specific tuning parameters (example under CTGAN/tests/parameters). It returns the fitted model and also stores it within the object.
- An implementation for synthesize: This takes the fitted model and uses it to synthesise the final dataset. This might also use some of the parameter metadata from the .json file, but this needs to be discussed. It returns the final synthetic dataset and can optionally write the dataset to disk. At the moment it does not return output metadata.
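The class structure described above could be sketched roughly as follows. This is not the code in ctgan_main.py: the class names SynthesizerBase and ResamplingSynthesizer, the argument lists, and passing the tuning parameters as an already-loaded dictionary are assumptions, and the child class resamples the input rows as a toy stand-in for fitting CTGAN.

```python
import csv
import json
import random


class SynthesizerBase:
    """Base class: shared input reading plus the two methods children must fill in."""

    def __init__(self):
        self.model = None
        self.synth_name = None

    def read_data(self, data_path, metadata_path, synth_name):
        # Read the csv data and the json metadata and return both.
        with open(data_path, newline="") as f:
            data = list(csv.DictReader(f))
        with open(metadata_path) as f:
            metadata = json.load(f)
        self.synth_name = synth_name
        return data, metadata

    def fit_synthesizer(self, *args, **kwargs):
        # Left empty in the proposal; raising here makes a missing override obvious.
        raise NotImplementedError

    def synthesize(self, *args, **kwargs):
        raise NotImplementedError


class ResamplingSynthesizer(SynthesizerBase):
    """Toy child class: 'fits' by storing the rows, synthesises by resampling them."""

    def read_data(self, data_path, metadata_path, synth_name):
        data, metadata = super().read_data(data_path, metadata_path, synth_name)
        # Method-specific pre-processing goes here; this toy drops all-empty rows.
        data = [row for row in data if any(row.values())]
        return data, metadata

    def fit_synthesizer(self, data_path, metadata_path, synth_name, parameters):
        data, _metadata = self.read_data(data_path, metadata_path, synth_name)
        # The "model" is just the rows plus a seed taken from the parameters.
        self.model = {
            "rows": data,
            "columns": list(data[0].keys()) if data else [],
            "seed": parameters.get("seed", 0),
        }
        return self.model

    def synthesize(self, num_samples, output_path=None):
        if self.model is None:
            raise RuntimeError("call fit_synthesizer before synthesize")
        rng = random.Random(self.model["seed"])
        # Sample rows with replacement from the stored data.
        synthetic = rng.choices(self.model["rows"], k=num_samples)
        if output_path is not None:
            with open(output_path, "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=self.model["columns"])
                writer.writeheader()
                writer.writerows(synthetic)
        return synthetic
```

A real child class would replace the body of fit_synthesizer with the fitting call of the chosen method and have synthesize sample from the fitted model.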

To run the code for the CTGAN case, go to the CTGAN sub-directory and run the following in bash (note that the test code will eventually be removed from this file):

> python ctgan_main.py

You need to run pip install ctgan if you do not have the library installed.

gmingas commented 4 years ago

Photo of the whiteboard from yesterday's meeting where we discussed possible directory structures and .json file formats

alan-turing-institute / QUIPP-collab

Adapt methods to use data with the standard input format (the generated data uses this format) #59
