alan-turing-institute / QUIPP-collab

Collaboration on the QUIPP project

Adapt methods to use data with the standard input format (the generated data uses this format) #59

Closed crangelsmith closed 4 years ago

crangelsmith commented 4 years ago

The aim is to adapt at least one method so that it takes each dataset in the standard format (csv + json) and writes the synthetic data to a specified location.
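A minimal sketch of what that contract could look like, using only the standard library. The function names (`run_synthesis`, `resample_columns`), the output file name `synthetic_data.csv`, and the shape of the JSON metadata are assumptions for illustration, not the project's actual interface. The placeholder synthesizer just resamples each column independently, which ignores correlations and is not CTGAN; a real method would be plugged in through the `synthesize` argument.

```python
import csv
import json
import random
from pathlib import Path


def resample_columns(rows, metadata, n_rows, seed=0):
    """Placeholder synthesizer (hypothetical): draws each column value
    independently from the observed values. Ignores correlations; NOT CTGAN."""
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    return [
        {col: rng.choice(rows)[col] for col in columns}
        for _ in range(n_rows)
    ]


def run_synthesis(csv_path, json_path, output_dir,
                  synthesize=resample_columns, n_rows=None):
    """Read a dataset (csv) and its metadata (json), then write synthetic
    data to output_dir. Returns the path of the written file."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        raise ValueError(f"{csv_path} contains no data rows")

    with open(json_path) as f:
        metadata = json.load(f)

    # Default to producing as many synthetic rows as there are input rows.
    synthetic = synthesize(rows, metadata, n_rows or len(rows))

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    out_path = output_dir / "synthetic_data.csv"  # assumed file name
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(synthetic)
    return out_path
```

Keeping the file handling in one function and passing the method in as a callable means each synthesis method only has to agree on the in-memory representation, not on paths or file names.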

Requirements

crangelsmith commented 4 years ago

De-identified, multivariate correlated dataset (original): https://github.com/crangelsmith/synthetic-data-tutorial/blob/master/data/hospital_ae_data_deidentify.csv

Json:

attribute_to_datatype = {
    'Time in A&E (mins)': 'Integer',
    'Treatment': 'String',
    'Gender': 'String',
    'Index of Multiple Deprivation Decile': 'Integer',
    'Hospital ID': 'String',
    'Arrival Date': 'String',
    'Arrival hour range': 'String',
    'Age bracket': 'String'
}

attribute_is_categorical = {
    'Hospital ID': True,
    'Time in A&E (mins)': False,
    'Treatment': True,
    'Gender': True,
    'Index of Multiple Deprivation Decile': False,
    'Arrival Date': True,
    'Arrival hour range': True,
    'Age bracket': True
}
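The two dictionaries above are Python literals rather than JSON. One way they could be merged into a single JSON document is sketched below. The `columns` / `name` / `type` / `categorical` layout and the `build_metadata` helper are hypothetical, since the thread does not fix a schema. The sketch also derives the list of categorical column names, which is the kind of information a method needs in order to treat those columns as discrete.

```python
import json

attribute_to_datatype = {
    'Time in A&E (mins)': 'Integer',
    'Treatment': 'String',
    'Gender': 'String',
    'Index of Multiple Deprivation Decile': 'Integer',
    'Hospital ID': 'String',
    'Arrival Date': 'String',
    'Arrival hour range': 'String',
    'Age bracket': 'String',
}

attribute_is_categorical = {
    'Hospital ID': True,
    'Time in A&E (mins)': False,
    'Treatment': True,
    'Gender': True,
    'Index of Multiple Deprivation Decile': False,
    'Arrival Date': True,
    'Arrival hour range': True,
    'Age bracket': True,
}


def build_metadata(datatypes, categorical):
    """Merge the two dictionaries into one JSON-serialisable structure.
    The schema used here is an assumption, not the project's format."""
    if set(datatypes) != set(categorical):
        mismatch = set(datatypes) ^ set(categorical)
        raise ValueError(f"Columns missing from one dictionary: {sorted(mismatch)}")
    return {
        "columns": [
            {"name": name, "type": datatypes[name], "categorical": categorical[name]}
            for name in datatypes
        ]
    }


metadata = build_metadata(attribute_to_datatype, attribute_is_categorical)
metadata_json = json.dumps(metadata, indent=2)  # text to store next to the csv

# Names of the columns flagged as categorical, in datatype-dictionary order.
discrete_columns = [c["name"] for c in metadata["columns"] if c["categorical"]]
```

Checking that both dictionaries cover the same columns catches a metadata file that has drifted out of sync with the csv header before any model is trained.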

gmingas commented 4 years ago

Kasra and I refactored the synthesis pipeline code under CTGAN (previously the main functions were in utils.py). We also added extra functionality, checks for correct user input, and verbosity. The implementation is now in ctgan_main.py. The proposed class structure for different synthesis methods is:

To run the code for the CTGAN case, go to the CTGAN sub-directory and run the following in bash (note that the test code will eventually be removed from this file):

> python ctgan_main.py

You need to run pip install ctgan if you do not have the library installed.

gmingas commented 4 years ago

Photo of the whiteboard from yesterday's meeting, where we discussed possible directory structures and .json file formats:

image