datasciencecampus / synthetic-data

Repo on generating synthetic data using GAN

Develop Test plan #9

Open SharonHill opened 6 years ago

SharonHill commented 6 years ago

This should be reviewed by, and include input from, Methodology.

Yiannis20 commented 6 years ago

Synthetic data quality assessment methodology

Our initial testing strategy will be based on the method proposed in [1]. To assess the quality of the generated synthetic data, we will proceed as follows:

  1. Compute the pairwise Pearson correlations between the variables in the real data.
  2. Compute the pairwise Pearson correlations between the variables in the synthetic data.
  3. Investigate whether the Pearson correlation structure of the real data is closely reflected by the correlation structure of the synthetic data.
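
The three steps above could be sketched roughly as follows. This is only an illustration, not the method from [1] or code from this repo: the function names `pearson_matrix` and `compare_correlation_structure`, the choice of summary statistics (maximum and mean absolute difference), and the toy Gaussian data are all assumptions made for the example.

```python
import numpy as np


def pearson_matrix(data):
    """Pairwise Pearson correlations between the columns (variables) of a 2-D array."""
    return np.corrcoef(np.asarray(data, dtype=float), rowvar=False)


def compare_correlation_structure(real, synthetic):
    """Hypothetical helper: compare the correlation matrices of two datasets.

    Both inputs are arrays of shape (n_samples, n_variables) with the same
    variables in the same column order.
    """
    real_corr = pearson_matrix(real)        # step 1
    synth_corr = pearson_matrix(synthetic)  # step 2
    diff = synth_corr - real_corr           # step 3: element-wise discrepancy

    # The diagonal is always 1 and the matrix is symmetric, so only the
    # strictly upper triangle carries information.
    upper = np.triu_indices_from(diff, k=1)
    return {
        "real_corr": real_corr,
        "synthetic_corr": synth_corr,
        "max_abs_diff": float(np.max(np.abs(diff[upper]))),
        "mean_abs_diff": float(np.mean(np.abs(diff[upper]))),
    }


# Toy data (assumption): both samples are drawn from the same Gaussian,
# standing in for "real" data and a well-behaved generator.
rng = np.random.default_rng(0)
cov = [[1.0, 0.8, 0.0],
       [0.8, 1.0, 0.3],
       [0.0, 0.3, 1.0]]
real = rng.multivariate_normal(np.zeros(3), cov, size=5000)
synthetic = rng.multivariate_normal(np.zeros(3), cov, size=5000)

result = compare_correlation_structure(real, synthetic)
print("max |difference|: ", result["max_abs_diff"])
print("mean |difference|:", result["mean_abs_diff"])
```

What counts as "closely reflected" still needs to be agreed; any threshold on these differences would be a project decision rather than something the sketch prescribes.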

An output of the proposed testing methodology will be a correlation matrix similar to the one described in [1] (Fig. 1).

Fig. 1: The correlation matrix constructed in [1] as part of their synthetic data quality assessment methodology.

[1] Brett K. Beaulieu-Jones, Zhiwei Steven Wu, Chris Williams, Ran Lee, Sanjeev P Bhavnani, James Brian Byrd, Casey S. Greene. Privacy-preserving generative deep neural networks support clinical data sharing, bioRxiv 159756; doi: https://doi.org/10.1101/159756