
MIT License

Automating synthetic data creation & reporting using Microsoft Fabric

Table of Contents

  • General Information
  • Technologies Used
  • Demo Instructions
  • Setup
  • Usage
  • Room for Improvement
  • Acknowledgements

General Information

When considering privacy protection in the context of an outsourcing company, synthetic data generation becomes particularly relevant due to the sensitive nature of the data involved. Outsourcing companies often handle data from clients that may contain personally identifiable information (PII), financial records, proprietary business information, and other sensitive data.

Here's how synthetic data generation can enhance privacy protection for outsourcing companies:

  1. Compliance with Regulations: Outsourcing companies are often subject to strict data privacy regulations such as GDPR (in the European Union), HIPAA (in healthcare), or CCPA (in California). These regulations impose stringent requirements on how personal data should be handled, processed, and stored. By using synthetic data, outsourcing companies can reduce the need to work directly with real, sensitive data while still complying with regulatory requirements.

  2. Minimization of Data Exposure: Handling real data increases the risk of data breaches and unauthorized access. Even with robust security measures in place, the potential for data leaks remains a concern. Synthetic data generation allows outsourcing companies to minimize the exposure of actual sensitive data by creating realistic yet entirely synthetic datasets for tasks like software development, testing, and analytics.

  3. Secure Collaboration: Outsourcing often involves collaboration with external partners, vendors, or remote teams. Sharing sensitive data with these parties increases the risk of data misuse or breaches. Synthetic data provides a secure alternative for collaboration, as it can be freely shared with external stakeholders without compromising the confidentiality of the original data.

  4. Data Masking and Anonymization: Even within the outsourcing company itself, access to sensitive data may need to be restricted to specific roles or individuals. Synthetic data can be used to mask or anonymize real data, allowing employees who do not require access to sensitive information to work with realistic but non-sensitive datasets. This reduces the risk of internal data breaches and unauthorized access.

  5. Ethical Considerations: In addition to legal compliance, outsourcing companies often have ethical responsibilities to protect the privacy and confidentiality of their clients' data. Synthetic data generation aligns with these ethical considerations by providing a way to fulfill business objectives without compromising individual privacy rights.

By leveraging synthetic data generation techniques, outsourcing companies can effectively manage and mitigate the privacy risks associated with handling sensitive data, thereby building trust with clients, enhancing data security, and ensuring regulatory compliance.

Technologies Used

Demo Instructions

  1. Original Data: The first step is to add the tables to the Lakehouse. We then analyze the tables, taking into account the data type of each column, and save the resulting statistics to a CSV file.

  2. Synthetic Data: Based on the real dataset, we aim to create a synthetic one that preserves its underlying structure and reproduces its statistical properties. With this goal in mind, we propose a solution that, given a lakehouse with multiple tabular datasets, trains a custom Conditional Tabular Generative Adversarial Network (CTGAN) on each real dataset and then uses it to sample a synthetic one. The hyperparameter configuration used in the training process is provided by OpenAI, based on the size of the target table. The resulting synthetic data is then compared with the real data, and the comparison can be visualized in the reporting folder for each specific table.

  3. Summary Synthetic Data: After the synthetic data is created, we analyze it again and save the resulting statistics to a CSV file.

  4. Decision Tree Classifier & Regressor: As a minimum viable product, we have implemented a Decision Tree Classifier and Regressor to predict the target column of the synthetic data. The model is trained and then used for prediction; at the same time, the variable importances are calculated and, together with the model metrics, passed to the report.

  5. Report: Then we create a report using the synthetic data and the real data. The report is created in the following steps:

    • Data Distribution: We compare the distributions of the synthetic and the real data: the mean, standard deviation, minimum, maximum, and quantiles.
    • Data Correlation: We compare the correlation matrices of the synthetic and the real data.
    • Data Visualization: We visualize the synthetic and the real data using histograms, scatter plots, and box plots.

All the steps (including parameter settings) are implemented in the mainflow.

pipeline

Setup

variables

foreachloop

Now we take a closer look at each activity and its settings:


This activity receives its parameters through the dict_configurations value.

Synthetic_data_generation : This activity is used to generate the synthetic data. In its settings, point it to the notebook Synthetic_data_generation.


This activity also receives its parameters through the dict_configurations value.

Summary_synthetic_data : After the synthetic data is created, this activity analyzes it again and saves the resulting statistics to a CSV file.


Usage


The last parameter we set is label_variable. This parameter defines the target column of the synthetic dataset, for each dataset that we want to pass to the decision tree model.
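
As an illustration only, label_variable might be a mapping from dataset name to target column; the dataset and column names below are hypothetical:

```python
# Hypothetical shape of the label_variable parameter: each dataset is mapped
# to the column the decision tree step should predict.
label_variable = {
    "customers": "churned",    # classification target
    "orders": "order_total",   # regression target
}
```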


Room for Improvement

Acknowledgements