datacleaner / DataCleaner

The premier open source Data Quality solution
GNU Lesser General Public License v3.0

TDM- Scrubbing and Synthetic Data requirement #1935

Closed AbhijitDongre321 closed 2 years ago

AbhijitDongre321 commented 2 years ago

Hi Team,

Below is the use case we are looking to implement for data scrubbing and synthetic data generation.

Scrubbing/masking
a. Read data files from the production S3 bucket
b. Scrub the data in files that contain sensitive data
c. Replace the scrubbed data with masked/synthetic data
d. Validate that the files do not contain any original data
e. Validate that the generated synthetic data matches the original schema
f. Validate that the metadata (number of rows, number of columns, etc.) matches
g. Generate statistics on the scrubbing operation
h. Copy the final data to an alternate S3 bucket
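To make the scrubbing steps concrete, here is a minimal Python sketch of steps b through g on an in-memory CSV. It is only an illustration of what we mean, not something taken from DataCleaner: the column names, the salt, the `mask_value` and `scrub_csv` helpers, and the sample data are all hypothetical, and the S3 read and write (steps a and h) are left out.

```python
import csv
import hashlib
import io

# Hypothetical names of the columns that hold sensitive data.
SENSITIVE_COLUMNS = ("email", "ssn")

# Hypothetical sample standing in for a file read from the production bucket.
SAMPLE = (
    "id,email,ssn\n"
    "1,a@example.com,123-45-6789\n"
    "2,b@example.com,987-65-4321\n"
)


def mask_value(value, salt="demo-salt"):
    """Replace a value with a deterministic token derived from a salted hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "MASKED_" + digest[:12]


def scrub_csv(text, sensitive=SENSITIVE_COLUMNS):
    """Mask the sensitive columns and return (scrubbed_rows, stats)."""
    reader = csv.DictReader(io.StringIO(text))
    columns = list(reader.fieldnames or [])
    original_rows = list(reader)

    # Steps b and c: replace every non-empty sensitive cell with a masked token.
    scrubbed_rows = []
    cells_masked = 0
    for row in original_rows:
        clean = dict(row)
        for col in sensitive:
            if clean.get(col):  # skip missing or empty cells
                clean[col] = mask_value(clean[col])
                cells_masked += 1
        scrubbed_rows.append(clean)

    # Step d: count output cells that still equal an original sensitive value.
    original_values = {
        row[col] for row in original_rows for col in sensitive if row.get(col)
    }
    leaked = sum(
        1 for row in scrubbed_rows for value in row.values()
        if value in original_values
    )

    # Steps e, f and g: schema check, metadata check, and summary statistics.
    stats = {
        "rows_in": len(original_rows),
        "rows_out": len(scrubbed_rows),
        "columns_in": len(columns),
        "schema_matches": all(list(r.keys()) == columns for r in scrubbed_rows),
        "cells_masked": cells_masked,
        "leaked_values": leaked,
    }
    return scrubbed_rows, stats
```

Calling `scrub_csv(SAMPLE)` returns the masked rows together with a statistics dictionary. A run would count as valid when `leaked_values` is 0, `schema_matches` is true, and `rows_in` equals `rows_out`.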

Synthetic Data generation
a. Read the data model from the user: schema, sample data, and custom field information describing how the data should be generated
b. Validate that the files do not contain any original data
c. Validate that the generated synthetic data matches the provided data model schema
d. Validate that the metadata (number of rows, number of columns, etc.) matches
e. Generate statistics on the data generation
f. Copy the final data to an S3 bucket
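As a rough illustration of the generation flow, the Python sketch below builds rows from a user-supplied data model and then checks the schema and row count (steps a, c, d and e). The model format, the rule types (`int`, `str`, `choice`), and the helper names are assumptions made up for this example; they are not a DataCleaner feature. The check against original data (step b) and the S3 copy (step f) are not covered.

```python
import random
import string

# Hypothetical data model: column name -> rule describing how to generate it.
DATA_MODEL = {
    "id": {"type": "int", "min": 1, "max": 10000},
    "name": {"type": "str", "length": 8},
    "country": {"type": "choice", "values": ["IN", "US", "DE"]},
}


def generate_value(rule, rng):
    """Generate one value according to a single column rule."""
    kind = rule["type"]
    if kind == "int":
        return rng.randint(rule["min"], rule["max"])
    if kind == "str":
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(rule["length"]))
    if kind == "choice":
        return rng.choice(rule["values"])
    raise ValueError("unsupported rule type: " + str(kind))


def value_matches(rule, value):
    """Check that a generated value satisfies its column rule."""
    kind = rule["type"]
    if kind == "int":
        return isinstance(value, int) and rule["min"] <= value <= rule["max"]
    if kind == "str":
        return isinstance(value, str) and len(value) == rule["length"]
    if kind == "choice":
        return value in rule["values"]
    return False


def generate_rows(model, n_rows, seed=0):
    """Generate n_rows synthetic rows; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return [
        {col: generate_value(rule, rng) for col, rule in model.items()}
        for _ in range(n_rows)
    ]


def validate_rows(rows, model, expected_rows):
    """Return a small report covering schema, value rules, and row count."""
    columns = list(model.keys())
    return {
        "row_count": len(rows),
        "row_count_ok": len(rows) == expected_rows,
        "column_count": len(columns),
        "schema_ok": all(list(row.keys()) == columns for row in rows),
        "values_ok": all(
            value_matches(model[col], row[col]) for row in rows for col in columns
        ),
    }
```

For example, `validate_rows(generate_rows(DATA_MODEL, 5), DATA_MODEL, 5)` produces the report that would feed the statistics step.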

Please find below our questions about the tool:

1a. What types of files does it support?

  1. Does it support scrubbing and synthetic data generation?
  2. Does it support validation, and if so, what kinds of validation?
  3. Does it support AWS S3 connectivity?
  4. What kind of algorithm does it use?
  5. Does it support Snowflake and Redshift connectivity?
  6. What is the maximum file size it supports? We have a requirement of around 100 GB.
  7. Does it support PII data and Parquet-formatted files?
  8. Which programming languages does it support?