Hi Team,

Please find below the use cases we are looking to implement for data scrubbing and synthetic data generation.
Scrubbing/masking
a. Read data files from the production S3 bucket
b. Scrub sensitive data from the files
c. Replace the scrubbed data with masked/synthetic data
d. Validate that the files do not contain any original data
e. Validate that the generated synthetic data matches the original schema
f. Validate that metadata (number of rows, number of columns, etc.) matches
g. Generate statistics on the scrubbing operation
h. Copy the final data to an alternate S3 bucket
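As an illustration, the scrubbing and validation steps above could be sketched roughly as follows. This is a minimal sketch, assuming CSV input and a known list of sensitive columns; the S3 transfer (steps a and h) and true synthetic-value generation are omitted, and all names here (`SENSITIVE_COLUMNS`, `scrub_csv`, etc.) are hypothetical.

```python
import csv
import hashlib
import io

# Hypothetical: in practice the sensitive fields would come from your data model.
SENSITIVE_COLUMNS = {"email", "ssn"}

def mask_value(value: str) -> str:
    """Replace a sensitive value with a deterministic, non-reversible token."""
    return "MASKED_" + hashlib.sha256(value.encode()).hexdigest()[:12]

def scrub_csv(raw: str) -> tuple[str, dict]:
    """Scrub sensitive columns and return the masked CSV plus operation statistics."""
    reader = csv.DictReader(io.StringIO(raw))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    masked_cells = 0
    rows = 0
    for row in reader:
        # Mask every cell belonging to a sensitive column (steps b-c).
        for col in SENSITIVE_COLUMNS & set(row):
            row[col] = mask_value(row[col])
            masked_cells += 1
        writer.writerow(row)
        rows += 1
    # Statistics on the scrubbing operation (steps f-g): row/column counts
    # can be compared against the source file's metadata.
    stats = {"rows": rows, "columns": len(reader.fieldnames), "masked_cells": masked_cells}
    return out.getvalue(), stats
```

A deterministic hash is used here so the same input always maps to the same token, which keeps joins across files consistent; a production tool would likely offer format-preserving or synthetic replacements instead.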
Synthetic Data generation
a. Read the data model from the user: schema, sample data, and custom field information describing how the data should be generated
b. Validate that the files do not contain any original data
c. Validate that the generated synthetic data matches the provided data model schema
d. Validate that metadata (number of rows, number of columns, etc.) matches
e. Generate statistics on the data generation
f. Copy the final data to the S3 bucket
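The validation steps b-d above could be sketched as below, assuming the synthetic output is available as a list of row dictionaries; the function and parameter names are hypothetical.

```python
def find_leaked_values(synthetic_rows, original_values):
    """Step b: return any synthetic cell values that also appear in the original data."""
    return [v for row in synthetic_rows for v in row.values() if v in original_values]

def schema_matches(synthetic_fieldnames, model_fieldnames):
    """Step c: column names and their order must match the provided data model."""
    return list(synthetic_fieldnames) == list(model_fieldnames)

def metadata_matches(synthetic_rows, expected_rows, expected_columns):
    """Step d: row count and per-row column count must match expectations."""
    return (len(synthetic_rows) == expected_rows
            and all(len(row) == expected_columns for row in synthetic_rows))
```

These checks would run before the final copy to S3 (step f), so that only validated output leaves the pipeline.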
Please find below our queries on the tool:
1. What types of files does it support?
2. Does it support scrubbing and synthetic data generation?
3. Does it support validation, and if so, what kinds of validation?
4. Does it support AWS S3 connectivity?
5. What kind of algorithms does it use?
6. Does it support Snowflake and Redshift connectivity?
7. What is the maximum file size it supports? We have a requirement of around 100 GB.