GoogleCloudPlatform / dlp-dataflow-deidentification

Multi Cloud Data Tokenization Solution By Using Dataflow and Cloud DLP
Apache License 2.0

Support to INSPECT & DEID parquet files #157

Closed chitara-01 closed 11 months ago

chitara-01 commented 12 months ago

Summary (Short summary of what is being done) :

Adds support for inspecting (INSPECT) and de-identifying (DEID) Parquet files read from a GCS bucket, with results stored in BigQuery datasets.

Description (Describe in detail the fix made) :

Introduces a dedicated Java package that reads Parquet files as GenericRecord objects, flattens each record, and converts it to a Table.Row object for further processing. The change covers both inspection and de-identification of input files stored in GCS buckets. Results are written to BigQuery datasets, and the resulting tables can be re-identified in the usual manner.
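The flattening step can be illustrated with a minimal sketch. This is not the code from this PR: to stay runnable without Avro or DLP dependencies, nested `Map` and `List` values stand in for `GenericRecord` fields, and the output is an ordered map of column name to string value rather than a `Table.Row`. The class name `RecordFlattenSketch`, the dotted-path and `[index]` column naming, and the empty-string handling of nulls are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: nested maps and lists stand in for Avro GenericRecord values. */
final class RecordFlattenSketch {

  private RecordFlattenSketch() {}

  /** Flattens a nested record into an ordered map of column path to string value. */
  static Map<String, String> flatten(Map<String, Object> record) {
    Map<String, String> out = new LinkedHashMap<>();
    flattenInto("", record, out);
    return out;
  }

  private static void flattenInto(String prefix, Object value, Map<String, String> out) {
    if (value instanceof Map) {
      // Nested record: recurse into each field, extending the dotted path.
      for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
        String key = prefix.isEmpty() ? String.valueOf(e.getKey()) : prefix + "." + e.getKey();
        flattenInto(key, e.getValue(), out);
      }
    } else if (value instanceof List) {
      // Repeated field: one column per element, suffixed with its index.
      List<?> list = (List<?>) value;
      for (int i = 0; i < list.size(); i++) {
        flattenInto(prefix + "[" + i + "]", list.get(i), out);
      }
    } else {
      // Leaf value: assumed here to be emitted as a string, with null as empty.
      out.put(prefix, value == null ? "" : value.toString());
    }
  }
}
```

Under these assumptions, a record with a nested `address` field containing `city` and `zip` yields the columns `address.city` and `address.zip`. The keys would then play the role of table headers and the values the role of the cells in each row.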

Bug ID (if any) :

b/293426633

Public Documentation (if any) :


TESTED (Test Cases with scenario and description - must have 1 positive and 1 negative scenario) :

  1. Converted the CCRecords sample data from CSV to Parquet format and tested both the inspection and de-identification pipelines.
  2. Tested on Parquet data from this GitHub repository.
  3. Tested on various other Parquet sample data from sources such as Kaggle and GitHub.
  4. Introduced CI testing for the Parquet pipeline, covering both inspection and de-identification.
  5. Updated with PR suggestions and corrections, then retested.