GoogleCloudPlatform / dlp-dataflow-deidentification

Multi Cloud Data Tokenization Solution By Using Dataflow and Cloud DLP
Apache License 2.0
89 stars 53 forks

DLP ORC support: part1 #163

Closed chitara-01 closed 11 months ago

chitara-01 commented 11 months ago

Summary (Short summary of what is being done) :

Support for inspection and de-identification of ORC data stored in GCS buckets

Description (Describe in detail the fix made) :

This change reads ORC files from GCS buckets and processes the data with the inspection and de-identification pipelines. The results are written to BigQuery tables. This is part one of the DLP ORC support project. Future work includes writing results as ORC files to GCS buckets and re-identifying the data stored in de-identified ORC files.
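For reviewers unfamiliar with the data shape, here is a rough Python sketch of how rows read from an ORC file could be mapped to the `Table` structure that Cloud DLP accepts (a list of `headers` plus `rows` of `values`). It is illustrative only and is not the code in this PR: the function name `records_to_dlp_table` is hypothetical, and it assumes each ORC row has already been read into a dictionary and that every cell is sent as a string.

```python
from typing import Any, Dict, List, Sequence


def records_to_dlp_table(records: Sequence[Dict[str, Any]]) -> Dict[str, List]:
    """Convert row dictionaries into the dict form of a Cloud DLP Table.

    Hypothetical helper: assumes each ORC row was already read into a dict
    and that all cells are passed to DLP as strings.
    """
    if not records:
        return {"headers": [], "rows": []}

    # Use the first record's keys as the column order for every row.
    columns = list(records[0].keys())
    headers = [{"name": col} for col in columns]

    rows = []
    for record in records:
        values = []
        for col in columns:
            cell = record.get(col)
            # Missing or null cells become empty strings.
            values.append({"string_value": "" if cell is None else str(cell)})
        rows.append({"values": values})

    return {"headers": headers, "rows": rows}


if __name__ == "__main__":
    sample = [
        {"id": 1, "card_number": "4111-1111-1111-1111"},
        {"id": 2, "card_number": None},
    ]
    print(records_to_dlp_table(sample))
```

The resulting dictionary can be placed under the `table` key of a DLP content item. Sending every cell as a string keeps the sketch simple; typed values (integers, dates) would need their own `Value` fields.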

Bug ID (if any) :

301563096

Public Documentation (if any) :


TESTED (Test Cases with scenario and description - must have 1 positive and 1 negative scenario) :

  1. Wrote a script to convert the CSV data in the CCRecords sample files to ORC format. Inspection and de-identification both ran successfully on the converted files.
  2. Created unit tests for the implementation.
  3. CI tests: to be added.