GoogleCloudPlatform / dlp-dataflow-deidentification

Multi Cloud Data Tokenization Solution By Using Dataflow and Cloud DLP
Apache License 2.0

Orc support #170

Closed by chitara-01 10 months ago

chitara-01 commented 10 months ago

Summary (Short summary of what is being done) :

Write ORC results to GCS buckets.

Description (Describe in detail the fix made) :

Adds support for writing ORC results to GCS buckets after de-identification. The file schema needed to create the ORC writer is generated by ExtractFileSchemaTransform as a mapping from each file name to its file schema.
The CSV sample dataset (CCRecords) has been converted to ORC format and added to the mock-data folder for reference.
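
For reviewers, the file-name-to-schema mapping described above can be pictured with the following minimal sketch. It uses only the Java standard library and is not the code in this PR: the class name FileSchemaSketch, the helper methods, the file name CCRecords.orc, and the column names are all hypothetical. The schema string follows the ORC struct notation (struct<name:type,...>), which is assumed here to be the form handed to the writer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the ExtractFileSchemaTransform implementation.
final class FileSchemaSketch {

  /** Renders ordered (column name -> ORC type) pairs as an ORC struct type string. */
  static String toOrcStruct(Map<String, String> columns) {
    StringBuilder sb = new StringBuilder("struct<");
    boolean first = true;
    for (Map.Entry<String, String> column : columns.entrySet()) {
      if (!first) {
        sb.append(',');
      }
      sb.append(column.getKey()).append(':').append(column.getValue());
      first = false;
    }
    return sb.append('>').toString();
  }

  /** Looks up the schema recorded for a file, failing loudly if none was extracted. */
  static String schemaFor(Map<String, String> schemasByFile, String fileName) {
    String schema = schemasByFile.get(fileName);
    if (schema == null) {
      throw new IllegalArgumentException("No schema recorded for file: " + fileName);
    }
    return schema;
  }

  public static void main(String[] args) {
    // Assumed columns for the sample dataset; the real CCRecords columns may differ.
    Map<String, String> columns = new LinkedHashMap<>();
    columns.put("card_number", "string");
    columns.put("card_holder_name", "string");

    // File name -> schema, the shape of mapping the description refers to.
    Map<String, String> schemasByFile = new LinkedHashMap<>();
    schemasByFile.put("CCRecords.orc", toOrcStruct(columns));

    // A writer step would resolve the schema by file name before writing rows.
    System.out.println(schemaFor(schemasByFile, "CCRecords.orc"));
  }
}
```

Keying the lookup by file name lets each output file be written with the schema of the input file it came from; a missing entry is treated as an error in this sketch rather than falling back to a default schema.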

Bug ID (if any) :

b/301563260

Public Documentation (if any) :


TESTED (Test Cases with scenario and description - must have 1 positive and 1 negative scenario) :

  1. Unit tests to be added.
  2. CI changes to be added.

codecov[bot] commented 10 months ago

Codecov Report

Merging #170 (deb71bd) into master (abae253) will decrease coverage by 0.61%. The diff coverage is 0.00%.

@@             Coverage Diff              @@
##             master     #170      +/-   ##
============================================
- Coverage     13.06%   12.45%   -0.61%     
  Complexity       63       63              
============================================
  Files            51       53       +2     
  Lines          2365     2480     +115     
  Branches        202      207       +5     
============================================
  Hits            309      309              
- Misses         2037     2152     +115     
  Partials         19       19              

Files                                                  Coverage Δ
...n/DLPTextToBigQueryStreamingV2PipelineOptions.java  0.00% <ø> (ø)
...m/google/swarm/tokenization/orc/ORCReaderDoFn.java  37.83% <0.00%> (ø)
...m/google/swarm/tokenization/common/WriteToGCS.java  0.00% <0.00%> (ø)
...m/tokenization/orc/ExtractFileSchemaTransform.java  0.00% <0.00%> (ø)
...arm/tokenization/DLPTextToBigQueryStreamingV2.java  0.00% <0.00%> (ø)
...m/google/swarm/tokenization/orc/ORCWriterDoFn.java  0.00% <0.00%> (ø)
