Updated logic to process array data types in parquet

GoogleCloudPlatform / dlp-dataflow-deidentification

Multi Cloud Data Tokenization Solution By Using Dataflow and Cloud DLP

Apache License 2.0


Summary (Short summary of what is being done) :

Updated logic to process array data types in parquet

Description (Describe in detail the fix made) :

The current logic for processing ARRAY data types in Parquet breaks the pipeline. When a field's data type is an array, the field name is assigned a null value, and the pipeline then throws "CoderException: cannot encode a null String". For more details, please refer to the attached Buganizer ticket.
With the improved implementation, any array data type in a Parquet structure is converted to a list of strings, processed by the DLP API, and written to BigQuery (BQ) tables as a string with "[" and "]" as the start and end characters, respectively, to denote that it was originally an ARRAY data type in Parquet. This keeps the implementation logic simple and can be improved on when the output is required in Parquet format.
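
As a rough illustration of the conversion described above (not the actual code in `GenericRecordFlattener.java`), the sketch below turns an array value into a list of strings and then into a single bracketed string. The class name `ArrayFieldSketch`, both helper methods, and the `", "` separator are assumptions made for this example; the PR description only specifies the "[" and "]" delimiters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical sketch; names and the ", " separator are assumptions,
// not taken from the project's source.
class ArrayFieldSketch {

    // Convert every element of an array value to its string form.
    // A null array yields an empty list so that no null reaches the coder.
    static List<String> toStringList(List<?> arrayValue) {
        List<String> out = new ArrayList<>();
        if (arrayValue == null) {
            return out;
        }
        for (Object element : arrayValue) {
            // String.valueOf(Object) maps a null element to the text "null".
            out.add(String.valueOf(element));
        }
        return out;
    }

    // Join the elements into one string wrapped in "[" and "]" to mark
    // that the value was originally an ARRAY in the Parquet schema.
    static String toBracketedString(List<?> arrayValue) {
        StringJoiner joiner = new StringJoiner(", ", "[", "]");
        for (String s : toStringList(arrayValue)) {
            joiner.add(s);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBracketedString(Arrays.asList("alice", "bob")));
        System.out.println(toBracketedString(null));
    }
}
```

Returning an empty list for a null array is one way to avoid passing a null string downstream; whether the real fix handles nulls this way is not stated in the description.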

Bug ID (if any) :

b/310247478

Public Documentation (if any) :

TESTED (Test Cases with scenario and description - must have 1 positive and 1 negative scenario) :

Tested on the parquet file provided in the ticket.

Codecov Report

Attention: 9 lines in your changes are missing coverage. Please review.

Comparison is base (d02173f) 13.43% compared to head (3c7ce08) 13.41%.

| Files | Patch % | Lines |
|---|---|---|
| ...m/tokenization/parquet/GenericRecordFlattener.java | 0.00% | 9 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #183      +/-   ##
============================================
- Coverage   13.43%   13.41%   -0.03%
  Complexity     67       67
============================================
  Files          53       53
  Lines        2515     2519       +4
  Branches      211      213       +2
============================================
  Hits          338      338
- Misses       2157     2161       +4
  Partials       20       20
```
