apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Add 'mask' command to parquet-tools/parquet-cli #2455

Open asfimport opened 4 years ago

asfimport commented 4 years ago

Some personal data columns need to be masked instead of being pruned (PARQUET-1791). We need a tool that replaces the raw data columns with masked values. The masked value could be a hash, null, a redacted value, etc. Unchanged columns should be moved as a whole, as the 'merge' and 'prune' commands in parquet-tools do.
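To make the three masking modes named above concrete, here is a minimal sketch in plain Python. It is not the proposed parquet-tools implementation: the function names, the `MASKS` table, and the optional `salt` parameter are all assumptions made for illustration, and values are treated as an in-memory list rather than as Parquet pages.

```python
import hashlib


def mask_hash(value, salt=b""):
    """Replace a value with the hex SHA-256 digest of its text form.

    `salt` is a hypothetical parameter, not something named in this issue.
    """
    if value is None:
        return None  # nulls stay null, so null counts are unchanged
    return hashlib.sha256(salt + str(value).encode("utf-8")).hexdigest()


def mask_null(value):
    """Replace every value, present or not, with null."""
    return None


def mask_redact(value, fill="*"):
    """Replace a value with a filler string of the same length."""
    if value is None:
        return None
    return fill * len(str(value))


# Hypothetical mode names, mirroring "hash, null, redact" from the description.
MASKS = {"hash": mask_hash, "null": mask_null, "redact": mask_redact}


def mask_column(values, mode):
    """Apply one masking mode to every value of a single column."""
    mask = MASKS[mode]
    return [mask(v) for v in values]
```

For example, `mask_column(["alice", None], "redact")` yields `["*****", None]`. Note that an unsalted hash of a low-entropy value (such as a phone number) can be recovered by hashing every candidate value, and a same-length redaction still reveals the length of the original.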

Implementing this feature at the file-format level is 10x faster than rewriting the table data in a query engine.

Reporter: Xinli Shang / @shangxinli Assignee: Xinli Shang / @shangxinli

Note: This issue was originally created as PARQUET-1792. Please see the migration documentation for further details.

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: If you are talking about one file at a time, you might be right that it is 10x faster than doing it in a query engine. But the tool runs on one node, while the query engine uses several at the same time, so I am not sure about the 10x performance. Pruning the file makes sense to me at the library level because you can do it efficiently (there is no need to unpack/decode the pages or the entire column chunks). Masking the values, on the other hand, requires reading the actual values and generating the hashes. You also need to generate the related statistics. Therefore, I am not sure this masking feature is properly suited for parquet-mr.

asfimport commented 4 years ago

Gidon Gershinsky / @ggershinsky: There is also a security aspect. While "prune" cleanly removes a sensitive column, and is therefore safe, "mask"/"redact" replaces one version of the column data with another version, and can easily leak sensitive information if not done properly. I believe that at this stage, it's best done above Parquet by the users, who can simply add columns with the masked data. It can also be faster than Parquet tools if run on a multi-threaded engine, as mentioned by Gabor.

We're working on a system that would make it possible to analyze Parquet files with masked/redacted columns and detect information leakage. This would also allow masking to be performed inside the Parquet libraries, making it fast / multi-threaded. But this project will take a while to complete. It's not urgent though, since, again, masking/redaction can easily be implemented by the users today, above Parquet.

asfimport commented 4 years ago

Xinli Shang / @shangxinli: @gszadovszky, the tool can be run in parallel in a cluster. For example, we can easily write a Spark application to do it. Actually, even for 'prune', we still need to write a Spark application to parallelize it. Otherwise, the time to finish is still significant, although it is already faster than doing it in query engines.
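As a rough illustration of why the per-file tool parallelizes easily: each file is an independent unit of work, so the file list can simply be mapped over a pool of workers. The sketch below uses a local thread pool from the Python standard library as a stand-in for the Spark application mentioned above. `mask_file` is a hypothetical placeholder that only derives an output name; it does not read or write Parquet data.

```python
from concurrent.futures import ThreadPoolExecutor


def mask_file(path):
    """Hypothetical per-file job.

    A real job would rewrite the file at `path` with masked columns;
    this placeholder only returns the output path it would write to.
    """
    return path + ".masked"


def mask_files(paths, workers=4):
    """Run the per-file job over all paths concurrently.

    Files are independent, so no coordination between jobs is needed.
    Results come back in the same order as `paths`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(mask_file, paths))
```

In a cluster, the same shape applies with the list of file paths distributed across executors instead of threads.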

Regarding reading the original values and generating the hashes/statistics, we only need to do that for the columns to be masked. In many cases, we see only a few columns that need masking. All other columns are moved as a whole, as in the 'merge' or 'prune' commands, which is a big saving. This operation would be slower than the 'prune' command, but it can still save a great deal compared with doing it via a query engine.
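The selective rewrite described here can be sketched with a toy model. This is not parquet-mr code: `ColumnChunk`, `build_chunk`, and `mask_row_group` are hypothetical names, and a chunk is modeled as a tuple of string values plus min/max/null-count statistics rather than as encoded pages. The point it shows is that untouched columns are passed through as-is, while masked columns are decoded, rewritten, and given fresh statistics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnChunk:
    """Toy stand-in for a column chunk: values plus their statistics."""
    values: tuple
    min_value: Optional[str]
    max_value: Optional[str]
    null_count: int


def build_chunk(values):
    """Build a chunk and compute its statistics from the given values."""
    present = [v for v in values if v is not None]
    return ColumnChunk(
        values=tuple(values),
        min_value=min(present) if present else None,
        max_value=max(present) if present else None,
        null_count=len(values) - len(present),
    )


def mask_row_group(row_group, masks):
    """Return a new row group with the columns named in `masks` rewritten.

    `row_group` maps column name -> ColumnChunk.
    `masks` maps column name -> function applied to each value.
    """
    out = {}
    for name, chunk in row_group.items():
        mask = masks.get(name)
        if mask is None:
            # Not masked: reuse the chunk unchanged, with no decoding.
            out[name] = chunk
        else:
            # Masked: read every value, replace it, recompute statistics.
            out[name] = build_chunk([mask(v) for v in chunk.values])
    return out
```

Recomputing statistics for the masked column matters because stale min/max values from the original data would themselves leak information about it.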

asfimport commented 4 years ago

Xinli Shang / @shangxinli: @ggershinsky, this is just a simple offline tool to replace the raw columns with masked values. It is different from what we talked about earlier for the data obfuscation feature. The difference is that users have to run this tool explicitly, and they are aware of what the data will be after translation. There is no chance of them doing it accidentally, implicitly, or by default.

The tool can provide different ways to translate the raw data into masked values and can allow users to define their own if they have security concerns. We just provide the tool to make their work easier. In addition, ORC has already released such masking mechanisms.

As mentioned earlier, I can send an email to the dev mailing list to see whether there is a need for this tool.

Again, this proposal is independent of the data obfuscation that we are jointly working on.

asfimport commented 3 years ago

Gabor Szadovszky / @gszadovszky: @shangxinli, is this still targeted for 1.12.0?

asfimport commented 3 years ago

Xinli Shang / @shangxinli: We might want to push it to the next release.

asfimport commented 3 years ago

Gabor Szadovszky / @gszadovszky: Removed the target 1.12.0.

asfimport commented 3 years ago

Sudhakar J Pyndi: Could you please confirm the target version for this feature? Thank you