apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

[C++] Data set integrity tool #2251

Open asfimport opened 5 years ago

asfimport commented 5 years ago

Parquet encryption protects the integrity of individual files. However, data sets (such as tables) are often written as a collection of files, say

"/path/to/dataset"/part0.parquet.encrypted

..

"/path/to/dataset"/partN.parquet.encrypted

 

On untrusted storage, the removal of one or more files goes unnoticed. Replacing the contents of one file with those of another also goes unnoticed, unless the user has provided a unique AAD prefix for each file.

 

The data set integrity tool addresses these problems. It doesn't necessarily belong in Parquet functionality (which is focused on individual files (?)), but it will help higher-level frameworks that use Parquet to cryptographically protect the integrity of data sets composed of multiple files.

The use of this tool is not obligatory, as frameworks can use other means to verify table (file collection) integrity.

 

The tool works by creating a small file that can be stored as, say,

"/path/to/dataset"/.dataset.signature

 

that contains the dataset's unique name (URI) and the number of files. It can also contain an explicit list of file names (with or without the full path). The file contents are either encrypted with AES-GCM (authenticated, encrypted) or hashed and signed (authenticated, plaintext).
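As a rough illustration of the plaintext, authenticated variant, the signature file could be laid out as below. This is only a sketch under stated assumptions: the JSON layout, the field names (`dataset_uri`, `num_files`, `file_names`, `tag`) and the function `build_signature_file` are hypothetical, and HMAC-SHA256 (a symmetric MAC) stands in for the "hashed and signed" step. The AES-GCM variant is not shown.

```python
import hashlib
import hmac
import json
from typing import List, Optional


def build_signature_file(dataset_uri: str, num_files: int, key: bytes,
                         file_names: Optional[List[str]] = None) -> bytes:
    """Return the bytes of a plaintext, authenticated signature file.

    Hypothetical layout: {"payload": {...}, "tag": "<hex HMAC-SHA256>"}.
    """
    payload = {"dataset_uri": dataset_uri, "num_files": num_files}
    if file_names is not None:
        if len(file_names) != num_files:
            raise ValueError("file list length must equal num_files")
        payload["file_names"] = list(file_names)
    # Canonical serialization, so a reader can recompute the same tag.
    canonical = json.dumps(payload, sort_keys=True,
                           separators=(",", ":")).encode("utf-8")
    tag = hmac.new(key, canonical, hashlib.sha256).hexdigest()
    return json.dumps({"payload": payload, "tag": tag},
                      sort_keys=True).encode("utf-8")
```

The returned bytes would be written to something like "/path/to/dataset"/.dataset.signature. Because the tag covers the URI, the file count and the optional file list, changing any of them without the key invalidates the file.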

 

On the writer side, the tool creates AAD prefixes for every data file and creates the signature file itself. The input is the dataset URI, N (the number of files), and the encryption/signature key, plus (optionally) the list of file names (with or without the full path).
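The issue does not say how the AAD prefixes are constructed. One possible scheme, shown here purely as an assumption, derives each prefix from the dataset URI and the file's index, so that every file gets a distinct prefix bound to its dataset. The function name `make_aad_prefixes` and the `#part{i}` format are hypothetical.

```python
from typing import List


def make_aad_prefixes(dataset_uri: str, num_files: int) -> List[bytes]:
    """Return one distinct AAD prefix per data file (hypothetical scheme)."""
    if num_files < 0:
        raise ValueError("num_files must be non-negative")
    # Assumed format: "<dataset URI>#part<index>", encoded as UTF-8.
    return [f"{dataset_uri}#part{i}".encode("utf-8") for i in range(num_files)]
```

A framework would pass prefix `i` to its Parquet encryption setup when writing file `i`. With distinct prefixes, a file copied over another one fails AES-GCM verification on read, because the reader supplies the prefix of the file it expected.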

 

On the reader side, the tool parses and verifies the signature file, and provides the framework with the verified dataset name, the number of files that must be accounted for, and the AAD prefix for each file, plus (optionally) the list of file names (with or without the full path). The input is the expected dataset URI and the encryption/signature key.
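The reader-side flow could be sketched as follows. The assumptions are: a JSON signature file of the form `{"payload": {...}, "tag": "<hex>"}` authenticated with HMAC-SHA256 over a canonical serialization of the payload, and AAD prefixes derived as `<dataset URI>#part<index>`. None of these details come from the issue, and the names `verify_signature_file` and `missing_files` are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Iterable, List


def verify_signature_file(blob: bytes, expected_uri: str, key: bytes) -> dict:
    """Authenticate the signature file and return the verified metadata."""
    doc = json.loads(blob)
    payload = doc["payload"]
    canonical = json.dumps(payload, sort_keys=True,
                           separators=(",", ":")).encode("utf-8")
    expected_tag = hmac.new(key, canonical, hashlib.sha256).hexdigest()
    # Constant-time comparison of the recomputed and stored tags.
    stored_tag = str(doc.get("tag", ""))
    if not hmac.compare_digest(expected_tag.encode("utf-8"),
                               stored_tag.encode("utf-8")):
        raise ValueError("signature file failed authentication")
    if payload["dataset_uri"] != expected_uri:
        raise ValueError("signature file belongs to a different dataset")
    n = payload["num_files"]
    return {
        "dataset_uri": payload["dataset_uri"],
        "num_files": n,
        # Assumed prefix scheme; must match whatever the writer used.
        "aad_prefixes": [f"{expected_uri}#part{i}".encode("utf-8")
                         for i in range(n)],
        "file_names": payload.get("file_names"),
    }


def missing_files(verified: dict, listing: Iterable[str]) -> List[str]:
    """Return verified file names that are absent from a storage listing."""
    present = set(listing)
    return [name for name in (verified["file_names"] or [])
            if name not in present]
```

The framework would then compare `num_files` (and the file list, if present) against what it finds in storage, and use the returned AAD prefix when decrypting each file. A removed file shows up as a count or name mismatch, and a swapped file fails decryption under the expected prefix.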

 

 

 

Reporter: Gidon Gershinsky / @ggershinsky Assignee: Gidon Gershinsky / @ggershinsky


Note: This issue was originally created as PARQUET-1457. Please see the migration documentation for further details.

asfimport commented 5 years ago

Ryan Blue / @rdblue: @ggershinsky, this sounds like a reasonable extension to a table format and not really something that I think Parquet should be doing.

What do you think about coming up with a proposal for snapshot integrity for Iceberg?

asfimport commented 5 years ago

Gidon Gershinsky / @ggershinsky: Sounds good, I'll prepare this proposal.

Process-wise, I will first need to handle some internal paperwork required for contributing to a different project.