Netflix / iceberg

Iceberg is a table format for large, slow-moving tabular data
Apache License 2.0

Snapshot cryptographic integrity #97

Closed ggershinsky closed 5 years ago

ggershinsky commented 5 years ago

Parquet encryption protects the integrity of individual data files. However, in untrusted storage, the removal of one or more data files from a table might go unnoticed. Replacing the contents of one file with those of another will also go unnoticed unless the user has provided a unique Parquet AAD prefix for each file.

The snapshot integrity mechanism cryptographically protects the integrity of data sets composed of multiple Parquet files.

The mechanism works by creating a small signature file that contains the table URI / snapshot ID and the number of files. It can also contain an explicit list of file names (with or without the full path). The file contents are signed (and can also be encrypted, e.g. with AES-GCM).

On the writer side, the mechanism creates an AAD prefix for every data file and creates the signature file itself. The input is the snapshot URI, the number of files N, and the encryption/signature key, plus (optionally) the list of file names.
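A minimal sketch of the writer side, to make the inputs and outputs concrete. Everything here is an assumption rather than a proposed format: the function names `aad_prefix` and `write_signature_file` are hypothetical, the file layout (a hex tag on the first line, followed by a JSON body) is made up for illustration, HMAC-SHA256 stands in for the signature so that the example needs only the standard library, and the AAD prefixes are derived by hashing the snapshot URI and the file index, where a real implementation might prefer random values. Encryption of the signature file is not shown.

```python
import hashlib
import hmac
import json


def aad_prefix(snapshot_uri, index):
    """Hypothetical per-file AAD prefix, unique per (snapshot, file index)."""
    digest = hashlib.sha256(f"{snapshot_uri}#{index}".encode()).hexdigest()
    return digest[:32]


def write_signature_file(snapshot_uri, n_files, key, file_names=None):
    """Return the bytes of a signed signature file.

    Assumed layout: hex HMAC-SHA256 tag, a newline, then the JSON body.
    """
    if file_names is not None and len(file_names) != n_files:
        raise ValueError("file_names must contain exactly n_files entries")

    payload = {
        "snapshot": snapshot_uri,
        "num_files": n_files,
        # One AAD prefix per data file, in file order.
        "aad_prefixes": [aad_prefix(snapshot_uri, i) for i in range(n_files)],
    }
    if file_names is not None:
        payload["file_names"] = list(file_names)

    body = json.dumps(payload, sort_keys=True).encode()
    # The tag covers the exact body bytes, so the reader can verify them
    # before parsing anything.
    tag = hmac.new(key, body, hashlib.sha256).hexdigest().encode()
    return tag + b"\n" + body
```

The writer would pass the i-th entry of `aad_prefixes` to Parquet encryption when writing the i-th data file, and store the returned bytes next to the snapshot.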

On the reader side, the mechanism parses and verifies the signature file and provides the framework with the verified table URI / snapshot ID, the number of files that must be accounted for, and the Parquet AAD prefix for each file, plus (optionally) the list of file names. The input is the signature file, the encryption/signature key, and (optionally) the expected table URI / snapshot ID.
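The reader-side checks could look roughly like the sketch below. It assumes a hypothetical signature file layout (a hex HMAC-SHA256 tag on the first line, followed by a JSON body with `snapshot`, `num_files` and an optional `file_names` list); the names `read_signature_file`, `check_file_set` and `IntegrityError` are illustrative only, and HMAC-SHA256 stands in for whatever signature or AES-GCM scheme is eventually chosen. `make_demo_signature_file` exists only to produce an input for the example.

```python
import hashlib
import hmac
import json


class IntegrityError(Exception):
    """Raised when the signature file or the data file set fails a check."""


def read_signature_file(sig_bytes, key, expected_snapshot=None):
    """Verify the tag first, then parse and return the trusted payload."""
    tag_hex, sep, body = sig_bytes.partition(b"\n")
    if not sep:
        raise IntegrityError("malformed signature file")
    expected_tag = hmac.new(key, body, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(tag_hex, expected_tag):
        raise IntegrityError("signature mismatch")
    payload = json.loads(body)
    if expected_snapshot is not None and payload["snapshot"] != expected_snapshot:
        raise IntegrityError("signature file belongs to a different snapshot")
    return payload


def check_file_set(payload, found_files):
    """Check that every file named by the verified payload is accounted for."""
    if len(found_files) != payload["num_files"]:
        raise IntegrityError(
            f"expected {payload['num_files']} files, found {len(found_files)}"
        )
    names = payload.get("file_names")
    if names is not None and sorted(names) != sorted(found_files):
        raise IntegrityError("file names do not match the signed list")


def make_demo_signature_file(key):
    """Stand-in for the writer, used only to build an example input."""
    body = json.dumps({
        "snapshot": "s3://bucket/table#snap-1",
        "num_files": 2,
        "file_names": ["a.parquet", "b.parquet"],
    }).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest().encode() + b"\n" + body
```

With this in place, a listing that is missing `b.parquet` fails `check_file_set`, and a signature file whose body was edited in storage fails `read_signature_file` before any of its contents are trusted. Swapped file contents would be caught separately, when Parquet decryption is given the per-file AAD prefix from the verified payload.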

rdblue commented 5 years ago

This issue has moved to the ASF project: https://github.com/apache/incubator-iceberg/issues/44