apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0
1.69k stars 422 forks source link

PARQUET-2491: Address AES GCM invocation limit in encryption spec #259

Closed ggershinsky closed 2 weeks ago

ggershinsky commented 3 weeks ago

Follow up on this thread.

ggershinsky commented 3 weeks ago

cc @pitrou

pitrou commented 3 weeks ago

How about the following wording to better explain the limit:

Encryption security of the two supported AES modes relies on the uniqueness of the (encryption key, nonce) pair accross all invocations of the encryption function. If a given nonce is reused by mistake with the same encryption key, then an attacker could get critical information about the encryption key and potentially break the data encrypted with this particular key.

Since this specification requires nonces to be random-generated, this implies a statistical limit to the number of cipher invocations for a given encryption key. By convention, the section 8 of the NIST SP 800-38D document limits the number of cipher invocations to 2^32 to ensure that the probability of nonce reuse stays below 2^-32.

In Parquet, each encrypted module consists of a separate cipher invocation. Moreover, the vast majority of encrypted modules are page headers and page payloads. This therefore translates into the following conventional limits:

  • If data in the Parquet file is encrypted using the file's footer key, then there should be no more than 2^31 pages in a given file, and no more than 2^31 pages accross all files encrypted with the same footer key (in any case, it is recommended to never reuse the same footer key accross files);
  • If data in the Parquet file is encrypted using each column's encryption key, then there should no more than 2^31 pages in each given column.
pitrou commented 3 weeks ago

@alamb @tustvold Do you want to chime in on this?

pitrou commented 3 weeks ago

By the way, would this issue be alleviated if column keys were always derived from the footer key? What is the use case for not using dedicated column keys?

tustvold commented 3 weeks ago

Do you want to chime in on this

Thank you for the ping, the Rust reader doesn't currently support encryption, and so I don't have the necessary context to comment in a meaningful capacity

ggershinsky commented 3 weeks ago

Encryption security of the two supported AES modes relies on the uniqueness of the (encryption key, nonce) pair accross all invocations of the encryption function. If a given nonce is reused by mistake with the same encryption key, then an attacker could get critical information about the encryption key and potentially break the data encrypted with this particular key.

Since this specification requires nonces to be random-generated, this implies a statistical limit to the number of cipher invocations for a given encryption key. By convention, the section 8 of the NIST SP 800-38D document limits the number of cipher invocations to 2^32 to ensure that the probability of nonce reuse stays below 2^-32.

The ultimate source of this Parquet limitation is the NIST spec, section 8.3 (more specifically, this text : "The total number of invocations of the authenticated encryption function shall not exceed 2^32, including all IV lengths and all instances of the authenticated encryption function with the given key"). The Parquet spec can ref/site this text, but shouldn't supply an explanation (must be very careful here - best be left to the official NIST explanation in the section 8 and in the referenced papers). Our spec will also be more compact then, no overlap with the NIST doc.

In Parquet, each encrypted module consists of a separate cipher invocation. Moreover, the vast majority of encrypted modules are page headers and page payloads. This therefore translates into the following conventional limits:

  • If data in the Parquet file is encrypted using the file's footer key, then there should be no more than 2^31 pages in a given file, and no more than 2^31 pages accross all files encrypted with the same footer key (in any case, it is recommended to never reuse the same footer key accross files);
  • If data in the Parquet file is encrypted using each column's encryption key, then there should no more than 2^31 pages in each given column.

This is not entirely accurate. Some columns can be encrypted with the footer key.

alamb commented 3 weeks ago

Thank you @ggershinsky for the clarifications @ggershinsky and the ping @pitrou

ggershinsky commented 3 weeks ago

By the way, would this issue be alleviated if column keys were always derived from the footer key? What is the use case for not using dedicated column keys?

Thats the main of point of column encryption - to enable access control of sensitive columns. In this usecase, a user with access to the footer key should not be able to derive the column key. There is a usecase where this might be done - when the goal is to exceed the 2 billion page limit in a file with flat/uniform encryption (without supporting column access control). Then say each column key is derived from the footer key, and the limit becomes N x 2 billion (where N is the number of columns). But this is niche; and can be done already today. This spec explicitly leaves key management out of scope, as it will quickly go down a rabbit hole. As discussed above, we'll mention the invocation limit; the users will figure out the implications.

pitrou commented 3 weeks ago

The ultimate source of this Parquet limitation is the NIST spec, section 8.3 (more specifically, this text : "The total number of invocations of the authenticated encryption function shall not exceed 2^32, including all IV lengths and all instances of the authenticated encryption function with the given key"). The Parquet spec can ref/site this text, but shouldn't supply an explanation (must be very careful here - best be left to the official NIST explanation in the section 8 and in the referenced papers).

Two things:

  1. Our reference to the NIST doc should not misleading. In particular, the current wording makes it seem like 2^32 is some kind of binary threshold above which encryption is broken. This is not the case.

  2. IMHO, when it comes to crypto (or security in general), it's good to promote understanding of the reasons behind the rules, rather than just the rules themselves. But I admit there can be other viewpoints, so will leave it to you or other people to decide. @rdblue @zivanfi @gszadovszky

ggershinsky commented 3 weeks ago
  1. Our reference to the NIST doc should not misleading. In particular, the current wording makes it seem like 2^32 is some kind of binary threshold above which encryption is broken. This is not the case.

Yep, this wording will change (per a comment above).

  1. IMHO, when it comes to crypto (or security in general), it's good to promote understanding of the reasons behind the rules, rather than just the rules themselves. But I admit there can be other viewpoints, so will leave it to you or other people to decide. @rdblue @zivanfi @gszadovszky

I don't mind a quick general text that describes the reasoning behind the limit, but doesn't go into details. I'll try to put something together.

ggershinsky commented 2 weeks ago

Hi @pitrou , the PR is updated. Thanks for your work on reviewing it!

pitrou commented 2 weeks ago

Thanks for the updates!