apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Parquet modular encryption #2110

Closed asfimport closed 3 years ago

asfimport commented 6 years ago

A mechanism for modular encryption and decryption of Parquet files. It allows data to be kept fully encrypted in storage while still enabling efficient analytics on it, via reader-side extraction, authentication, and decryption of the data subsets required by columnar projection and predicate push-down.

Enables fine-grained access control to column data by encrypting different columns with different keys.

Supports a number of encryption algorithms, to account for different security and performance requirements.
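The per-column key idea described above can be illustrated with a small, self-contained sketch. This is not the Parquet encryption format or the parquet-mr API: the class `ColumnEncryptionSketch`, its method names, the in-memory "file" map, and the choice of AES-GCM with the column name as authenticated data are all assumptions made for illustration, using only the JDK's `javax.crypto`. It shows a writer encrypting two columns with different keys and a reader that holds only one of them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical illustration only; not the Parquet file format or library API.
public class ColumnEncryptionSketch {
    private static final int NONCE_LEN = 12;  // bytes; the usual GCM nonce size
    private static final int TAG_BITS = 128;  // GCM authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    static SecretKey newKey() throws GeneralSecurityException {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(128);
        return gen.generateKey();
    }

    // Encrypts one column's bytes. The column name is bound as additional
    // authenticated data, so ciphertext moved to another column fails to verify.
    // Stored layout (an assumption of this sketch): nonce || ciphertext+tag.
    static byte[] encryptColumn(SecretKey key, String column, byte[] plain)
            throws GeneralSecurityException {
        byte[] nonce = new byte[NONCE_LEN];
        RANDOM.nextBytes(nonce);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, nonce));
        cipher.updateAAD(column.getBytes(StandardCharsets.UTF_8));
        byte[] sealed = cipher.doFinal(plain);
        return ByteBuffer.allocate(NONCE_LEN + sealed.length).put(nonce).put(sealed).array();
    }

    // Authenticates and decrypts one column. Throws AEADBadTagException if the
    // key is wrong, the column name differs, or the stored bytes were modified.
    static byte[] decryptColumn(SecretKey key, String column, byte[] stored)
            throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(TAG_BITS, stored, 0, NONCE_LEN));
        cipher.updateAAD(column.getBytes(StandardCharsets.UTF_8));
        return cipher.doFinal(stored, NONCE_LEN, stored.length - NONCE_LEN);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        SecretKey nameKey = newKey();
        SecretKey ssnKey = newKey();

        // Writer side: each column is encrypted with its own key.
        Map<String, byte[]> file = new LinkedHashMap<>();
        file.put("name", encryptColumn(nameKey, "name",
                "Alice,Bob".getBytes(StandardCharsets.UTF_8)));
        file.put("ssn", encryptColumn(ssnKey, "ssn",
                "111-11-1111,222-22-2222".getBytes(StandardCharsets.UTF_8)));

        // Reader side: this reader was granted only the "name" key.
        Map<String, SecretKey> readerKeys = new HashMap<>();
        readerKeys.put("name", nameKey);

        for (Map.Entry<String, byte[]> column : file.entrySet()) {
            SecretKey key = readerKeys.get(column.getKey());
            if (key == null) {
                System.out.println(column.getKey() + ": no key, column skipped");
                continue;
            }
            byte[] plain = decryptColumn(key, column.getKey(), column.getValue());
            System.out.println(column.getKey() + ": "
                    + new String(plain, StandardCharsets.UTF_8));
        }
    }
}
```

In this sketch a reader decrypts only the columns it projects and has keys for, which is the property that makes fine-grained access control possible. Key distribution and management are out of scope here.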

Reporter: Gidon Gershinsky / @ggershinsky Assignee: Gidon Gershinsky / @ggershinsky

Subtasks:

Note: This issue was originally created as PARQUET-1178. Please see the migration documentation for further details.

asfimport commented 6 years ago

Mohammad Islam: Very good and relevant Jira @ggershinsky.

We (at Uber) are also interested in this feature. How is it coming along?

asfimport commented 6 years ago

Gidon Gershinsky / @ggershinsky: Thank you Mohammad, this is moving along nicely.

The design is converging (thanks to Julien's and Marcel's feedback, and Ryan's spot-on comments), the implementation is half-way there, testing will start soon. A pull request should be ready in a couple of weeks.

asfimport commented 6 years ago

Mohammad Islam: Thanks @ggershinsky for the updates.

I will also review the doc and add any of our requirements that are missing.

Looking forward to the patch.

asfimport commented 6 years ago

Cesar Delgado / @beettlle: Saw that the modular encryption doc was merged a couple of days ago. Very happy to see this Jira moving along. Can't wait to have something to try in the near future.

asfimport commented 6 years ago

Gidon Gershinsky / @ggershinsky: Thanks Cesar, sounds good!

asfimport commented 6 years ago

DB Tsai: This feature will be very important for both the Spark community and our use case. We're looking forward to it! Thanks.

asfimport commented 5 years ago

Gang Ma: I noticed that the encryption format has been merged into the encryption branch of the parquet-format repo, and some implementation PRs have been filed against the parquet-mr repo. Is there a release plan for this feature? Will it be available in Parquet 1.12.0?

asfimport commented 5 years ago

Gabor Szadovszky / @gszadovszky: @ggershinsky, this is an umbrella jira. Is it related to the parquet-format release 2.7.0? Please, remove the tag if not.

asfimport commented 4 years ago

Jason Brugger: What's the best way to get started with this on a Databricks cluster? If I install format-2.7.0 as a new library, how would I reference this data source in lieu of the cluster's default parquet library?

asfimport commented 4 years ago

Gidon Gershinsky / @ggershinsky: [~jasbru] Currently, this can't be run on a Databricks cluster. Besides the Thrift structures in parquet-format-2.7.0, it will also require a Java implementation of the Parquet encryption and key management libraries (not merged yet, but we're working on this).

asfimport commented 4 years ago

Bogdan: I tried out the encryption branch but the latest commit (#614) is not even compiling.

Could you recommend a commit I could check out to test the encryption feature?

Thanks in advance!

asfimport commented 4 years ago

Gidon Gershinsky / @ggershinsky: [~Vatkov], a number of pull requests are under construction.

Hopefully they will be sent to the repository by next week. There are no guarantees they'll be merged by then, but you'll still be able to assemble the outstanding PRs and build and run the encryption code.

asfimport commented 4 years ago

Bogdan: Thanks @ggershinsky!

asfimport commented 4 years ago

Venkata Satya Pradeep Srikakolapu: This feature fits my case nicely. Is there any timeline for it? I am looking for a format that can encrypt PII/PHI columns when storing data on Data Lake Store from an Azure Databricks cluster. Would you recommend any alternative if this feature is not going to be available soon?

asfimport commented 4 years ago

Gidon Gershinsky / @ggershinsky: [~prdpsvs@gmail.com] this feature is already implemented in parquet-cpp code (check Apache Arrow, from version 0.16). If you need a Java version, it should be available soon in parquet-mr, check #776 - the last pull request in the basic encryption layer. We're working to make it a part of the next parquet-mr release, 1.12. No specific timelines at this point. On top of the basic encryption layer, we're building a high level interface that will simplify using the parquet encryption, see PARQUET-1568. Updated details are coming up, we plan to try to make it a part of v1.12 too.

asfimport commented 4 years ago

Venkata Satya Pradeep Srikakolapu: Thank you @ggershinsky for the quick reply. Could you please point me to an example of implementing this feature in Apache Arrow? I am interested in understanding key management for encrypting and decrypting columns with Apache Arrow.

I am working with a customer in the healthcare space. My customer wants to encrypt sensitive columns when persisting data to disk (Data Lake). I see a few options for column encryption and key management:

  1. Apache Arrow with Python. Would you recommend using Apache Arrow with Scala?
  2. parquet-mr, which is not released yet.
  3. Encryption with ORC files (similar to Parquet modular encryption) - https://jira.apache.org/jira/browse/ORC-14?jql=text%20~%20%22column%20level%20encryption%20to%20ORC%20files%22

Some context: my customer uses Apache Parquet with Scala and Spark extensively, and is also planning to use Python with Parquet.

Could you please recommend the best choice?

asfimport commented 4 years ago

Gidon Gershinsky / @ggershinsky: Hard to say what's best at this point. Here's the arrow/parquet-cpp encryption sample.

asfimport commented 4 years ago

Bogdan: @ggershinsky, any idea what the schedule for 1.12 looks like? Any chance of it happening before 2021?

asfimport commented 4 years ago

Mike Dias: Hello, we'd love to see this feature released so we can start the work required to fully integrate it with Spark. Right now we are compiling both the Parquet and Spark master branches to get access to the feature, and that is preventing us from moving forward with column encryption in production systems. Is there an expected date for releasing it in 1.12?

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: [~mike_dias], the Spark community is still working on migrating to 1.11 (see SPARK-26346 for details). There are transitive dependency issues with Avro. I cannot give an ETA for 1.12, and even if we were able to release it tomorrow, I am not sure it would be available in Spark soon.

asfimport commented 4 years ago

Henry Jones: Similar to the questions above, I'd love a rough estimate for the target release date for this. Storing sensitive data in encrypted form whilst retaining the ability to filter and search is an amazing feature which we'd love to work with at our company. 

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: I hope we can do a release candidate next month.

asfimport commented 3 years ago

Gidon Gershinsky / @ggershinsky: Released. Thanks to all who've contributed to this new Parquet capability!