Mohammad Islam: Very good and relevant Jira @ggershinsky.
We (at Uber) are also interested in this feature. How is it coming along?
Gidon Gershinsky / @ggershinsky: Thank you Mohammad, this is moving along nicely.
The design is converging (thanks to Julien's and Marcel's feedback, and Ryan's spot-on comments), the implementation is half-way there, testing will start soon. A pull request should be ready in a couple of weeks.
Mohammad Islam: Thanks @ggershinsky for the updates.
I will also review the doc and add some of our requirements to it, if anything is missing.
Looking forward to the patch.
Cesar Delgado / @beettlle: Saw that the modular encryption doc was merged a couple of days ago. Very happy to see this Jira moving along. Can't wait to have something to try in the near future.
Gidon Gershinsky / @ggershinsky: Thanks Cesar, sounds good!
DB Tsai: This feature will be very important for both the Spark community and our use case. We're looking forward to it! Thanks.
Gang Ma: I noticed that the encryption format has been merged into the encryption branch of the parquet-format repo, and some implementation PRs have been filed against the parquet-mr repo. Is there a release plan for this feature? Will it be available in parquet 1.12.0?
Gabor Szadovszky / @gszadovszky: @ggershinsky, this is an umbrella jira. Is it related to the parquet-format release 2.7.0? Please, remove the tag if not.
Jason Brugger: What's the best way to get started with this on a Databricks cluster? If I install format-2.7.0 as a new library, how would I reference this data source in lieu of the cluster's default parquet library?
Gidon Gershinsky / @ggershinsky: [~jasbru] Currently, this can't be run on a Databricks cluster - besides the Thrift structures in parquet-format-2.7.0, it will also require a Java implementation of parquet encryption and key management libraries (not merged yet, but we're working on this).
Bogdan: I tried out the encryption branch but the latest commit (#614) is not even compiling.
Could you recommend a commit I could check out to test the encryption feature?
Thanks in advance!
Gidon Gershinsky / @ggershinsky: [~Vatkov], a number of pull requests are under construction.
Hopefully they will be sent to the repository by next week. No guarantees they'll be merged by that time, but you'll still be able to assemble the outstanding PRs, and build/run the encryption code.
Venkata Satya Pradeep Srikakolapu: This feature fits my case nicely. Any timeline on this? I am basically looking for a format that can encrypt PII/PHI columns when storing on Data Lake Store in an Azure Databricks cluster. Do you recommend any other alternative if this feature is not going to be available soon?
Gidon Gershinsky / @ggershinsky: [~prdpsvs@gmail.com] this feature is already implemented in parquet-cpp code (check Apache Arrow, from version 0.16). If you need a Java version, it should be available soon in parquet-mr, check #776 - the last pull request in the basic encryption layer. We're working to make it a part of the next parquet-mr release, 1.12. No specific timelines at this point. On top of the basic encryption layer, we're building a high level interface that will simplify using the parquet encryption, see PARQUET-1568. Updated details are coming up, we plan to try to make it a part of v1.12 too.
Venkata Satya Pradeep Srikakolapu: Thank you @ggershinsky for the quick reply. Could you please point me to an example of implementing this feature in Apache Arrow? I am interested in understanding key management for encrypting/decrypting columns with Apache Arrow.
I am working with a customer in the health care space. My customer wants to encrypt sensitive columns while persisting data to disk (Data Lake). I see a few options for column encryption and key management.
Some context: My customer is using Apache Parquet with Scala + Spark extensively. My customer is also planning to use Python with Parquet.
Could you please recommend what would be the best choice?
Gidon Gershinsky / @ggershinsky: hard to say what's best at this point. Here's the arrow/parquet-cpp encryption sample.
Bogdan: @ggershinsky, any idea what the schedule for 1.12 looks like? Any chance for it to happen before 2021?
Mike Dias: Hello, we'd love to see this feature released so we can start the work required to fully integrate with Spark. Right now we are compiling both Parquet and Spark master branches to get access to the feature, and that is preventing us from moving forward with column encryption in production systems. Is there any expected date to release it in 1.12?
Gabor Szadovszky / @gszadovszky: [~mike_dias], the Spark community is still working on migrating to 1.11 (see SPARK-26346 for details). There are transitive dependency issues with Avro. I cannot give any ETA for 1.12, but even if we were able to release it tomorrow, I am not sure it would be available in Spark soon.
Henry Jones: Similar to the questions above, I'd love a rough estimate for the target release date for this. Storing sensitive data in encrypted form whilst retaining the ability to filter and search is an amazing feature which we'd love to work with at our company.
Gabor Szadovszky / @gszadovszky: I hope we can do a release candidate next month.
Gidon Gershinsky / @ggershinsky: Released. Thanks to all who've contributed to this new Parquet capability!
A mechanism for modular encryption and decryption of Parquet files. It allows data to be kept fully encrypted in storage while enabling efficient analytics on the data, via reader-side extraction, authentication, and decryption of the data subsets required by columnar projection and predicate push-down.
Enables fine-grained access control to column data by encrypting different columns with different keys.
Supports a number of encryption algorithms, to account for different security and performance requirements.
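To illustrate the "different columns with different keys" idea, here is a minimal sketch that builds a per-column key mapping. It assumes the property names `parquet.encryption.column.keys` and `parquet.encryption.footer.key` and the `keyID:col,col;keyID:col` value format used by the properties-driven interface shipped with parquet-mr 1.12; none of these are defined in this issue, so please verify them against the release documentation. The helper functions, key IDs, and column names are hypothetical.

```python
from typing import Dict, List


def column_keys_property(keys_to_columns: Dict[str, List[str]]) -> str:
    """Format {master key ID: [column names]} as 'keyA:col1,col2;keyB:col3'.

    The value format is an assumption based on parquet-mr 1.12.
    """
    seen = set()
    parts = []
    for key_id, columns in keys_to_columns.items():
        if not columns:
            raise ValueError(f"no columns listed for key {key_id!r}")
        for column in columns:
            # A column can be encrypted with only one key.
            if column in seen:
                raise ValueError(f"column {column!r} is assigned to more than one key")
            seen.add(column)
        parts.append(f"{key_id}:{','.join(columns)}")
    return ";".join(parts)


def encryption_write_options(footer_key_id: str,
                             keys_to_columns: Dict[str, List[str]]) -> Dict[str, str]:
    """Build writer options; the property names are assumed, not taken from this issue."""
    return {
        "parquet.encryption.footer.key": footer_key_id,
        "parquet.encryption.column.keys": column_keys_property(keys_to_columns),
    }


# Hypothetical key IDs and column names, for illustration only.
options = encryption_write_options(
    "footer_key",
    {"pii_key": ["ssn", "email"], "phi_key": ["diagnosis"]},
)
print(options["parquet.encryption.column.keys"])
```

With Spark, options like these would typically be passed to the DataFrame writer one by one, alongside a crypto factory class and a KMS client class set in the Hadoop configuration; that wiring depends on the deployment and is not shown here.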
Reporter: Gidon Gershinsky / @ggershinsky Assignee: Gidon Gershinsky / @ggershinsky
Note: This issue was originally created as PARQUET-1178. Please see the migration documentation for further details.