apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.49k stars 3.52k forks source link

[Python][Packaging] Add Parquet encryption/decryption / OpenSSL to Python wheels #24254

Open asfimport opened 4 years ago

asfimport commented 4 years ago

Reporter: Wes McKinney / @wesm

Subtasks:

Note: This issue was originally created as ARROW-8040. Please see the migration documentation for further details.

asfimport commented 4 years ago

Itamar Turner-Trauring / @itamarst: I would like to get this working (doing this on behalf of a client)—the packaging sides seem relatively simple, just adding the right flags to the build scripts, and maybe making sure OpenSSL is compiled in statically.

However, it doesn't seem like there's Python bindings for the encryption? Or at least, it's not clear to me how to use Parquet encryption from Python... So does that need to be done separately? Or is there an example I can look at?

Thanks!

asfimport commented 4 years ago

Itamar Turner-Trauring / @itamarst: Just checking in again—could someone answer my question above re the need for Python bindings in addition to packaging wheels? Thanks!

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: Right, it seems that encryption options are not exposed in the Python bindings. So I suggest you start with that. Do you need any guidance? Places to look at probably include pyarrow/_parquet.pxd, pyarrow/_parquet.pyx and pyarrow/parquet.py.

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: Perhaps open a separate JIRA for exposing encryption options in the Python bindings?

asfimport commented 4 years ago

Itamar Turner-Trauring / @itamarst: Thanks for the confirmation. Will open new issue and try to create a first pass API design.

asfimport commented 4 years ago

Gidon Gershinsky / @ggershinsky: @itamarst  Sorry for missing the start of this discussion. Please see my response in the list, and feel free to contact me with any questions, I'd be glad to help with this.

asfimport commented 3 years ago

Itamar Turner-Trauring / @itamarst: So it seems like high-level encryption is going to be a long time coming, and access to low-level encryption is still useful. So going to try to do that.

asfimport commented 3 years ago

Gidon Gershinsky / @ggershinsky: Hi Itamar, to sum up my previous comments on this, there are two issues with exposing the low level encryption to the users,

asfimport commented 3 years ago

Itamar Turner-Trauring / @itamarst: Given people are paying me to try to get this moving, some people are using the low-level API directly already, and they want to use it from Python too, and now. So exposing it doesn't necessarily make their situation any worse. And I do understand the concerns about interop.

So my thought was to expose the low-level API in a way that didn't preclude a higher-level API, and made it clear (via documentation and API names) that there will eventually be a nicer, higher-level API. Would something like that take away your concerns?

asfimport commented 3 years ago

Gidon Gershinsky / @ggershinsky: I presume you are getting paid to apply this code in their usecase, not to upstream it to an open source repository. I don't have concerns with the internal usage.

asfimport commented 3 years ago

Itamar Turner-Trauring / @itamarst: Sure, but that suggests it might be useful to other people? The low-level encryption API is already exposed in C+, and is the only API in C+, so people can already use it right now. It's possible no one is, of course.

Basically argument I'm making is "this will be useful to some people". That may not override the concerns about security, of course.

asfimport commented 3 years ago

Itamar Turner-Trauring / @itamarst: The cryptography library has this issue, with primitives they don't think most people should be using, and the solution is having a "hazardous materials" part of the API, explicitly marked as such: https://cryptography.io/en/latest/hazmat/primitives/index.html

asfimport commented 3 years ago

Gidon Gershinsky / @ggershinsky: Yep, I understand. But with this being a data protection technology, we need to take extra care to keep it secure. 

Regarding C++ - we're building this bottom up, starting with a cipher layer, then low-level encryption layer, then the high-level layer. No way this'd be accepted by the community as a single PR. The current situation is not ideal, it'd be good to merge the high-level PR (and maybe hide the low level), but here we are; also, C++ is a kind of a low-level language; Python would expose it to a less experienced audience.

asfimport commented 3 years ago

Itamar Turner-Trauring / @itamarst: On the list, Micah Kornfield pointed out that decryption doesn't have the security concerns, and insofar as interop is a concern it's never going to produce incompatible files. So I'm going to start with doing that as a minimum, and we can revisit encryption as a second pass if the high-level API is still in limbo.

asfimport commented 3 years ago

Gidon Gershinsky / @ggershinsky: Thanks. I agree the decryption part is much safer than encryption. Regarding interop - this decryption code will not be able to read files, produced by Spark, because that would require parsing the key material. Implementation of key material parsing means implementation of the high level layer. I'll send a mail to the list to check if this can be indeed be moved out of a state of limbo.

asfimport commented 3 years ago

Itamar Turner-Trauring / @itamarst: High-level API would be better, yeah. So I'm just going to do decryption low-level, have my client test it and see if it works for them. And if high-level gets merged soon (and that works for them), even better, can switch to wrapping that for Python.

asfimport commented 3 years ago

Gidon Gershinsky / @ggershinsky: Sure, SGTM

asfimport commented 3 years ago

Itamar Turner-Trauring / @itamarst: Turns out I probalby need to wrap some of the encryption API just to make testing easy :( So the first pass might include that, but won't expose it publicly.

asfimport commented 3 years ago

Itamar Turner-Trauring / @itamarst: I have an initial working sketch PR: https://github.com/apache/arrow/pull/9631

Would appreciate any feedback, especially on API design. As mentioned above, I was forced to wrap encryption in order to do testing, but the plan is to not make it public part of the API.

asfimport commented 3 years ago

Itamar Turner-Trauring / @itamarst: And also, of course, this only wraps parts of the decryption API, I plan to wrap the rest of it as well.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Note that Parquet modular encryption is now available in the Python wheels (there are nightly builds: https://arrow.apache.org/docs/python/install.html#installing-nightly-packages ). Should this JIRA be kept open?