apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Rewrite pyarrow.jvm using the C data interface #29891

Open asfimport opened 2 years ago

asfimport commented 2 years ago

pyarrow.jvm is currently a custom-written bridge between PyArrow and Arrow Java, with limited datatype support. Now that Java implements the C data interface (see ARROW-12965), we should be able to simplify the code while making it more general.

Also, we should reenable the conda-python-jpype build somewhere, for example in the Crossbow nightly builds.
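For readers unfamiliar with the C data interface referenced above: it exchanges data through two plain C structs, `ArrowSchema` and `ArrowArray`, which the producer fills in and the consumer reads by memory address. The following is a minimal, layout-only sketch using only Python's `ctypes`. The helper names `export_int32` and `import_int32` are hypothetical, the `release` callbacks are left unset (a real producer must set them), and neither pyarrow nor Arrow Java is involved.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    pass


class ArrowArray(ctypes.Structure):
    pass


# Field order follows the Arrow C data interface specification.
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.c_void_p),  # function pointer, left opaque and unset here
    ("private_data", ctypes.c_void_p),
]

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.c_void_p),  # function pointer, left opaque and unset here
    ("private_data", ctypes.c_void_p),
]


def export_int32(values):
    """Producer side (hypothetical helper): describe a non-null int32 array."""
    data = (ctypes.c_int32 * len(values))(*values)
    # An int32 array has two buffers: validity bitmap (absent here) and data.
    buffers = (ctypes.c_void_p * 2)(None, ctypes.addressof(data))
    schema = ArrowSchema(format=b"i", name=b"", flags=0, n_children=0)
    array = ArrowArray(length=len(values), null_count=0, offset=0,
                       n_buffers=2, n_children=0)
    array.buffers = buffers
    # The caller must keep `owner` alive while the addresses are in use.
    owner = (schema, array, buffers, data)
    return ctypes.addressof(schema), ctypes.addressof(array), owner


def import_int32(schema_address, array_address):
    """Consumer side (hypothetical helper): read the array back by address."""
    schema = ArrowSchema.from_address(schema_address)
    array = ArrowArray.from_address(array_address)
    if schema.format != b"i":
        raise TypeError("this sketch only handles int32 ('i')")
    data_address = array.buffers[1]
    count = array.offset + array.length
    data = (ctypes.c_int32 * count).from_address(data_address)
    return list(data)[array.offset:]


schema_addr, array_addr, owner = export_int32([1, 2, 3])
print(import_int32(schema_addr, array_addr))
```

Only the two integer addresses cross the boundary, which is what lets implementations in different languages share the same memory without a custom per-type bridge.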

Reporter: Antoine Pitrou / @pitrou

Note: This issue was originally created as ARROW-14319. Please see the migration documentation for further details.

asfimport commented 2 years ago

Roee Shlomo / @roee88: I assume that backward compatibility is not required for internal-use methods (i.e., those starting with an underscore). What about jvm_buffer: should it be kept as is?

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: jvm_buffer should probably be kept, yes. We may also want to deprecate it (it's not obvious that it's useful in isolation).

As for your other question: indeed, methods starting with an underscore are not subject to backward-compatibility concerns.

asfimport commented 2 years ago

Roee Shlomo / @roee88: I suspect that a better approach would be to create a new module and keep pyarrow.jvm as is:

  1. Backward compatibility seems like a challenge. A reference to org.apache.arrow.c must be provided so that ArrowSchema, ArrowArray and the various import/export functions are available on the Python side. In addition, all C data interface methods require an allocator as a parameter. Neither is provided in the current pyarrow.jvm API.
  2. The current pyarrow.jvm module works with a pure Java build of Arrow Java, while the C data interface requires building a small JNI library. Unless you rely on end users to build the Java jar on their own, packaging the JNI lib will be required for all platforms targeted by pyarrow.

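Point 1 can be made concrete with a sketch of the inputs a C-data-based conversion would need. The function name `array_from_jvm` and its parameters are hypothetical and not an existing pyarrow API. `c_data` stands for a handle to the `org.apache.arrow.c` Java package obtained through a bridge such as JPype. The Java calls (`Data.exportVector`, `ArrowArray.wrap`, `ArrowSchema.wrap`) and the Python calls (`pyarrow.cffi.ffi`, `Array._import_from_c`) follow my understanding of the Arrow Java C Data module and of pyarrow; treat them as assumptions to check against the installed versions.

```python
def array_from_jvm(jvm_vector, allocator, c_data):
    """Hypothetical sketch: import a Java FieldVector as a pyarrow.Array.

    Unlike the existing single-argument pyarrow.jvm.array(jvm_array), this
    needs a Java BufferAllocator and access to org.apache.arrow.c.
    """
    # Imported lazily so the sketch can be loaded without pyarrow installed.
    import pyarrow as pa
    from pyarrow.cffi import ffi

    # Allocate empty C structs on the Python side and take their addresses.
    c_schema = ffi.new("struct ArrowSchema*")
    c_array = ffi.new("struct ArrowArray*")
    schema_ptr = int(ffi.cast("uintptr_t", c_schema))
    array_ptr = int(ffi.cast("uintptr_t", c_array))

    # Ask the Java side to fill the structs in. This is the step that
    # requires the JNI library mentioned in point 2.
    c_data.Data.exportVector(
        allocator,
        jvm_vector,
        None,  # no dictionary provider in this sketch
        c_data.ArrowArray.wrap(array_ptr),
        c_data.ArrowSchema.wrap(schema_ptr),
    )

    # Take ownership of the exported data on the Python side.
    return pa.Array._import_from_c(array_ptr, schema_ptr)
```

The extra `allocator` and `c_data` parameters are exactly what the current pyarrow.jvm signatures have no place for, which is the backward-compatibility problem described above.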
asfimport commented 2 years ago

Antoine Pitrou / @pitrou: cc @xhochy

asfimport commented 2 years ago

Roee Shlomo / @roee88: @pitrou feel free to reuse code from my attempt the other day https://gist.github.com/roee88/4aa7dfeceb2d8c3d8868ed8465ebf561 if that helps. It's based on the Java-Python integration test code for ARROW-14374 (with the original test_jvm.py tests updated).

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: @amol- ^^

asfimport commented 1 year ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

vibhatha commented 7 months ago

@jorisvandenbossche is there an ongoing effort to integrate the C data interface into pyarrow.jvm?

jorisvandenbossche commented 7 months ago

I am not aware of anyone actually working on this; this issue only tracks that we should do it at some point.

vibhatha commented 7 months ago

Would it be okay if I work on this?

jorisvandenbossche commented 7 months ago

Certainly!

vibhatha commented 7 months ago

take

vibhatha commented 1 month ago

@jorisvandenbossche I am removing my assignment because my focus has changed and I couldn't attend to this issue in a timely manner.