apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Java] Implement non-sparse tensors #26115

Open asfimport opened 3 years ago

asfimport commented 3 years ago

We'd like to be able to round-trip NumPy ndarrays through Java, and to create tensors in Java that can eventually be mapped to ndarrays in Python. Having even a basic Tensor implementation, with extension types, as a contrib module would help greatly.

Some prior discussions

Reporter: David Li / @lidavidm

PRs and other links:

Note: This issue was originally created as ARROW-10101. Please see the migration documentation for further details.

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche: @lidavidm From the description above, it's not fully clear to me if you are talking about the (standalone) Tensor message type of the IPC protocol, or about storing a tensor as a value in a RecordBatch field.

Your description seems to talk about the first, but the mailing list thread talks about the second, I think. There are also some open issues about defining a standard ExtensionType for storing arrays in RecordBatch fields (ARROW-1614, ARROW-8714).

asfimport commented 3 years ago

David Li / @lidavidm: Hey @jorisvandenbossche this is about the standalone Tensor type - I'd like both eventually, but having the Tensor type itself implemented is a prerequisite to that, at least for our use cases (Python <-> Java). Thanks for the pointers!
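
To illustrate what a standalone dense (non-sparse) tensor amounts to, here is a minimal sketch: a flat off-heap buffer plus a shape and row-major byte strides, which is the same layout a C-contiguous NumPy ndarray uses. The class name `DenseTensorSketch` and all of its methods are hypothetical, chosen for illustration only; this is not the Arrow Java API or the code in the PR, and it assumes float64 elements and a buffer small enough for a single `ByteBuffer`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Hypothetical illustration only; not the Arrow Java API. */
public final class DenseTensorSketch {
    private final long[] shape;
    private final long[] strides;  // byte strides, row-major (C order)
    private final ByteBuffer data; // off-heap, little-endian float64 values

    public DenseTensorSketch(long[] shape) {
        this.shape = shape.clone();
        this.strides = rowMajorStrides(shape, Double.BYTES);
        long count = 1;
        for (long dim : shape) {
            count = Math.multiplyExact(count, dim);
        }
        int bytes = Math.toIntExact(Math.multiplyExact(count, (long) Double.BYTES));
        // Direct buffers are zero-initialized on allocation.
        this.data = ByteBuffer.allocateDirect(bytes).order(ByteOrder.LITTLE_ENDIAN);
    }

    /** Byte strides for a contiguous row-major layout: the last dimension varies fastest. */
    static long[] rowMajorStrides(long[] shape, int elementSize) {
        long[] result = new long[shape.length];
        long stride = elementSize;
        for (int i = shape.length - 1; i >= 0; i--) {
            result[i] = stride;
            stride *= shape[i];
        }
        return result;
    }

    /** Maps a multi-dimensional index to a byte offset into the flat buffer. */
    private int byteOffset(long... index) {
        if (index.length != shape.length) {
            throw new IllegalArgumentException("expected " + shape.length + " indices");
        }
        long offset = 0;
        for (int i = 0; i < index.length; i++) {
            if (index[i] < 0 || index[i] >= shape[i]) {
                throw new IndexOutOfBoundsException("index " + index[i] + " out of range for dimension " + i);
            }
            offset += index[i] * strides[i];
        }
        return Math.toIntExact(offset);
    }

    public double get(long... index) {
        return data.getDouble(byteOffset(index));
    }

    public void set(double value, long... index) {
        data.putDouble(byteOffset(index), value);
    }

    public long[] shape() {
        return shape.clone();
    }

    public long[] strides() {
        return strides.clone();
    }

    public static void main(String[] args) {
        DenseTensorSketch t = new DenseTensorSketch(new long[] {2, 3});
        t.set(42.0, 1, 2);
        System.out.println(Arrays.toString(t.strides())); // [24, 8]
        System.out.println(t.get(1, 2));                  // 42.0
    }
}
```

Because the shape, strides, and raw buffer are all that a consumer needs, a deliberately small surface like this is enough for conversion to and from other tensor APIs, which is the use case described in this thread.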

asfimport commented 3 years ago

Micah Kornfield / @emkornfield: @lidavidm I took a very cursory look at the code and it seems straightforward. One question I had, though: is there an existing OSS tensor model that makes sense for us to reuse, or is the Arrow off-heap/object model enough of a snowflake to make that impractical?

asfimport commented 3 years ago

David Li / @lidavidm: Thanks @emkornfield. I'm not aware of an existing model. Honestly, my intent here is not really to provide an API to manipulate tensors in Java, but just to make it possible to round-trip them and convert to/from other APIs, which is why the methods on this Tensor are pretty sparse.

A brief search turns up these:

asfimport commented 3 years ago

Micah Kornfield / @emkornfield: Thanks for investigating. I'm not an expert in this space, but I can try to take a look at the PR if no one else has provided feedback.

asfimport commented 3 years ago

David Li / @lidavidm: Thanks Micah - I'd appreciate any feedback on the API and approach before I go on to implement the rest of the tensor classes. I think that, as in Python, it mostly suffices to have something that makes it easy to convert to/from the API actually being used.