Closed: asfimport closed this issue 6 years ago.
Wes McKinney / @wesm: Along with the 0.5.0 release, we should have a blog post that explains how the object store works and calls for contributions, which might help recruit some Java developers to get involved.
Lu Qi / @luchy0120: Hi, Philipp Moritz, I've been working on reading and writing Tensors in Java for several weeks. I have a Tensor structure like this: class Tensor { private float[] storage; private int[] shape; } I used JNI to leverage the Plasma C++ client. One good thing is that when writing a tensor, the "GetPrimitiveArrayCritical" method returns the address of the array in the Java heap (depending on the VM implementation), so I can construct the Tensor in C++ without copying; although it pauses GC during the critical section, Plasma writes are non-blocking. On the reading side, however, I need to copy the shared memory into the Java heap, which costs time. So, to save reading time, a pure Java client may be a good choice.
As for a pure Java client, maybe we can use JNI to get the fd first and construct a FileDescriptor from it: https://stackoverflow.com/questions/4845122/using-a-numbered-file-descriptor-from-java
Philipp Moritz / @pcmoritz: Hey Lu Qi,
I have very limited experience with Java; here are some thoughts that I hope are helpful:
You can do zero-copy reads in Java using an off-heap method like http://xcorpion.tech/2016/09/10/It-s-all-about-buffers-zero-copy-mmap-and-Java-NIO/. Given that the data already lives in in-memory memory-mapped files, this might be the best way forward here.
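The zero-copy read path can be sketched with plain NIO. In this minimal example, a small temporary file stands in for one of Plasma's memory-mapped segments (the file name and contents are placeholders):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapRead {
    public static void main(String[] args) throws IOException {
        // Stand-in for a plasma memory-mapped segment.
        Path path = Files.createTempFile("plasma-demo", ".bin");
        Files.write(path, new byte[] {1, 2, 3, 4});

        // Map the file read-only: bytes are read directly from the
        // page cache, without copying them into the Java heap.
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            System.out.println(buffer.get(2)); // prints "3"
        }
        Files.deleteIfExists(path);
    }
}
```

`FileChannel.map` hands back a `MappedByteBuffer` backed directly by the mapped pages, which is what makes the read zero-copy.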
We would essentially define our own Tensor class and then use code like https://github.com/apache/spark/tree/50ada2a4d31609b6c828158cad8e128c2f605b8d/common/unsafe/src/main/java/org/apache/spark/unsafe (see for example https://github.com/apache/spark/blob/50ada2a4d31609b6c828158cad8e128c2f605b8d/common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java) to access the data without copies.
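A minimal sketch of the Spark-style off-heap idea, here as a hypothetical `FloatArray` (the linked `LongArray` works the same way with 8-byte strides); obtaining `Unsafe` via reflection is a common workaround, not an official API:

```java
import sun.misc.Unsafe;
import java.lang.reflect.Field;

// Hypothetical off-heap float array in the style of Spark's unsafe LongArray.
public class FloatArray {
    private static final Unsafe UNSAFE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private final long baseAddress;
    private final long length;

    public FloatArray(long baseAddress, long length) {
        this.baseAddress = baseAddress;
        this.length = length;
    }

    // Read/write elements at a raw memory address, 4 bytes per float, no heap copy.
    public float get(long index) {
        return UNSAFE.getFloat(baseAddress + index * 4L);
    }

    public void set(long index, float value) {
        UNSAFE.putFloat(baseAddress + index * 4L, value);
    }

    public static void main(String[] args) {
        long address = UNSAFE.allocateMemory(16); // room for 4 floats
        FloatArray arr = new FloatArray(address, 4);
        arr.set(0, 1.5f);
        System.out.println(arr.get(0)); // prints "1.5"
        UNSAFE.freeMemory(address);
    }
}
```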
Arrow already has a Tensor class in C++ that does similar things, and the current Python serialization code uses it to read Tensors in a zero-copy way from the object store and expose them as numpy arrays to the user. On the Java side, I think not much is available yet for reading tensors; as a starting point, the code for parsing Tensor metadata is generated here: https://github.com/apache/arrow/blob/82eea49b3eea6047f53478113ab3ff9a38f0d344/java/format/pom.xml#L108
If you look at the code for reading C++ Tensors, you should be able to get a prototype of this working. I'm also cc'ing some of the people who have done most of the work on the Java implementation for more input.
@BryanCutler @siddharthteotia @jacques-n
– Philipp.
Lu Qi / @luchy0120:
Hi, Philipp,
Thanks for providing this material. I see that numpy uses "PyArray_NewFromDescr" to wrap memory without copying data. So, on the Java side, we can mimic this method and provide a wrapper class for viewing or modifying the underlying mmap'd shared memory. But for now, in my case, I have an already defined Tensor using a float array, so I have to copy the data into it, which is pretty sad. Maybe one day I can drop my Tensor.
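Such a wrapper class could be sketched as follows; `MappedTensor`, the row-major layout, and the little-endian file format are assumptions for illustration, not existing Arrow code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical wrapper viewing a mapped file as a float tensor,
// without ever copying the data into a float[] on the heap.
public class MappedTensor {
    private final FloatBuffer data;
    private final int[] shape;

    public MappedTensor(FloatBuffer data, int[] shape) {
        this.data = data;
        this.shape = shape;
    }

    // Row-major 2-D indexing into the underlying mapped buffer.
    public float get(int row, int col) {
        return data.get(row * shape[1] + col);
    }

    public static void main(String[] args) throws IOException {
        // Write a 2x2 little-endian float tensor to stand in for shared memory.
        Path path = Files.createTempFile("tensor", ".bin");
        ByteBuffer bytes = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        for (float v : new float[] {1f, 2f, 3f, 4f}) bytes.putFloat(v);
        Files.write(path, bytes.array());

        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            FloatBuffer floats = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
                                   .order(ByteOrder.LITTLE_ENDIAN)
                                   .asFloatBuffer();
            MappedTensor t = new MappedTensor(floats, new int[] {2, 2});
            System.out.println(t.get(1, 0)); // prints "3.0"
        }
        Files.deleteIfExists(path);
    }
}
```

`asFloatBuffer` creates a typed view over the mapped bytes, which is roughly the Java analogue of numpy wrapping existing memory.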
Philipp Moritz / @pcmoritz: That makes sense for now, and I agree it's a little sad. For the future, maybe you can get some insights from https://github.com/deeplearning4j/deeplearning4j on how to write the Tensor class the "right" way; unfortunately, Java doesn't have a long tradition of scientific computing like Python does, so there is no good standard Tensor class like numpy's.
Edit: This is also an opportunity for Arrow, if we had a good Java tensor class it could be widely used because of the increasing importance of deep learning. Another project to look at is https://github.com/intel-analytics/BigDL. We also wrote our own in the past: https://github.com/amplab/SparkNet/blob/master/src/main/scala/libs/NDArray.scala and https://github.com/amplab/SparkNet/blob/master/src/main/java/libs/JavaNDArray.java to interop with Caffe and TensorFlow, but it might not be too useful for shared memory.
Lu Qi / @luchy0120: I work on BigDL, and we are looking for solutions for moving data from Python to Java.
Philipp Moritz / @pcmoritz: Issue resolved by pull request 2065 https://github.com/apache/arrow/pull/2065
Adam Gibson: Hey folks, Adam from deeplearning4j here. nd4j is likely the closest thing to a "numpy" on the JVM you are going to get. It can read numpy and tensorflow arrays directly in memory with zero copy, on top of being able to work with MKL/CUDA, while also having a fairly friendly managed-buffers story: https://deeplearning4j.org/workspaces
Apache tika and apache solr have not been afraid to work with us. I'd encourage folks to reach out to us in the future rather than just skimming and making some assumptions.
We'd be more than glad to engage the arrow community. We already have our own support for reading/writing apache arrow tensors:
https://github.com/deeplearning4j/deeplearning4j/tree/master/nd4j/nd4j-serde/nd4j-arrow
Apache mahout also uses our underlying JNI stack javacpp: https://github.com/apache/mahout/blob/master/viennacl-omp/pom.xml
We've also based our ETL software for preprocessing data on Arrow. We've done quite a few tricks with the javacpp tensorflow bindings as well, to coax tensorflow graphs into the nd4j environment for graph execution.
There is some neat work we could do together here if folks are interested. There doesn't seem to be too much interest in making the Java bindings work well with tensors (mainly because of the focus on Python), but if anyone is interested in making that work well, we'd be more than glad to support such efforts.
We should start thinking about what a Java client for Plasma would look like. Given Arrow's goal of supporting Python, C++, and Java really well, Java is the next important target after Python and C++.
My preliminary thoughts are the following: we can either go with JNI and wrap the C++ client, or (in my opinion preferably) write a pure Java client that communicates with the Plasma store via Java flatbuffers over sockets.
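Since Java 16, the JDK supports Unix domain sockets natively, which would give a pure Java client a transport without any JNI. The sketch below uses an in-process echo server as a stand-in for the Plasma store, and a placeholder string instead of the real flatbuffers messages:

```java
import java.io.IOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PlasmaSocketSketch {
    public static void main(String[] args) throws Exception {
        Path socketPath = Files.createTempDirectory("plasma-demo").resolve("store.sock");
        UnixDomainSocketAddress address = UnixDomainSocketAddress.of(socketPath);

        // Stand-in for the plasma store: accept one connection, echo the message back.
        Thread server = new Thread(() -> {
            try (ServerSocketChannel listener = ServerSocketChannel.open(StandardProtocolFamily.UNIX)) {
                listener.bind(address);
                try (SocketChannel conn = listener.accept()) {
                    ByteBuffer buf = ByteBuffer.allocate(64);
                    conn.read(buf);
                    buf.flip();
                    conn.write(buf);
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
        server.start();

        // Client side: wait until the store has bound its socket, then connect.
        while (!Files.exists(socketPath)) {
            Thread.sleep(10);
        }
        try (SocketChannel client = SocketChannel.open(address)) {
            // Placeholder for a serialized flatbuffers request.
            client.write(ByteBuffer.wrap("PlasmaConnectRequest".getBytes(StandardCharsets.UTF_8)));
            ByteBuffer reply = ByteBuffer.allocate(64);
            int n = client.read(reply);
            System.out.println(new String(reply.array(), 0, n, StandardCharsets.UTF_8));
        }
        server.join();
        Files.deleteIfExists(socketPath);
    }
}
```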
It seems that the only thing blocking a pure Java client at the moment is the way we ship file descriptors for the memory-mapped files between store and client (see the file fling.cc in the Plasma repo). We would need to get rid of that, because there is no pure Java API for transferring file descriptors across a process boundary. The way to share memory-mapped files across process boundaries is then probably to use the file system and keep the memory-mapped files there instead of unlinking them immediately (as we do at the moment), so they can be opened by the client process via their path.
The challenge in this case is how to clean the files up and make sure they are not left lying around if the plasma store crashes. One option is to store the plasma store's PID with the file (i.e., as part of the file name) and let the plasma store clean them up the next time it is started; maybe there is OS-level support for temporary files we can reuse.
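The PID-in-the-filename idea could be sketched with `ProcessHandle` (Java 9+); the `plasma-<pid>-<segment>` naming scheme below is hypothetical, not anything the store actually does:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical naming scheme: each segment file is "plasma-<store pid>-<segment id>".
public class StaleSegmentCleanup {
    public static void cleanup(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            files.filter(p -> p.getFileName().toString().startsWith("plasma-"))
                 .forEach(p -> {
                     long pid = Long.parseLong(p.getFileName().toString().split("-")[1]);
                     // Delete the segment if the owning store process is no longer alive.
                     if (ProcessHandle.of(pid).isEmpty()) {
                         try {
                             Files.delete(p);
                         } catch (IOException e) {
                             throw new UncheckedIOException(e);
                         }
                     }
                 });
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("plasma-segments");
        // A segment owned by a (almost certainly) nonexistent process, and one owned by us.
        Path stale = Files.createFile(dir.resolve("plasma-999999999-0"));
        Path live = Files.createFile(dir.resolve("plasma-" + ProcessHandle.current().pid() + "-0"));
        cleanup(dir);
        System.out.println(Files.exists(stale)); // prints "false"
        System.out.println(Files.exists(live));  // prints "true"
    }
}
```

Running this at store startup would remove segments left behind by a crashed store while leaving live ones intact.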
I probably won't get to this for a while, so if anybody needs this or has free cycles, they should feel free to chime in. Also opinions on the design are appreciated!
– Philipp.
Reporter: Philipp Moritz / @pcmoritz
Note: This issue was originally created as ARROW-1163. Please see the migration documentation for further details.