Open sebcrozet opened 6 years ago
Originally posted by @jturner314 in https://github.com/rustsim/nalgebra/issues/473#issuecomment-439587433
@paddyhoran I came across anowell/are-we-learning-yet#14 today, which is related. I'm glad to see so many data science projects adding support for a common in-memory data format like Apache Arrow, and it looks relatively simple to add conversions of 1-D vectors/arrays to Apache Arrow. Looking at the Apache Arrow docs, though, it seems like the intended application is more for libraries like Pandas (data frames) than libraries like NumPy/ndarray/nalgebra (multidimensional arrays/matrices of a single element type). It appears that to represent more than one dimension, you have to use nested List types, which have variable lengths and require a buffer of offsets in addition to the values themselves. Is that true? What would be the best way to represent something like a 2-D or 3-D array in Apache Arrow?
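To make the concern about nested List types concrete, here is a minimal sketch in plain Rust (no `arrow` crate involved) of a 2-D array stored as a variable-length list column next to the same data stored as a dense row-major buffer. The names `ListLayout`, `encode_rows`, `get`, and `get_dense` are hypothetical and exist only for illustration.

```rust
/// Arrow-style variable-length list layout: a flat values buffer plus an
/// offsets buffer. Row `i` occupies `values[offsets[i]..offsets[i + 1]]`.
struct ListLayout {
    offsets: Vec<i32>,
    values: Vec<f64>,
}

/// Encode rows as a list column. Rows may have different lengths, which is
/// why the offsets buffer is needed at all.
fn encode_rows(rows: &[Vec<f64>]) -> ListLayout {
    let mut offsets = Vec::with_capacity(rows.len() + 1);
    let mut values = Vec::new();
    offsets.push(0);
    for row in rows {
        values.extend_from_slice(row);
        offsets.push(values.len() as i32);
    }
    ListLayout { offsets, values }
}

/// Look up element (i, j) by going through the offsets buffer.
fn get(list: &ListLayout, i: usize, j: usize) -> Option<f64> {
    let start = *list.offsets.get(i)? as usize;
    let end = *list.offsets.get(i + 1)? as usize;
    if start + j < end {
        Some(list.values[start + j])
    } else {
        None
    }
}

/// The same lookup on a dense row-major matrix needs only the column count.
fn get_dense(values: &[f64], ncols: usize, i: usize, j: usize) -> Option<f64> {
    if j < ncols {
        values.get(i * ncols + j).copied()
    } else {
        None
    }
}
```

For a rectangular matrix the offsets carry no extra information (they are just multiples of the column count), which is the overhead the question above is pointing at.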
@paddyhoran Thank you for mentioning ApacheArrow! This is definitely something to keep in mind and perhaps consider once the Rust implementation of ApacheArrow becomes usable.
@jturner314 Although DataFrame-like data structures are the main focus of Apache Arrow's columnar format, it extends beyond that. For instance, the Ray project uses the Arrow format in its distributed machine learning framework.
The tensor implementation is here. If there is anything you need to add you could always discuss on the mailing list.
There are a couple of nice things that adopting Arrow could enable: algorithms could be shared across libraries, and data could be shared between libraries with zero copies (even across languages; PySpark now uses Arrow to avoid copies when moving data between the JVM and Python). It also brings data storage formats, IPC, alignments that are most suitable for SIMD, etc. As the Rust Arrow implementation matures you could get a lot "for free".
You could also have tight integration with other Rust projects in the data engineering space, such as DataFusion.
It seems that an ecosystem will expand around Arrow. Check out Gandiva within the Arrow C++ implementation; it's an expression compiler that uses LLVM. We will likely add Rust bindings for this in the future. People are also using Arrow on the GPU, but most of this is happening in C++/Python land right now.
The tensor implementation is here.
Okay, thanks. I missed the description of tensors in my first read-through of the docs. It looks like the tensor format is defined here. ndarray can represent all ApacheArrow tensors (except for the "name" field on dimensions). ApacheArrow tensors can represent most ndarray arrays (the exception is negative strides). Based on the discussion in #473, nalgebra and ApacheArrow tensors are compatible (except for the "name" field on dimensions, and maybe strides = zero).
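The compatibility rules above can be sketched as a small conversion check. This is an illustration only: `TensorDesc` and `to_tensor_desc` are hypothetical names, not the `arrow` crate's API, and the sketch assumes the tensor format stores strides as non-negative byte offsets while the array library stores signed strides counted in elements.

```rust
/// Hypothetical descriptor mirroring the fields discussed above:
/// a shape, byte strides, and an optional name per dimension.
#[derive(Debug, PartialEq)]
struct TensorDesc {
    shape: Vec<i64>,
    strides: Vec<i64>, // in bytes, assumed non-negative
    dim_names: Vec<Option<String>>,
}

/// Convert an array layout (signed strides in elements) into a tensor
/// descriptor, rejecting the one case the tensor format cannot express.
fn to_tensor_desc(
    shape: &[usize],
    elem_strides: &[isize],
    elem_size: usize,
) -> Result<TensorDesc, String> {
    if shape.len() != elem_strides.len() {
        return Err("shape and strides have different lengths".to_string());
    }
    let mut strides = Vec::with_capacity(elem_strides.len());
    for (axis, &s) in elem_strides.iter().enumerate() {
        if s < 0 {
            // Reversed views have no counterpart without copying the data.
            return Err(format!("axis {} has negative stride {}", axis, s));
        }
        // Zero strides (broadcast axes) pass through unchanged here.
        strides.push(s as i64 * elem_size as i64);
    }
    Ok(TensorDesc {
        shape: shape.iter().map(|&d| d as i64).collect(),
        strides,
        // The array side has no dimension names, so none are set.
        dim_names: vec![None; shape.len()],
    })
}
```

A 2x3 row-major `f64` array (element strides `[3, 1]`) maps to byte strides `[24, 8]`, while any view with a negative stride would need to be copied into a standard layout first.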
There are a couple of nice things that adopting Arrow could enable: algorithms could be shared across libraries, and data could be shared between libraries with zero copies (even across languages; PySpark now uses Arrow to avoid copies when moving data between the JVM and Python). It also brings data storage formats, IPC, alignments that are most suitable for SIMD, etc. As the Rust Arrow implementation matures you could get a lot "for free".
Yeah, easy sharing of algorithms/data between libraries and languages would be really nice to reduce duplication of effort.
It seems that an ecosystem will expand around Arrow. Check out Gandiva within the Arrow C++ implementation; it's an expression compiler that uses LLVM.
Gandiva looks really cool!
All of this sounds really great! I'm one of the maintainers of ndarray. How do I add support for ApacheArrow? Is the arrow crate the right one to use? I don't see anything about tensors in the docs.
Is the arrow crate the right one to use? I don't see anything about tensors in the docs.
Yes, it is. We are still early on in the project but building steadily, so there is a lot of polish missing.
How do I add support for ApacheArrow?
I would love to see other projects use the Arrow data structures internally, but in the short term you could implement the From and Into traits for your main data structures until Arrow has a chance to become more robust, maybe behind a feature gate. In this way your users could still benefit from any new features in Arrow, but they would have to "roll their own" a little.
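As a rough sketch of that short-term approach, the conversions might look like the following. `ArrowTensor` and `Matrix` are stand-in types invented for this example, not the `arrow`, `ndarray`, or `nalgebra` types, and the feature name mentioned in the comment is an assumption. The reverse direction uses `TryFrom` rather than `From` (a substitution made here) because not every tensor is a valid 2-D matrix.

```rust
use std::convert::TryFrom;

/// Stand-in for an Arrow tensor (NOT the `arrow` crate's actual type).
/// Data is assumed to be dense and row-major.
#[derive(Debug, Clone, PartialEq)]
struct ArrowTensor {
    shape: Vec<usize>,
    data: Vec<f64>,
}

/// Stand-in for a library's own dense row-major matrix type.
#[derive(Debug, Clone, PartialEq)]
struct Matrix {
    nrows: usize,
    ncols: usize,
    data: Vec<f64>,
}

// In a real crate these impls could sit behind a feature gate such as
// `#[cfg(feature = "arrow")]` (hypothetical feature name).
impl From<Matrix> for ArrowTensor {
    fn from(m: Matrix) -> Self {
        // Moving the Vec hands over the buffer without copying elements.
        ArrowTensor {
            shape: vec![m.nrows, m.ncols],
            data: m.data,
        }
    }
}

impl TryFrom<ArrowTensor> for Matrix {
    type Error = String;

    fn try_from(t: ArrowTensor) -> Result<Self, Self::Error> {
        // Only 2-D tensors whose shape matches the buffer length convert.
        if t.shape.len() != 2 || t.shape[0] * t.shape[1] != t.data.len() {
            return Err(format!("cannot view shape {:?} as a matrix", t.shape));
        }
        Ok(Matrix {
            nrows: t.shape[0],
            ncols: t.shape[1],
            data: t.data,
        })
    }
}
```

Implementing `From<Matrix> for ArrowTensor` also provides `Into<ArrowTensor> for Matrix` automatically, so users can write `let t: ArrowTensor = m.into();`.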
Longer term you could adopt Arrow internally and build new features for your own libraries that leverage the new building blocks that Arrow could provide.
For instance, here is the start of work on IPC (the tensor definition you found is actually for flatbuffers, which Arrow uses for IPC).
Or maybe just join the mailing list and monitor how the project progresses. Feel free to raise issues for anything that is stopping adoption for you; I'd be willing to take a look. Of course, we would love to have you as contributors also...
Originally posted by @paddyhoran in https://github.com/rustsim/nalgebra/issues/473#issuecomment-439575367
Hey, just wanted to leave a note here on this discussion. I have been contributing to ApacheArrow. I added a tensor implementation, but this was mostly following the C++ implementation.
The Rust implementation is relatively new but is growing steadily. I was hoping in time to reach out to projects like nalgebra and ndarray to discuss how we can have different projects interoperate with each other; I'm just not there yet. This is the whole aim behind Arrow: a common memory layout for different projects to build on. I believe that a robust Arrow implementation for Rust could be more important than for the other languages because the Rust community is small, so fragmentation of the parties interested in scientific computing is a risk. However, if we were all building off a common base, or more likely provided ways to convert to and from a common base, it could help the community grow. Just wanted to get your thoughts on this and let you know about Arrow.
Thanks