Open v0dro opened 7 years ago
Could you clarify what is meant by "broadcasting data structures"?
I've recently been playing around a bit with PyArrow and so far it seems like a performant internal data structure.
https://pyarrow.readthedocs.io/en/latest/ https://www.slideshare.net/wesm/python-data-wrangling-preparing-for-the-future https://github.com/SciRuby/daru/issues/164
If it were integrated into Daru, it might open Daru up to more of the Apache "Big Data" ecosystem, which would be nice. But I have no experience integrating C with Ruby, so I don't really know what kind of effort this would require.
Broadcasting would basically involve changing the internal data structures in such a way that they are more efficient and reduce copying of data whenever possible. For example, pandas uses numpy internally, which supports broadcasting and hence makes pandas fast.
This will mainly involve choosing an appropriate matrix library, such as Numo::NArray or NMatrix, and integrating it with daru. Some changes might be required in the matrix library to fully support broadcasting.
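To make the term concrete, here is a minimal pure-Ruby sketch of NumPy-style broadcasting rules, assuming the usual convention that shapes are aligned from the trailing dimension and a dimension of size 1 stretches to match. `broadcast_shape` and `broadcast_add` are hypothetical helpers written for illustration only; they are not part of daru, NMatrix, or Numo::NArray.

```ruby
# Illustration only: hypothetical helpers, not daru/NMatrix/Numo APIs.

# Compute the broadcast result shape of two shapes.
# Shapes are aligned from the trailing dimension; a dimension of size 1
# is compatible with any size.
def broadcast_shape(a, b)
  len = [a.size, b.size].max
  a = [1] * (len - a.size) + a
  b = [1] * (len - b.size) + b
  a.zip(b).map do |x, y|
    raise ArgumentError, "incompatible shapes" unless x == y || x == 1 || y == 1
    [x, y].max
  end
end

# Add a 1-D row of length n to every row of an m x n matrix.
# The row is reused for each matrix row instead of first being
# copied into a full m x n matrix.
def broadcast_add(matrix, row)
  broadcast_shape([matrix.size, matrix.first.size], [row.size]) # validates shapes
  matrix.map { |r| r.each_with_index.map { |v, j| v + row[j] } }
end

matrix = [[1, 2, 3], [4, 5, 6]]
row    = [10, 20, 30]
result = broadcast_add(matrix, row)
```

A native matrix library would do the same thing inside a C loop over typed memory, which is where the speed-up comes from; the sketch above only shows the shape rules.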
FYI:
Apache Arrow is implementing a Tensor object:
Pandas will use it in 2.0:
I'm working on Ruby bindings for Apache Arrow. They are already partially included in Apache Arrow:
@kou do you think we should leapfrog to using Apache Arrow Tensor directly for internal storage? I am seriously considering an overhaul of the daru storage infrastructure given the speed bottlenecks caused by creation of Ruby objects.
If it can be done in a transparent and dependency-free manner, can you please elaborate on how we can proceed with implementing this in daru?
@mrkn if you have experience with arrow can you please shed some light on this?
daru can use NMatrix or Numo::NArray for its internal storage. The Red Data Tools project provides libraries that convert between them at low cost via Apache Arrow:
NMatrix#to_arrow and Arrow::Tensor#to_nmatrix
Numo::*#to_arrow and Arrow::Tensor#to_narray
You can convert between NMatrix and Numo::NArray as follows:
nmatrix.to_arrow.to_narray # => Numo::*
narray.to_arrow.to_nmatrix # => NMatrix
For now, Apache Arrow focuses on the data format. It doesn't implement data operations yet. They will be implemented after Apache Arrow 1.0.0 is released.
Currently, daru uses Arrays for storing data inside Vectors, which are collectively stored inside a dataframe.
However, this approach reduces the speed of most mathematical operations due to everything being a Ruby object and all the looping operations happening in Ruby.
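As a rough sketch of the difference in storage layout, the following stores floats in one contiguous binary buffer instead of an Array of Ruby objects. `PackedVector` is a hypothetical class invented for this illustration and is not part of daru. Note that this pure-Ruby version still decodes values back into Ruby objects on every read, so it only illustrates the layout; the speed benefit comes when a native library such as NMatrix or Numo::NArray runs the loops in C over such a buffer.

```ruby
# Hypothetical illustration; PackedVector is not a daru class.
class PackedVector
  BYTES_PER_DOUBLE = 8

  def initialize(values)
    # "E" packs each number as a little-endian 64-bit double,
    # so the whole vector lives in a single String buffer.
    @buffer = values.map(&:to_f).pack("E*")
  end

  # Number of elements stored in the buffer.
  def size
    @buffer.bytesize / BYTES_PER_DOUBLE
  end

  # Decode the i-th element from its 8-byte slot.
  def [](i)
    @buffer.byteslice(i * BYTES_PER_DOUBLE, BYTES_PER_DOUBLE).unpack1("E")
  end

  # Decode the whole buffer and sum it.
  def sum
    @buffer.unpack("E*").sum
  end

  # Size of the raw storage in bytes.
  def bytesize
    @buffer.bytesize
  end
end

vector = PackedVector.new([1.5, 2.5, 4.0])
total  = vector.sum
```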
I would like to explore options for re-implementing daru's internal data structures in something like NMatrix or Numo::NArray for more efficient storage of data. @genya0407's sake gem does this to some extent, but it is still not as widely used as pandas.
This will most probably make use of broadcasting data structures. In the interest of speed, do you all think it would be alright to sacrifice compatibility with JRuby? Since NMatrix has a Java backend, how about implementing broadcasting in NMatrix and rewriting daru's internals using NMatrix?
Please pitch in with your ideas in this thread.
CC: @mrkn @zverok @genya0407 @gnilrets @lokeshh @kozo2