SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License

Use of broadcasting data structures for daru internals. #328

Open v0dro opened 7 years ago

v0dro commented 7 years ago

Currently, daru uses Arrays for storing data inside Vectors, which are collectively stored inside a dataframe.

However, this approach reduces the speed of most mathematical operations due to everything being a Ruby object and all the looping operations happening in Ruby.
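To make that cost concrete, here is a minimal sketch of an element-wise operation over Array-backed storage. `ArrayBackedVector` is a hypothetical stand-in written for this illustration, not daru's actual `Daru::Vector` code; the point is only that the loop and every intermediate value are handled by the Ruby interpreter.

```ruby
# Hypothetical stand-in for an Array-backed vector (an assumption for
# illustration; this is not daru's real implementation).
class ArrayBackedVector
  attr_reader :data

  def initialize(data)
    @data = data # plain Ruby Array; each element is handled individually by the interpreter
  end

  # Element-wise addition: the iteration itself runs in Ruby, one block
  # call per element, and the result is a brand-new Array.
  def +(other)
    raise ArgumentError, "size mismatch" unless data.size == other.data.size

    ArrayBackedVector.new(data.zip(other.data).map { |a, b| a + b })
  end
end

a = ArrayBackedVector.new([1.0, 2.0, 3.0])
b = ArrayBackedVector.new([10.0, 20.0, 30.0])
sum = (a + b).data
```

A typed-array library would typically run the same loop in native code over a contiguous buffer, which is where the expected speed-up comes from.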

I would like to explore alternatives, such as re-implementing daru's internal data structures on top of something like NMatrix or Numo::NArray, for more efficient storage of data. @genya0407's sake gem does this to some extent, but it is still not as widely used as pandas.

This will most probably make use of broadcasting data structures. In the interest of speed, do you all think it would be alright to sacrifice compatibility with JRuby? Since NMatrix has a Java backend, how about implementing broadcasting in NMatrix and rewriting daru's internals using NMatrix?

Please pitch in with your ideas in this thread.

CC: @mrkn @zverok @genya0407 @gnilrets @lokeshh @kozo2

gnilrets commented 7 years ago

Could you clarify what is meant by "broadcasting data structures"?

I've recently been playing around a bit with PyArrow and so far it seems like a performant internal data structure.

https://pyarrow.readthedocs.io/en/latest/ https://www.slideshare.net/wesm/python-data-wrangling-preparing-for-the-future https://github.com/SciRuby/daru/issues/164

If it were integrated into Daru, it may open Daru to more of the Apache "Big Data" ecosystem, which would be nice. But I have no experience integrating C with Ruby, so I don't really know what kind of effort this would require.

v0dro commented 7 years ago

Broadcasting would basically involve changing the internal data structures in such a way that they are more efficient and reduce copying of data whenever possible. For example, pandas uses numpy internally, which supports broadcasting and hence makes pandas fast.

This will mainly involve choosing an appropriate matrix library like Numo::NArray or NMatrix and integrating them with daru. Some changes might be required in the matrix libraries to fully support broadcasting.
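For reference, broadcasting in the NumPy sense means that operands of different shapes are combined element-wise by aligning their shapes from the trailing dimension and virtually stretching any dimension of size 1, without first building a repeated copy of the smaller operand. The following pure-Ruby sketch only illustrates that rule for a 2-D matrix and a 1-D row; `broadcast_shape` and `broadcast_add` are hypothetical helper names, non-empty inputs are assumed, and a real matrix library would do this in native code rather than in Ruby loops.

```ruby
# Compute the result shape of broadcasting two shapes, following the
# NumPy rule: align from the trailing dimension; sizes must be equal
# or one of them must be 1.
def broadcast_shape(a, b)
  rank = [a.size, b.size].max
  a = [1] * (rank - a.size) + a # left-pad the shorter shape with 1s
  b = [1] * (rank - b.size) + b
  a.zip(b).map do |x, y|
    raise ArgumentError, "incompatible shapes" unless x == y || x == 1 || y == 1

    [x, y].max
  end
end

# Add a 1-D row to every row of a 2-D matrix (nested Arrays, assumed
# non-empty). A dimension of size 1 is reused by indexing element 0,
# so no tiled copy of the smaller operand is ever created.
def broadcast_add(matrix, row)
  n_rows, n_cols = broadcast_shape([matrix.size, matrix.first.size], [row.size])
  Array.new(n_rows) do |i|
    Array.new(n_cols) do |j|
      m = matrix[i]
      m[m.size == 1 ? 0 : j] + row[row.size == 1 ? 0 : j]
    end
  end
end

matrix = [[1, 2, 3], [4, 5, 6]]
shifted = broadcast_add(matrix, [10, 20, 30]) # shapes [2, 3] and [3]
offset  = broadcast_add(matrix, [100])        # shapes [2, 3] and [1]
```

Here `shifted` is `[[11, 22, 33], [14, 25, 36]]` and `offset` is `[[101, 102, 103], [104, 105, 106]]`, while a row of length 2 would raise `ArgumentError` because 3 and 2 cannot be broadcast together.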

kou commented 7 years ago

FYI:

Apache Arrow is implementing a Tensor object:

Pandas will use it in 2.0:

I'm working on Ruby bindings of Apache Arrow. They are already included in Apache Arrow partially:

v0dro commented 7 years ago

@kou do you think we should leapfrog to using Apache Arrow Tensor directly for internal storage? I am seriously considering an overhaul of the daru storage infrastructure given the speed bottlenecks caused by creation of Ruby objects.

If it can be done in a transparent and dependency-free manner, can you please elaborate on how we can proceed for implementing this in daru?

v0dro commented 7 years ago

@mrkn if you have experience with arrow can you please shed some light on this?

kou commented 7 years ago

daru can use NMatrix or Numo::NArray as its internal storage object. The Red Data Tools project provides libraries to convert between them at low cost via Apache Arrow:

You can convert between NMatrix and Numo::NArray as follows:

nmatrix.to_arrow.to_narray # => Numo::*
narray.to_arrow.to_nmatrix # => NMatrix

For now, Apache Arrow focuses on the data format. It doesn't implement data operations yet. They will be implemented after Apache Arrow 1.0.0 is released.