SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License

Status of this Project? #531

Open jonspalmer opened 4 years ago

jonspalmer commented 4 years ago

Can anyone share thoughts on the state of this project?

This project contains a lot of amazing work and I'd love to be able to use it in a bunch of Ruby/Rails projects. However, there are a few issues:

  1. This project looks somewhat abandoned
  2. Lack of support for Ruby > 2.4 is a blocker for us (#505)
  3. Several dependencies on projects that also appear to be somewhat abandoned.

Questions:

I'd be happy to contribute ideas and time on plotting a path forward, but I'd love to be part of a team that is driving towards that.

baarkerlounger commented 4 years ago

@v0dro @zverok I think you'd be best placed to answer this. Fair to say neither of you are working on this project anymore?

zverok commented 4 years ago

To the best of my knowledge, the answer is as follows.

Short answer: yes, it seems nobody is actively working on the project at the moment.

Long answer: the library was created by @v0dro and, from some point on, evolved mostly through contributions from SciRuby's students. I was involved with the library's development for a while (mostly as a SciRuby mentor), but since I retired from SciRuby, I am not anymore.

Unfortunately, I have lost touch with Sameer (@v0dro) since then; the last I knew, he was working on Rubex, which also seems dormant now. Judging from his latest blog entries, it seems that he (like me before him) became disillusioned with the prospects of winning the scientific community's attention for Ruby and switched to Python, but that is just my assumption.

ankane commented 4 years ago

Hey @zverok, thanks for the update, and all of your work on SciRuby.

Since it sounds like Daru is in a holding pattern, I put together a take on data frames for Ruby called Rover. It uses Numo internally for performance (similar to how Pandas uses NumPy), and I've moved Prophet to it, as well as added support for it to XGBoost and LightGBM. I think there's a lot more to do, but hopefully it provides a good starting point. Would love to hear others' thoughts on the topic; feel free to open an issue to discuss.
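To give a quick flavor of the API, a rough sketch (the data and the file name here are just placeholders):

require "rover"

# Build a data frame from a hash of columns
df = Rover::DataFrame.new({"name" => ["Smith", "Jones", "Khan"], "q1" => [100, 200, 150]})

# Or read one from a CSV file
df = Rover.read_csv("sales.csv")

# Column access returns a Rover::Vector, which wraps a Numo array internally
df["q1"]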

baarkerlounger commented 4 years ago

@ankane I'm curious, would you mind giving a bit more background: where you see the Rover project going, why the start-from-scratch approach, and how you think it compares to Daru (and Pandas) currently?

ankane commented 4 years ago

@baarkerlounger The end goal is to make data analysis and machine learning as enjoyable as possible in Ruby. The main difference from Daru is that it's based on Numo, for both performance and easy integration with Rumale.
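For example, getting from a data frame into a Rumale model should be roughly this direct (a sketch only; the column names are made up, and it relies on Rover's to_numo conversion):

require "rover"
require "rumale"

df = Rover::DataFrame.new({
  "x1" => [1.0, 2.0, 3.0, 4.0],
  "x2" => [2.0, 1.0, 0.5, 0.25],
  "y"  => [3.0, 5.0, 7.0, 9.0]
})

# Convert features and target to Numo arrays that Rumale can consume
x = df[["x1", "x2"]].to_numo
y = df["y"].to_numo

model = Rumale::LinearModel::LinearRegression.new
model.fit(x, y)
model.predict(x)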

arbox commented 4 years ago

@ankane, following your work, I'd like to ask whether it would be feasible to join forces and leverage Daru's codebase in your Rover project?

ankane commented 4 years ago

@arbox I'm happy to chat on specific features / ideas, but want to give this a fresh look (and think there's limited benefit to reusing the codebase).

kojix2 commented 4 years ago

My idea of a fancy Ruby DataFrame looks like this.

At first sight it looks a bit like Daru, but it's much faster because it uses numo-narray or Apache Arrow on the backend. The feature set is simple and the code looks clean.

Now consider modifying Daru instead. In that case, you need to remove a lot of Daru's code, and that decision may be difficult for a non-founder to make because it may break compatibility.

So it's reasonable for someone to start new projects. The mountains won't get higher if they don't have a wider base.

(Ruby is an object-oriented language. How do we integrate functions into an easy-to-use data frame? I think this is the big question that the current version of Daru has left open... There should be lots of different ideas and approaches.)

jonspalmer commented 4 years ago

Here is my take. There is a lot to like about this project. The code is very clean and has some nice abstractions. However, as I stated in the initial question, there are some challenges today in terms of dependencies and support for Ruby > 2.4.

@ankane I'd disagree that there is little benefit to be gained from the existing codebase. There are a lot of features you don't have in Rover yet. Plus, the implementation you have at the moment is strongly coupled to Numo.

@kojix2 We can solve the problems you mention. I don't think it requires removing a lot of code (I share a desire for that direction but don't believe it's necessary: https://github.com/kojix2/chai). Also, I'm sure we can solve 'non-founder' ownership and breaking changes; it happens all the time.

The mountains won't get higher if they don't have a wider base.

I agree with this, but let's make sure we build a mountain and don't end up with a bunch of half-finished hills 😄

IMO there is value in a Ruby implementation of DataFrames. For me it's a convenient way to transport and mutate 'matrix'-like data. It's not required that it be blazingly fast or handle huge data sets. Nor, IMO, is it a requirement that you be able to do everything in a Jupyter notebook. At a certain point, if you want all those features, the simpler leap is to go to Python and Pandas.

However, I'd love to see this project get a bit more life in it and be updated to make it more usable. My wish list would be:

  1. Add a configuration mechanism for the underlying "Array" storage (a rough sketch of what I mean follows this list).
    1. Remove the core dependencies on NMatrix & GSL. Perhaps we publish them as separate gems for backwards compatibility, but frankly they're so old and unsupported that I'm not convinced it's worth it.
    2. Add an implementation of Array storage backed by Numo.
  2. Add a configuration mechanism for the plotting library.
    1. Remove the core dependency on Gruff and publish it as a separate gem.
  3. Update the DataFrame API to match Pandas much more closely.
    1. Rename some functions.
    2. Add missing features (I have some as monkey patches).
  4. Clean up the Ruby backports and other hacks.
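To be concrete about item 1, here is the rough shape of what I have in mind. None of this exists in Daru today: Daru.configure, default_storage and the adapter class are all hypothetical.

require "numo/narray"

# Hypothetical configuration API (not part of Daru today)
Daru.configure do |config|
  config.default_storage = :numo   # or :array (plain Ruby), :nmatrix (separate plugin gem)
end

# A storage adapter would only need to implement a small, well-defined surface
class NumoStorage
  def initialize(data)
    @data = Numo::NArray.cast(data)
  end

  def [](index)
    @data[index]
  end

  def []=(index, value)
    @data[index] = value
  end

  def size
    @data.size
  end

  def to_a
    @data.to_a
  end
end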

I don't think any of this is particularly difficult. There is a pretty long list of features to close the gap with Pandas, but they can be done one at a time with easy-to-review PRs.

I'd be happy to contribute, and I'd be happy to lead, but I don't have the bandwidth to do it alone. If there are others willing to help, we can do it. @kojix2 @ankane @arbox @baarkerlounger and others, are you up for giving it a go?

zverok commented 4 years ago

Just to add one more variety to the already existing list of opinions!

I believe there are two not-really-related questions about a "dataframe implementation" in Ruby:

  1. Interface (as in API)
  2. Implementation (including performance and most of the "features")

My interest was always around (1): how we might create a generally useful dataframe, one which would be obviously helpful for everyone in the community (in fact, my involvement with Daru started with this 4-year-old blog post, and that was the direction of my work in a maintainer role). "Generally useful", for me, also implies "playing naturally" with all the Rubyist intuitions emerging from Enumerable, Hash, Array etc. I once dreamed of an ever-present "dataframe API" that could suit "backends" as different as a SQL DB, a CSV file or Apache Arrow. And I imagine that this work might also lead to some useful "micro-metaphors" being accepted as idioms for other collections and libraries.

From what I see, most of the other people involved in the development of Daru or other dataframe libraries have mostly focused on dataframes that would a) be fast for large data processing; b) include a vast set of features (with the emphasis on "it is possible/done in one method", even at the price of API consistency and learnability); c) frequently, look familiar to "those who have already worked with Pandas" rather than "those who have already worked with Ruby's Array/Hash".

I'd imagine some ideal several-layer project (a rough sketch follows the list), with:

  1. one layer focused on a relatively small and consistent Ruby DataFrame API,
  2. another on "just features": lots of additional methods that can be implemented in terms of said API,
  3. yet another on backends for said API (even ones as different as a "DataFrame interface to a PostgreSQL table", which would convert all API calls into SQL statements)
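To make the layering concrete, here is a very rough sketch; every name in it is invented purely for illustration and is not Daru's (or anyone's) actual API:

# Layer 1: a minimal core that every backend must implement
# (each backend defines: #column_names, #column(name), #row_count)

# Layer 2: "just features", written only in terms of the core API
module FrameFeatures
  def to_h
    column_names.to_h { |name| [name, column(name)] }
  end

  def width
    column_names.size
  end
end

# Layer 3: backends; an in-memory one here, but a CSV- or SQL-backed one
# would expose exactly the same core methods
class HashFrame
  include FrameFeatures

  def initialize(columns)
    @columns = columns   # e.g. { "q1" => [100, 200], "q2" => [100, 180] }
  end

  def column_names
    @columns.keys
  end

  def column(name)
    @columns.fetch(name)
  end

  def row_count
    @columns.values.first&.size || 0
  end
end

HashFrame.new("q1" => [100, 200], "q2" => [100, 180]).to_h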

I can see that some parts of Daru might be salvaged for "layer 2" (various features), but in terms of API, I never really liked it, and I don't believe it can be "gradually improved" (I tried for several months in 2018 and finally just gave up).

TBH, I even have my own (unfinished and unpublished) "completely new DF project" with an API I really love, but... I don't see an audience for it anymore.

jonspalmer commented 4 years ago

@zverok Thanks for sharing your perspective. I share a very keen desire for (1) too; that seems fundamental. As I said, if you really care about performance/scale, then IMO you'll have better luck using Pandas. It's likely to always be faster, etc. That doesn't stop us from having "fast" options in Ruby, but that speed shouldn't come at the cost of API clarity.

I think there is a balance to be struck between having a similar API to Pandas and feeling like Ruby. There is huge value in a Ruby DataFrame that, at a minimum, has the same features, with the same names, as Pandas. Perhaps it is as simple as naming.

Having used Daru DataFrames in a project in anger, I haven't felt the same sense of "it's not Ruby" that you suggest. I'd like to understand that more. So, two follow-up questions:

  1. Can you share some examples of "it is possible/done in one method" in Daru? I agree they exist, but I'm curious which ones stand out to you.
  2. Can you share some places where Daru "doesn't feel like Ruby" and a sketch of how we could make an API that is better suited to Rubyists?

zverok commented 4 years ago

@jonspalmer It is hard for me to answer the questions in detail: both technically, because I haven't been involved with Daru for ~2 years now, and ethically, because a lot of its features were implemented either by Sameer, who is the original author, or by the SciRuby students we were mentoring, so pointing at any particular feature would also be pointing at the person who invented/implemented it.

Also, I have changed my mind about "what's best" several times, and I am currently not even sure whether the "really useful" solution will be one class (DataFrame) or a family of classes (like "a table that is first and foremost for navigation/presentation" and "a table that is first and foremost for complex math").

I suspect that the base API design decisions are very fundamental, and that it is hard to change them once settled. Among them I'd list: indexing/enumerating; what the "indexes" are (one or two of them, nesting of indexes and other behaviors); the desired level of "mathematicity" of the DF (whether df1 + df2 is "the sum of each pair of elements" or more like Array#+); what the policy on supported data types is; copy-on-write semantics; etc., etc.

jonspalmer commented 4 years ago

@zverok

Also, I have changed my mind about "what's best" several times, and I am currently not even sure whether the "really useful" solution will be one class (DataFrame) or a family of classes (like "a table that is first and foremost for navigation/presentation" and "a table that is first and foremost for complex math").

IMO the answer to this is very clear: one class. The Pandas API is strong in this regard. From my perspective there isn't any need for more than two main objects, DataFrame and Vector (and perhaps we call it Vector because that's what Ruby's Matrix class uses, versus Pandas's Series). Having the arithmetic and statistical functions as first-class methods on DataFrame and Vector is simple and doesn't get in the way if you don't need it.

For things like df1 + df2 it sort of doesn't matter. Provided there is a clear method with a good name that the + operator is an alias for, it's just syntactic sugar. I don't believe any particular choice is more Ruby-like than any other. We should pick one, ideally Pandas-like, and document it. Similarly, for things like copy-on-write you just need to decide, document, and provide an option to change the default. This isn't really a Ruby or not-Ruby concern; it's just picking an API.
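For instance, something along these lines (a sketch with an invented Vector class, not Daru's actual one) keeps + as pure sugar over a documented, named method:

class Vector
  def initialize(values)
    @values = values
  end

  # The documented, named operation
  def add(other)
    Vector.new(@values.map { |v| v + other })
  end

  # The operator is nothing more than an alias for it
  alias_method :+, :add
end

Vector.new([1, 2, 3]) + 10    # identical to Vector.new([1, 2, 3]).add(10)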

WRT indexing, I'm not sure there is a Ruby right or wrong answer. You could argue that we should mimic the dual-index treatment in Ruby's Matrix class, but honestly it's not obviously right either. Consistency with that API doesn't bother me too much.

To summarize, I don't see these as big challenges. We could just decide and start moving things to a better place.

zverok commented 4 years ago

@jonspalmer I believe you seriously underestimate the design space (and how much design decisions affect the library's usability). Let's stick to just one example, in two parts (it is just to illustrate the point). Imagine you have data shaped this way:

         Q1   Q2   Q3 Total
      ---------------------
Smith | 100  100  150   350
Jones | 200  180  200   580
Khan  | 150  180  180   510
Total | 450  460  530  1440

Question 1 (the simpler one, but the very first in the design, which every DF designer handles somehow): how do you address the "Total" column and the "Total" row? Everybody tends to start with...

df['Total'] # column or row?

...and there are many different ways to handle it :)

Now, to dive into some details. (For the sake of simplicity, let's say we decided that df.col['Total'] is the way to address columns.) How do you express "add 10 to (each value of) column Total"?

  1. df.col['Total'] += 10 is the "obvious" choice, but it is performed as df.col['Total'] = df.col['Total'] + 10, i.e. "produce an intermediate array, then replace the column in the dataframe", which, without some clever internal optimizations, could be very inefficient.
  2. df.col['Total'].map! { |val| val + 10 } is "just regular Ruby", but (again, without some complicated tricks) it is bound to run row-by-row, effectively prohibiting "backend optimization" (e.g. if the internal Vector is some fast C numeric array with an optimized "increase each value" operation) or complicated backends (like those that would really convert it into UPDATE table SET Total=Total+10 instead of performing it literally).
  3. something like df.col['Total'].increase(10) kinda "solves" the two problems above, but now it stops looking "elegant" (implying there would be a custom method for every math operation, and something like ((value + 10) / 2).round(3).clamp(0..100) would be inexpressible at all).
  4. ...and so on :)

I am honestly not sure that "whatever, let's design it some way and that will be it" is an approach that will lead anywhere useful. In fact, the several existing (and incompatible) dataframe libraries that Ruby has (besides Daru and the new kid Rover, there are... some: 1, 2, 3 (at one point endorsed by the Ruby Association), 4, 5, etc.) clearly demonstrate that.

jonspalmer commented 4 years ago

Question 1: how do you address the "Total" column and the "Total" row? df['Total'] # column or row?

Answer: a) You just have to choose; there is no right or wrong. b) Many/most DataFrame implementations have to decide what the fundamental objects are, and that decision should inform the answer. So for Pandas and Daru the right answer is "column", because columns are the first-class object and rows are not. It's therefore natural for the "first index" to represent columns (which also encourages efficient data access).
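In other words, the convention I'd expect is something like this (hypothetical accessors, just to illustrate the choice):

df['Total']        # the "Total" column (columns are the first-class object)
df.row['Total']    # the "Total" row, reached through an explicit row accessor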

"add 10 to (each value of) column Total"

df.col['Total'] += 10 is typically syntactic sugar for df.col['Total'] = df.col['Total'] + 10, which in turn is syntactic sugar for df.col['Total'] = df.col['Total'].add(10). So now there are two things to consider:

df.col['Total'].map! { |val| val + 10 }

a) Most APIs would expect df.col['Total'] to produce a copy of the column, so this example doesn't fix the assignment problem. b) The column methods map!(&block) and add(value) are both great Ruby methods, but they are not the same: map and map! assume you're going to do something different with each element of the column, so they can't/shouldn't be optimized for the case where you happen to do the exact same thing to each element. If you want that, use the add method.

df.col['Total'].increase(10)

We have the column-copy problem again. Your increase method is just the existing add method on Daru::Vector. Seems fine to me.
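To illustrate the map-versus-add distinction, here is a sketch with a made-up Numo-backed vector (not Daru's actual Vector class):

require "numo/narray"

class NumoVector
  def initialize(values)
    @data = Numo::DFloat.cast(values)
  end

  # Whole-vector arithmetic: one optimized Numo operation, no Ruby-level loop
  def add(value)
    NumoVector.new(@data + value)
  end

  # Element-wise block: visits every value in Ruby, so the backend cannot
  # optimize it; that's the price of the more general form
  def map(&block)
    NumoVector.new(@data.to_a.map(&block))
  end
end

v = NumoVector.new([100, 200, 150])
v.add(10)               # vectorized
v.map { |x| x + 10 }    # same result, computed element by element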

to your larger example:

col = df.col['Total']
new_col = ((col + 10) / 2).round(3).clamp(0..100)
# or
new_col = col.add(10).div(2).round(3).clamp(0..100)
# is this more efficient?
new_col = col.map { |v| ((v + 10) / 2).round(3).clamp(0..100) }

I don't know which would be faster/more efficient; whatever the answer, it would need to be carefully measured. The DataFrame API can't anticipate every use case; it can only provide reasonable, well-named building blocks that consumers can use to solve their particular problems.

Are there cases where you really, really want to do things 'in place'? Perhaps, but those use cases are going to be very specific, and the 'right way' to optimize them is going to be very subtle.

To take your example of wanting to "add 10 to Total": you could argue that we really need to manipulate "Total" in place because it's more efficient. However, it's more likely that the situation is something like this:

         Q1   Q2   Q3
      ---------------
Smith | 100  100  150
Jones | 200  180  200
Khan  | 150  180  180

df.col['Total'] = df.col['Q1'] + df.col['Q2'] + df.col['Q3']

So now you could say, "Hey, I really want to add 10 to Total in place. It sucks that this is so inefficient":

df.col['Total'] = df.col['Total'] + 10

But that's the wrong problem to go after. It would be far better to simply fix it when you generate "Total" the first time (it's user error, not API error):

df.col['Total'] = df.col[['Q1', 'Q2', 'Q3']].sum(axis: :column) + 10
df.row['Total'] = df.sum(axis: :row)

Which doesn't require inefficient intermediate columns/rows.

My argument isn't that we should blindly build an API and hope it works out. Instead, we should carefully build something clear, flexible, and consistent that allows consumers to solve their problems. We cannot, and should not, expect the API to solve every corner case cleanly or naturally. Specific problems will require specific solutions. The more specific the problem, the less likely the solution will be "elegant", but that's an entirely normal and expected tradeoff.

From my perspective, a lot of work and real-world use cases have gone into the Pandas API, and it is powerful, feature-rich, and natural to use. The choices and design there would be very natural to replicate in Ruby (with the possible exception of Python's slice operator being a bit more flexible than Ruby ranges). Daru has made a great initial set of steps towards replicating that API. IMO we should continue that work and bring it up to date with the current state of Ruby and Pandas. I'm not totally clear what you are proposing as an alternative?

v0dro commented 4 years ago

Status update on daru. CC: @zverok @jonspalmer @kojix2

I had a brief e-mail conversation with @jonspalmer and have promised him to update daru on the following points:

  1. Make the gem compatible with Ruby > 2.4.
  2. Close or merge pending issues and PRs.
  3. Remove the nmatrix dependency and make it a separate plugin.
  4. Release a new gem version.

However, I am currently in grad school and using a lot of low-level C++/C/FORTRAN for my research, and have therefore lost touch with data analysis. I will be happy if someone else is willing to take over the project. As @kojix2 says, having a dataframe with support for Arrow would be great; relying on Ruby for speed is a bad idea. However, all this will require a central point of contact (i.e. a maintainer willing to commit a few hours a week).

My take on the future direction of daru is that we should forget the scientific computing audience and let them be happy with Julia/Python/R. Our real audience lies in the Ruby community (web dev etc.). This was pointed out by @zverok much earlier, and I agree that his suggested course of action would have been appropriate.

BTW, Numo has some speed issues due to the data representation it uses (the last time I checked was more than 6 months ago) and I'm not sure if they've been resolved. @prasunanand can fill in on this better, I believe.

v0dro commented 4 years ago

A new daru version has been released and all old PRs have been merged/closed.

jonspalmer commented 3 years ago

@v0dro Following up on this: what is the new version that has been released? I don't see any new tags here.

v0dro commented 3 years ago

My bad, I forgot to tag it. You can see it on RubyGems here: https://rubygems.org/gems/daru

Version 0.3