SciRuby / statsample

A suite for basic and advanced statistics on Ruby.
https://github.com/sciruby/statsample
BSD 3-Clause "New" or "Revised" License
95 stars 29 forks source link

Change statsample to use daru data structures #35

Closed v0dro closed 9 years ago

v0dro commented 9 years ago

Makes bivariate compatible with daru.

@agarie if you want to test on your computer please pull my daru/master.

PS: I think it would be best if we kept this PR open until I'm entirely done with this task. Will add commits as they are ready.

agarie commented 9 years ago

Sameer, before we move on to use Daru::DataFrame and Daru::Vector as the defaults in Statsample, we need to have in mind the situation in which a user has NMatrix or GSL installed (or not). So:

  1. No NMatrix, no GSL.
  2. No NMatrix, GSL. (IIRC this should be possible as I think that rb-gsl works without NArray; there are checks for NArray availability in its code)
  3. NMatrix, no GSL.
  4. NMatrix and GSL.

We need to be very explicit about the advantages and disadvantages of each situation. Of course, when gem install nmatrix starts working out of the box with no external dependencies, it will become reasonable to require it to be installed alongside Daru and Statsample.

My point is: we need to document what we're doing and what it means for a user. I still have some opinions and suggestions, but I can't write them right now. Also, we need passing tests!! :P

v0dro commented 9 years ago

I shall address your concerns one by one:

  1. Daru has been designed (NMatrixWrapper, GSLWrapper, etc. ) such that it provides the exact same functionality with or without NMatrix or GSL. So is statsample. The only drawback will be efficiency of storage and speed of computation due to pure ruby implementations (for example use of ruby Array to store data instead of C arrays as in the case of NMatrix of GSL).
  2. No nmatrix and having GSL will mainly have an implication on statsample's performance because statsample uses GSL optionally, not nmatrix. Stats methods in daru will not improve by using nmatrix because of this bug (for now). Having rb-gsl or gsl-nmatrix and no NMatrix will not affect statsample's usage of GSL methods. This will be possible even in the presence/absence of narray as it is only affected by the presence of NMatrix. Neither daru nor statsample require nmatrix as a mandatory requirement.
  3. If nmatrix is present and GSL absent, it will affect daru, i.e data can be stored in nmatrix, but statsample will still use ruby-defined statistical methods (i.e. will run slower).
  4. NMatrix and rb-gsl (with narray) cannot co-exist. Hence if nmatrix is installed, user will have to use gsl-nmatrix and not rb-gsl. Later when we can safely use nmatrix as the primary data store in daru, it will be possible to directly convert between nmatrix (in daru) to GSL::Vector/Matrix (in statsample) at C level (very efficient). This will be a major boost, where nmatrix and gsl-nmatrix can be used together, and will be the best and fastest.

And yes I'm looking forward to the day when gem install nmatrix just works so that it can be a runtime dependency :P

agarie commented 9 years ago

Good! Just to be clear: I know those points, but it is necessary to have these situations really well explained to a potential user, so no one avoid using Statsample or Daru because "NMatrix is hard to install", "I don't know what is GSL" or whatever.

If you want to test this library out, just gem install daru statsample and read the tutorials!

Is a really, really good way of decreasing the entry barrier. :tada:

agarie commented 9 years ago

wtf

/home/travis/.rvm/gems/ruby-head/gems/backports-3.6.4/lib/backports/1.9.1/stdlib/prime.rb:86:in `<class:Prime>': private method `new' called for Prime:Class (NoMethodError)
# giant backtrace

Huh. Well, the good thing is, I'm trying to find some time to work on Distribution anyway, so this is another bug to fix. I'll open an issue there.

v0dro commented 9 years ago

That might be a bug in ruby-head. Its the unstable branch :P

agarie commented 9 years ago
Finished in 47.594492s, 6.4503 runs/s, 34.2897 assertions/s.

307 runs, 1632 assertions, 0 failures, 0 errors, 0 skips

Good job! I still need to finish reading your code before merging (and the docs, etc). Maybe fix these warnings in the tests about DataFrame, too, but that can wait for another pull request.

agarie commented 9 years ago

There are two errors in the test suite:

  1) Error:
StatsampleGGobiTestCase#test_values_definition:
NameError: uninitialized constant Statsample::SpreadsheetBase
    /home/travis/build/SciRuby/statsample/lib/statsample/converters.rb:51:in `<module:Statsample>'
    /home/travis/build/SciRuby/statsample/lib/statsample/converters.rb:2:in `<top (required)>'
    /home/travis/build/SciRuby/statsample/test/test_ggobi.rb:14:in `test_values_definition'
  2) Error:
StatsampleGGobiTestCase#test_variable_definition:
NameError: uninitialized constant Statsample::SpreadsheetBase
    /home/travis/build/SciRuby/statsample/lib/statsample/converters.rb:51:in `<module:Statsample>'
    /home/travis/build/SciRuby/statsample/lib/statsample/converters.rb:2:in `<top (required)>'
    /home/travis/build/SciRuby/statsample/test/test_ggobi.rb:21:in `test_variable_definition'
agarie commented 9 years ago

I'm really happy with this PR (except for its size ;). I've reviewed most of it already, but I want to take a look at the changes on the tests before merging.

v0dro commented 9 years ago

Oh my bad.

Will fix soon.

Tests have slowed down a bit because of some things that daru does automatically (like determine types and missing data positions) but speed will improve with time. I do have some ideas on that.

agarie commented 9 years ago

There are a lot of things that were inherited from the original source, so I'll let it pass (and fix it afterwards). I just nitpicked on some stuff that should be removed before merging.

These shouldn't take more than a few minutes. As soon as you're finished, I'll merge.

agarie commented 9 years ago

By the way, the most beautiful thing in this PR is...

+2,292 −5,234

agarie commented 9 years ago

Two problems:

  1. Travis build: do not depend on an unreleased library version. One thing at a time. You can update the build script to depend on your commit so we can see the tests passing correctly, but I'm not merging a PR this size without seeing the tests green.
  2. Rebase against the current master! :P

Almost there!

Both statsample-glm and statsample-bivariate-extension should be merged after finishing this one, too, so don't worry.

v0dro commented 9 years ago

Yes I thought I'll release a new version of daru once you approve the code.

Cool I'll do that!

v0dro commented 9 years ago

@agarie its done.

Oh I think the most beautiful thing in the PR is the way all these components are working together so beautifully.