SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License
1.04k stars 139 forks source link

Missing values #208

Closed lokeshh closed 8 years ago

lokeshh commented 8 years ago

Addresses #137

lokeshh commented 8 years ago

I'm not clear about is whether we are removing set_missing_positions or not? If we remove it then its not possible to do the caching for nil and Float::NAN and if we do not remove it then, how will we get rid of lazy update which makes Daru slow.

v0dro commented 8 years ago

Let set_missing_positions remain for now. This will also involve keeping lazy update, but I really don't see a fast way of updating the missing values cache for now. We'll probably need to port it to C or something funky like that.

lokeshh commented 8 years ago

We can also do something as follows:

We will store the missing values positions in two variables, one for nil and one for Float::NAN. They will store the positions of the missing data. Whenever an update is made instead of calling set_missing_positions we will set @missing_positions_outdated = false. And whenever in the future we require operation regarding missing value we will check @missing_positions_outdated, if its false, we need not recompute the missing positions and if its true we need to recompute the missing positions and update the appropriate variables.

WDYT about this approach?

v0dro commented 8 years ago

We will store the missing values positions in two variables, one for nil and one for Float::NAN.

Why do nil and Float::NAN need to be in separate variables?

Whenever an update is made instead of calling set_missing_positions we will set @missing_positions_outdated = false.

Update to what? The vector or the missing positions cache?

And whenever in the future we require operation regarding missing value we will check @missing_positions_outdated, if its false, we need not recompute the missing positions and if its true we need to recompute the missing positions and update the appropriate variables.

Works! Make sure you reflect this everywhere in the code, though.

lokeshh commented 8 years ago

Why do nil and Float::NAN need to be in separate variables?

User might ask for reject_values(nil) or reject_values(Float::NAN).

Update to what? The vector or the missing positions cache?

I meant update to vector.

Works! Make sure you reflect this everywhere in the code, though.

Also we won't require lazy_update anymore now.

lokeshh commented 8 years ago

I'm deprecating the functions has_missing_data? etc., so they will be working but I'm getting rid of their functionality of working with arbitrary assigned missing values like 1, 2, etc. which are other than nil, Float::NAN. In other words missing_values = [1, 2] would have no effect. This makes it easy for me to remove the lazy_update and set_missing_positions. lazy_update and set_missing_positions have to removed to see the expected speed up. Let me know if its fine.

lokeshh commented 8 years ago

Kindly have a look. I've made some radical changes like removed some methods like exist?, missing_values=, etc. and finally got rid of lazy_update, set_missing_positions to speed up Daru. Some common methods like has_missing_data?, only_valid, etc have been deprecated and been kept though. It might cause inconvenience for the next Daru release for methods like missing_values won't be working but I insist removing them as they will lead to better performance. Now I will carry on caching the missing positions of nil and Float::NAN.

lokeshh commented 8 years ago

Here are some benchmarks. The performance gains are staggering.

I ran below code for this branch and master branch:

require 'benchmark'

n = 1000
dv = Daru::Vector.new 1..100000

Benchmark.bm do |x|
  x.report { 1000.times { dv[0, 10, 20, 30] = nil } }
end

Result for current branch:

       user     system      total        real
   0.010000   0.000000   0.010000 (  0.004940)

Result for master branch:

       user     system      total        real
  36.160000   0.260000  36.420000 ( 39.420752)
zverok commented 8 years ago

Here are some benchmarks. The performance gains are staggering.

Fascinating! It's a giant leap forward. Now it looks like we can even think about replacing where with normal select :)

Bravo, @lokeshh!

lokeshh commented 8 years ago

Thanks @zverok.

I have been running benchmarks on implementation with caching of nil's positions and implementation with no caching of nils and surprisingly I'm roughly the same time with caching. I couldn't figure out whats going on. The code seems correct. I'm looking at this in more detail. Is there some gem to profile which lines in the method are taking a long time and which not?

lokeshh commented 8 years ago

It seems to me that its because operations such as #at, #[] are more expensive such that finding missing values in a vector is almost free, therefore caching of nil and Float::NAN is not helping.

Update: Here's the benchmark I used to compare the performance of caching with non-caching implementation:

require 'benchmark'

n = 10000
dv = Daru::Vector.new 1..n

Benchmark.bm do |x|
  x.report do
    dv.reject_values nil
  end

  x.report do
    100.times { dv.reject_values nil }
  end
end

Result with caching:

       user     system      total        real
   0.020000   0.010000   0.030000 (  0.034508)
   1.040000   0.020000   1.060000 (  1.059687)

Result without caching:

       user     system      total        real
   0.030000   0.010000   0.040000 (  0.034908)
   1.330000   0.010000   1.340000 (  1.344577)

I expected to see the second report in caching implementation way less than the second report in non-caching implementation because after first call to reject_values, the positions of nils are getting cached in implementation with caching. But this is not the case, there's very less improvement.

v0dro commented 8 years ago

Hmmm it's interesting to see that caching isn't really helping much. Can you try with profiling and see what's actually going on? I think you can try ruby-prof for this.

zverok commented 8 years ago

I'm looking at this in more detail. Is there some gem to profile which lines in the method are taking a long time and which not?

Yeah, like @v0dro says, ruby-prof is pretty easy to use. You just do something like this:

RubyProf.start
(code you want to profile)
result = RubyProf.stop
printer = RubyProf::GraphHtmlPrinter.new(result)
printer.print(File.open('profile.html', 'w'))

And then that file would have pretty list of what've taken how many time, cross-linked to other methods it uses and stuff.

lokeshh commented 8 years ago

@zverok In our case, the behavior is not exactly not the same, so I guess I will go with writing different specs. Right?

zverok commented 8 years ago

@lokeshh Yep, while extracting the same behaviors into shared example groups.

lokeshh commented 8 years ago

@lokeshh Yep, while extracting the same behaviors into shared example groups.

@zverok Do you mean extracting the same behaviors in common specs and then using include_context?

zverok commented 8 years ago

Yep.

lokeshh commented 8 years ago

Alright. Thanks!

lokeshh commented 8 years ago

The PR is complete. I will start adding documentation. @zverok I will take care of extracting the common behavior later on.

Here's the short summary:

Methods added in Daru::Vector (and category):

Methods added in Daru::DataFrame:

Here are the benchmarks.

Methods removed in Daru::Vector:

and other methods have been deprecated.

lokeshh commented 8 years ago

All done. This is ready to be merged.

v0dro commented 8 years ago

Works. Something will probably break in statsample so you might need to fix it.

lokeshh commented 8 years ago

@v0dro. Done. Modified the Statsample to work with this PR.