Closed lokeshh closed 8 years ago
What I'm not clear about is whether we are removing `set_missing_positions` or not. If we remove it, then it's not possible to do the caching for `nil` and `Float::NAN`; and if we do not remove it, then how will we get rid of the lazy update that makes Daru slow?
Let `set_missing_positions` remain for now. This will also involve keeping the lazy update, but I really don't see a fast way of updating the missing values cache at the moment. We'll probably need to port it to C or something funky like that.
We can also do something as follows:

We will store the missing value positions in two variables, one for `nil` and one for `Float::NAN`. They will store the positions of the missing data. Whenever an update is made, instead of calling `set_missing_positions` we will set `@missing_positions_outdated = true`. And whenever in the future we require an operation regarding missing values, we will check `@missing_positions_outdated`: if it's `false`, we need not recompute the missing positions, and if it's `true`, we need to recompute the missing positions and update the appropriate variables.

WDYT about this approach?
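The scheme above could be sketched roughly like this (class and variable names are illustrative, not Daru's actual internals):

```ruby
# Illustrative sketch of the proposed lazy cache invalidation;
# not Daru's actual implementation.
class CachedVector
  def initialize(data)
    @data = data
    @missing_positions_outdated = true
  end

  # A write no longer recomputes the cache eagerly; it just flags it stale.
  def []=(pos, value)
    @data[pos] = value
    @missing_positions_outdated = true
  end

  # Recompute only when missing positions are actually requested.
  def missing_positions
    if @missing_positions_outdated
      @nil_positions = @data.each_index.select { |i| @data[i].nil? }
      @nan_positions = @data.each_index.select do |i|
        @data[i].is_a?(Float) && @data[i].nan?
      end
      @missing_positions_outdated = false
    end
    { nil: @nil_positions, nan: @nan_positions }
  end
end
```

With this, repeated reads reuse the cached positions; the scan only happens again after the next write.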
> We will store the missing value positions in two variables, one for `nil` and one for `Float::NAN`.

Why do `nil` and `Float::NAN` need to be in separate variables?
> Whenever an update is made, instead of calling `set_missing_positions` we will set `@missing_positions_outdated = true`.

An update to what? The vector or the missing positions cache?
> And whenever in the future we require an operation regarding missing values, we will check `@missing_positions_outdated`: if it's `false`, we need not recompute the missing positions, and if it's `true`, we need to recompute the missing positions and update the appropriate variables.

Works! Make sure you reflect this everywhere in the code, though.
> Why do `nil` and `Float::NAN` need to be in separate variables?

The user might ask for `reject_values(nil)` or `reject_values(Float::NAN)`.
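For illustration, with two separate position caches, a `reject_values(nil)`-style call only needs the `nil` cache. A plain-Ruby sketch (this mimics the idea; it is not Daru's implementation):

```ruby
# Plain-Ruby sketch of why separate position caches are useful.
data = [1, nil, 3, Float::NAN, 5]

# The two caches the proposal describes:
nil_positions = data.each_index.select { |i| data[i].nil? }  # => [1]
nan_positions = data.each_index.select do |i|
  data[i].is_a?(Float) && data[i].nan?
end                                                          # => [3]

# Rejecting only nils consults just the nil cache:
without_nil = data.values_at(*(data.each_index.to_a - nil_positions))
# Rejecting both uses the union of the two caches:
without_both = data.values_at(*(data.each_index.to_a - nil_positions - nan_positions))
# => [1, 3, 5]
```

A single combined cache could not serve `reject_values(nil)` without re-scanning to tell the `nil`s apart from the `NaN`s.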
> An update to what? The vector or the missing positions cache?

I meant an update to the vector.
> Works! Make sure you reflect this everywhere in the code, though.

Also, we won't require `lazy_update` anymore now.
I'm deprecating functions like `has_missing_data?` etc., so they will keep working, but I'm getting rid of their ability to work with arbitrarily assigned missing values like `1`, `2`, etc. — anything other than `nil` and `Float::NAN`. In other words, `missing_values = [1, 2]` would have no effect. This makes it easy for me to remove `lazy_update` and `set_missing_positions`. `lazy_update` and `set_missing_positions` have to be removed to see the expected speedup. Let me know if it's fine.
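One way to keep the old names alive while dropping the arbitrary-missing-values feature is a thin deprecation shim. The sketch below is illustrative only (the class name and warning text are assumptions, not Daru's actual code):

```ruby
# Illustrative deprecation shim, not Daru's actual implementation.
class VectorShim
  def initialize(data)
    @data = data
  end

  # Old API kept working, but it now only recognizes nil and Float::NAN;
  # arbitrary missing values (e.g. missing_values = [1, 2]) are ignored.
  def has_missing_data?
    warn '[DEPRECATED] has_missing_data? will be removed; ' \
         'only nil and Float::NAN count as missing now.'
    @data.any? { |v| v.nil? || (v.is_a?(Float) && v.nan?) }
  end
end
```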
Kindly have a look. I've made some radical changes: removed some methods like `exist?`, `missing_values=`, etc., and finally got rid of `lazy_update` and `set_missing_positions` to speed up Daru. Some common methods like `has_missing_data?`, `only_valid`, etc. have been deprecated but kept. It might cause inconvenience for the next Daru release since methods like `missing_values` won't be working, but I insist on removing them as they lead to better performance. Next I will carry on with caching the missing positions of `nil` and `Float::NAN`.
Here are some benchmarks. The performance gains are staggering.

I ran the code below on this branch and on the master branch:

```ruby
require 'benchmark'
require 'daru'

n = 1000
dv = Daru::Vector.new 1..100000
Benchmark.bm do |x|
  x.report { n.times { dv[0, 10, 20, 30] = nil } }
end
```

Result for the current branch:

```
      user     system      total        real
  0.010000   0.000000   0.010000 (  0.004940)
```

Result for the master branch:

```
      user     system      total        real
 36.160000   0.260000  36.420000 ( 39.420752)
```
> Here are some benchmarks. The performance gains are staggering.

Fascinating! It's a giant leap forward. Now it looks like we can even think about replacing `where` with a normal `select` :)

Bravo, @lokeshh!
Thanks @zverok.

I have been running benchmarks on the implementation that caches `nil` positions versus the implementation with no caching, and surprisingly I'm getting roughly the same time with caching. I couldn't figure out what's going on; the code seems correct. I'm looking at this in more detail. Is there some gem to profile which lines in a method are taking a long time and which are not?
It seems to me that it's because operations such as `#at` and `#[]` are so expensive that, by comparison, finding the missing values in a vector is almost free; therefore caching the positions of `nil` and `Float::NAN` is not helping.
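That hypothesis can be sanity-checked outside Daru: compare a full scan-and-reject against rebuilding the result from positions that are already known. In this plain-Array sketch (illustrative, not Daru code), both variants still pay for building the output, which is where most of the time goes:

```ruby
require 'benchmark'

# Every 100th element is nil.
data = (1..10_000).map { |i| (i % 100).zero? ? nil : i }

# Pretend these positions were cached in advance.
cached_keep = data.each_index.reject { |i| data[i].nil? }

Benchmark.bm(10) do |x|
  # Scan for nils and rebuild the array each time.
  x.report('scan:')   { 100.times { data.reject(&:nil?) } }
  # Skip the scan, but still pay for rebuilding the result.
  x.report('cached:') { 100.times { data.values_at(*cached_keep) } }
end
```

If the two timings come out close, it supports the idea that the rebuild, not the scan, dominates the cost of `reject_values`.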
Update: Here's the benchmark I used to compare the performance of the caching implementation against the non-caching one:

```ruby
require 'benchmark'
require 'daru'

n = 10000
dv = Daru::Vector.new 1..n
Benchmark.bm do |x|
  x.report do
    dv.reject_values nil
  end
  x.report do
    100.times { dv.reject_values nil }
  end
end
```
Result with caching:

```
      user     system      total        real
  0.020000   0.010000   0.030000 (  0.034508)
  1.040000   0.020000   1.060000 (  1.059687)
```

Result without caching:

```
      user     system      total        real
  0.030000   0.010000   0.040000 (  0.034908)
  1.330000   0.010000   1.340000 (  1.344577)
```
I expected the second report in the caching implementation to be far lower than the second report in the non-caching implementation, because after the first call to `reject_values`, the positions of the `nil`s are cached in the caching implementation. But that is not the case; there's very little improvement.
Hmmm, it's interesting to see that caching isn't really helping much. Can you try profiling and see what's actually going on? I think you can try ruby-prof for this.
> I'm looking at this in more detail. Is there some gem to profile which lines in a method are taking a long time and which are not?

Yeah, like @v0dro says, ruby-prof is pretty easy to use. You just do something like this:

```ruby
require 'ruby-prof'

RubyProf.start
# (code you want to profile)
result = RubyProf.stop

printer = RubyProf::GraphHtmlPrinter.new(result)
printer.print(File.open('profile.html', 'w'))
```

And then that file will have a pretty list of what has taken how much time, cross-linked to the other methods it uses and so on.
@zverok In our case the behavior is not exactly the same, so I guess I will go with writing different specs. Right?
@lokeshh Yep, while extracting the same behaviors into shared example groups.
> @lokeshh Yep, while extracting the same behaviors into shared example groups.

@zverok Do you mean extracting the same behaviors into common specs and then using `include_context`?
Yep.
Alright. Thanks!
The PR is complete. I will start adding documentation. @zverok I will take care of extracting the common behavior later on.
Here's the short summary:
Methods added in Daru::Vector (and category):
Methods added in Daru::DataFrame:
Here are the benchmarks.
Methods removed in Daru::Vector:
and other methods have been deprecated.
All done. This is ready to be merged.
Works. Something will probably break in statsample, so you might need to fix it.
@v0dro Done. Modified Statsample to work with this PR.
Addresses #137