JuliaStats / NullableArrays.jl

DEPRECATED Prototype of the new JuliaStats NullableArrays package
Other
35 stars 21 forks source link

Add dropnull! #179

Closed cjprybol closed 7 years ago

cjprybol commented 7 years ago

I wanted to add a dropnull! function here before extending both dropnull and dropnull! in DataFrames (see https://github.com/JuliaStats/DataFrames.jl/issues/602 https://github.com/JuliaStats/DataFrames.jl/issues/1156)

This pull request currently does not pass the test cases that have been established for the current dropnull function, which I have used for testing dropnull!

using NullableArrays
using BenchmarkTools
using Base.Test

# dropnull(X::NullableVector)
z = NullableArray([1, 2, 3, 'a', 5, 'b', 7, 'c'], Int, Char)
# dropnull(X::AbstractVector)
A = Any[Nullable(1), Nullable(2), Nullable(3), Nullable(), Nullable(5),
        Nullable(), Nullable(7), Nullable()]
# dropnull(X::AbstractVector{<:Nullable})
B = convert(Vector{Nullable}, A)
@test dropnull(z) == dropnull!(z) == [1, 2, 3, 5, 7]
@test dropnull(A) == dropnull!(A) == [1, 2, 3, 5, 7]
@test dropnull(B) == dropnull!(B) == [1, 2, 3, 5, 7]

output

julia> @test dropnull(z) == dropnull!(z) == [1, 2, 3, 5, 7]
Test Passed
  Expression: $(Expr(:escape, :(dropnull(z)))) $(Expr(:escape, :(==))) $(Expr(:escape, :(dropnull!(z)))) $(Expr(:escape, :(==))) $(Expr(:escape, :([1,2,3,5,7])))
   Evaluated: [1,2,3,5,7] == [1,2,3,5,7] == [1,2,3,5,7]

julia> @test dropnull(A) == dropnull!(A) == [1, 2, 3, 5, 7]
Test Failed
  Expression: $(Expr(:escape, :(dropnull(A)))) $(Expr(:escape, :(==))) $(Expr(:escape, :(dropnull!(A)))) $(Expr(:escape, :(==))) $(Expr(:escape, :([1,2,3,5,7])))
   Evaluated: Any[1,2,3,5,7] == Any[1,2,3,5,7] == [1,2,3,5,7]
ERROR: There was an error during testing
 in record(::Base.Test.FallbackTestSet, ::Base.Test.Fail) at ./test.jl:397
 in do_test(::Base.Test.Returned, ::Expr) at ./test.jl:281

julia> @test dropnull(B) == dropnull!(B) == [1, 2, 3, 5, 7]
Test Failed
  Expression: $(Expr(:escape, :(dropnull(B)))) $(Expr(:escape, :(==))) $(Expr(:escape, :(dropnull!(B)))) $(Expr(:escape, :(==))) $(Expr(:escape, :([1,2,3,5,7])))
   Evaluated: Any[1,2,3,5,7] == Nullable[1,2,3,5,7] == [1,2,3,5,7]
ERROR: There was an error during testing
 in record(::Base.Test.FallbackTestSet, ::Base.Test.Fail) at ./test.jl:397
 in do_test(::Base.Test.Returned, ::Expr) at ./test.jl:281

It's unclear to me why in the second test case dropnull passes while dropnull! doesn't even though the outputs are both Array{Any,1}. Using isequal() does not change the test outcome.

julia> @test dropnull(A) == [1, 2, 3, 5, 7]
Test Passed
  Expression: dropnull(A) == [1,2,3,5,7]
   Evaluated: Any[1,2,3,5,7] == [1,2,3,5,7]

julia> @test dropnull!(A) == [1, 2, 3, 5, 7]
Test Failed
  Expression: dropnull!(A) == [1,2,3,5,7]
   Evaluated: Any[1,2,3,5,7] == [1,2,3,5,7]
ERROR: There was an error during testing
 in record(::Base.Test.FallbackTestSet, ::Base.Test.Fail) at ./test.jl:397
 in do_test(::Base.Test.Returned, ::Expr) at ./test.jl:281

 julia> A = Any[Nullable(1), Nullable(2), Nullable(3), Nullable(), Nullable(5),
                               Nullable(), Nullable(7), Nullable()];

 julia> typeof(dropnull(A))
 Array{Any,1}

 julia> typeof(dropnull!(A))
 Array{Any,1}

 julia> A = Any[Nullable(1), Nullable(2), Nullable(3), Nullable(), Nullable(5),
                               Nullable(), Nullable(7), Nullable()];

 julia> eltype(dropnull(A))
 Any

 julia> eltype(dropnull!(A))
 Any

I'm also interested in feedback on the third test case, where dropnull produces Array{Any,1} and dropnull! produces Array{Nullable,1}. Is there a reason to prefer one over the other?

julia> @test dropnull(B) == dropnull!(B) == [1, 2, 3, 5, 7]
Test Failed
  Expression: $(Expr(:escape, :(dropnull(B)))) $(Expr(:escape, :(==))) $(Expr(:escape, :(dropnull!(B)))) $(Expr(:escape, :(==))) $(Expr(:escape, :([1,2,3,5,7])))
   Evaluated: Any[1,2,3,5,7] == Nullable[1,2,3,5,7] == [1,2,3,5,7]
ERROR: There was an error during testing
 in record(::Base.Test.FallbackTestSet, ::Base.Test.Fail) at ./test.jl:397
 in do_test(::Base.Test.Returned, ::Expr) at ./test.jl:281

The benchmarks look ok except for dropnull!(C), where the inplace version is slower than dropnull(C) for reasons I don't understand. dropnull!(D) makes more allocations than dropnull(D) but is still faster

benchmark

srand(1)
A = NullableArray(rand([0, 1, Nullable()], Int(1e6)));
B = rand([0, 1, Nullable()], Int(1e6));
C = NullableArray(rand([:Q, 1, Nullable()], Int(1e6)));
D = rand([:Q, 1, Nullable()], Int(1e6));
@benchmark dropnull(A)
@benchmark dropnull!(A)
@benchmark dropnull(B)
@benchmark dropnull!(B)
@benchmark dropnull(C)
@benchmark dropnull!(C)
@benchmark dropnull(D)
@benchmark dropnull!(D)

results

julia> @benchmark dropnull(A)
BenchmarkTools.Trial:
  memory estimate:  6.03 mb
  allocs estimate:  6
  --------------
  minimum time:     5.301 ms (0.00% GC)
  median time:      6.044 ms (0.00% GC)
  mean time:        6.306 ms (4.95% GC)
  maximum time:     10.372 ms (13.32% GC)
  --------------
  samples:          793
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull!(A)
BenchmarkTools.Trial:
  memory estimate:  288.00 bytes
  allocs estimate:  4
  --------------
  minimum time:     998.744 μs (0.00% GC)
  median time:      1.081 ms (0.00% GC)
  mean time:        1.139 ms (0.00% GC)
  maximum time:     2.752 ms (0.00% GC)
  --------------
  samples:          4388
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull(B)
BenchmarkTools.Trial:
  memory estimate:  22.09 mb
  allocs estimate:  22
  --------------
  minimum time:     14.614 ms (9.83% GC)
  median time:      15.192 ms (10.47% GC)
  mean time:        15.631 ms (10.36% GC)
  maximum time:     19.828 ms (9.03% GC)
  --------------
  samples:          320
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull!(B)
BenchmarkTools.Trial:
  memory estimate:  160.00 bytes
  allocs estimate:  2
  --------------
  minimum time:     564.757 μs (0.00% GC)
  median time:      606.058 μs (0.00% GC)
  mean time:        634.440 μs (0.00% GC)
  maximum time:     1.357 ms (0.00% GC)
  --------------
  samples:          7862
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull(C)
BenchmarkTools.Trial:
  memory estimate:  6.05 mb
  allocs estimate:  6
  --------------
  minimum time:     6.013 ms (0.00% GC)
  median time:      6.934 ms (0.00% GC)
  mean time:        7.086 ms (4.10% GC)
  maximum time:     10.903 ms (17.99% GC)
  --------------
  samples:          706
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull!(C)
BenchmarkTools.Trial:
  memory estimate:  71.37 mb
  allocs estimate:  2004545
  --------------
  minimum time:     59.823 ms (8.33% GC)
  median time:      66.289 ms (9.09% GC)
  mean time:        66.455 ms (9.15% GC)
  maximum time:     76.234 ms (11.40% GC)
  --------------
  samples:          76
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull(D)
BenchmarkTools.Trial:
  memory estimate:  14.09 mb
  allocs estimate:  22
  --------------
  minimum time:     12.920 ms (0.00% GC)
  median time:      15.949 ms (9.93% GC)
  mean time:        15.856 ms (7.98% GC)
  maximum time:     22.595 ms (20.56% GC)
  --------------
  samples:          316
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull!(D)
BenchmarkTools.Trial:
  memory estimate:  20.37 mb
  allocs estimate:  667464
  --------------
  minimum time:     3.712 ms (0.00% GC)
  median time:      4.898 ms (18.68% GC)
  mean time:        5.015 ms (18.00% GC)
  maximum time:     8.406 ms (18.44% GC)
  --------------
  samples:          997
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

I think that's enough questions for one post. Thanks in advance for any advice on how to proceed.

nalimilan commented 7 years ago

Thanks for doing this. The tests fail because the arrays contain Nullable{Int} entries, even if they are printed like Int. You'll need a slightly different definition of dropnull! which calls convert(Vector{T}, ...) with Nullable{T} the element type of the input array. So you'll need a separate no-op definition for non-nullable arrays. You'll also have to check both the returned vector and the original vector, since they are different (though if everything is OK they should share the same storage).

As regards speed, you can improve it by accessing the values and isnull fields of the NullableArray. That should be much more efficient. Actually, instead of calling convert you can just return the values field.

Finally, for maximum speed, you can adapt the algorithm used by deleteat! so that you don't need to store the indices (which can take some space if there are many nulls), but instead remove the nulls on the fly. Shouldn't be hard.

cjprybol commented 7 years ago

Thanks for the feedback! I tried implementing what you suggested but backtracked. Generating the same return values as dropnull requires the dropnull! function to make a copy of the input vector, and those changes wouldn't be reflected inplace as the ! convention suggests. So I just limited the function to the simple deleteat! use.

Is there an inplace array conversion function like convert! that I've missed? I'm assuming reassignment of the input variable to a new array could be done using macros, eval, or maybe pointers, but I'm not sure if any of those are recommended. Happy to change tactics if there are any suggestions for how to inplace convert an Array from something like Array{Nullable{Int64}, 1} to Array{Int64,1}, I just couldn't find it.

Limitations/differences from dropnull are reflected in the help message and I've updated the tests to reflect the differences. I also updated one of the dropnull functions. It seemed that the res array was unneccessary and the first array copy, Y, could be modified instead. There was a slight performance improvement with the change and the test cases produced the expected output.

nalimilan commented 7 years ago

Thanks for the feedback! I tried implementing what you suggested but backtracked. Generating the same return values as dropnull requires the dropnull! function to make a copy of the input vector, and those changes wouldn't be reflected inplace as the ! convention suggests. So I just limited the function to the simple deleteat! use.

You don't need to make a copy, just call deleteat! on both the values and isnull fields, and return the former. The original array will remain a NullableArray, but without its missing values; the type cannot be changed in place but that's fine.

Is there an inplace array conversion function like convert! that I've missed? I'm assuming reassignment of the input variable to a new array could be done using macros, eval, or maybe pointers, but I'm not sure if any of those are recommended. Happy to change tactics if there are any suggestions for how to inplace convert an Array from something like Array{Nullable{Int64}, 1} to Array{Int64,1}, I just couldn't find it.

Limitations/differences from dropnull are reflected in the help message and I've updated the tests to reflect the differences. I also updated one of the dropnull functions. It seemed that the res array was unneccessary and the first array copy, Y, could be modified instead. There was a slight performance improvement with the change and the test cases produced the expected output.

Makes sense. Another optimization would be to check that Nullable <: eltype(Y) at the top of the function, and if not return copy(X). Indeed, in most cases the array will not be able to hold Nullable entries at all, in which case the filter + loop is not needed.

nalimilan commented 7 years ago

Finally, for maximum speed, you can adapt the algorithm used by deleteat! so that you don't need to store the indices (which can take some space if there are many nulls), but instead remove the nulls on the fly. Shouldn't be hard.

Actually, this suggestion I made is overkill for the NullableArray method. You just need to do:

hasvalue = !X.isnull
deleteat!(X.values, hasvalue)
resize!(X.isnull, length(X.values))
fill!(X.isnull, false)
return X.values

Since the plan is to replace the isnull field withhasvalue(withhasvalue == !isnull`), this will require no allocation at all at that point. We will even be a able to create views over the non-null entries without any copy.

codecov-io commented 7 years ago

Codecov Report

Merging #179 into master will increase coverage by 0.26%. The diff coverage is 100%.

@@            Coverage Diff            @@
##           master    #179      +/-   ##
=========================================
+ Coverage   63.43%   63.7%   +0.26%     
=========================================
  Files          13      13              
  Lines         681     686       +5     
=========================================
+ Hits          432     437       +5     
  Misses        249     249
Impacted Files Coverage Δ
src/primitives.jl 100% <100%> (ø) :white_check_mark:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 53e5e46...affbb61. Read the comment docs.

cjprybol commented 7 years ago

Actually, this suggestion I made is overkill for the NullableArray method. You just need to do:

hasvalue = !X.isnull
deleteat!(X.values, hasvalue)
resize!(X.isnull, length(X.values))
fill!(X.isnull, false)
return X.values

Since the plan is to replace the isnull field withhasvalue(withhasvalue == !isnull`), this will require no allocation at all at that point. We will even be a able to create views over the non-null entries without any copy.

This appears equivalent to deleteat!(X, find(X.isnull)).values. deleteat! requires indices, so using the boolean values of X.isnull doesn't work and find() is necessary. X.isnull is updated during deleteat! so the resize!(X.isnull, length(X.values)) and fill!(X.isnull, false) don't appear to be necessary.

julia> hasvalue = !X.isnull
20-element Array{Bool,1}:
  true
  true
  true
 false
  true
 false
  true
 false
 false
  true
  true
  true
  true
 false
  true
 false
  true
  true
  true
  true

julia> deleteat!(X.values, X.isnull)
ERROR: ArgumentError: indices must be unique and sorted
 in deleteat!(::Array{Int64,1}, ::Array{Bool,1}) at ./array.jl:614

julia> deleteat!(X, find(X.isnull))
14-element NullableArrays.NullableArray{Int64,1}:
 1
 0
 1
 1
 0
 1
 1
 1
 0
 1
 1
 0
 1
 1

julia> X.values
14-element Array{Int64,1}:
 1
 0
 1
 1
 0
 1
 1
 1
 0
 1
 1
 0
 1
 1

julia> X.isnull
14-element Array{Bool,1}:
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false

I think that covers everything except for

Another optimization would be to check that Nullable <: eltype(Y) at the top of the function, and if not return copy(X). Indeed, in most cases the array will not be able to hold Nullable entries at all, in which case the filter + loop is not needed.

I'll have a go at that now

nalimilan commented 7 years ago

Yes, I just realized logical indexing wasn't supported (https://github.com/JuliaLang/julia/pull/20465). But you can work around that by doing deleteat!(X, (i for i in eachindex(X) if @inbounds X.isnull[i])), which does not allocate a new vector at all. I know my proposal was more complex than the shorter form you suggested, but I figured it might be faster since filling X.isnull with false should be more efficient than removing entries one by one (which calling deleteat!(X, ...) does under the hood). Would need to check that.

cjprybol commented 7 years ago

Thanks for explaining the allocation logic behind your suggestion (and for the other comments as well), that makes perfect sense! I hit some trouble with the @inbounds. It seems like the @inbounds isn't parsed correctly with that inline usage because it works fine with the @inbounds removed. Putting it in parenthesis didn't help either.

julia> X = NullableArray([1, 2, 3, Nullable(), 5, Nullable(), 7, Nullable()])
8-element NullableArrays.NullableArray{Int64,1}:
 1
 2
 3
 #NULL
 5
 #NULL
 7
 #NULL

julia> for i in (i for i in eachindex(X) if X.isnull[i])
           @show i
       end
i = 4
i = 6
i = 8

julia> for i in (i for i in eachindex(X) if @inbounds X.isnull[i])
           @show i
       end
ERROR: TypeError: non-boolean (Void) used in boolean context
 in start_filter(::##6#8, ::Base.OneTo{Int64}) at ./iterator.jl:196
 in start(::Filter{##6#8,Base.OneTo{Int64}}) at ./iterator.jl:191
 in anonymous at ./<missing>:?
julia> for i in (i for i in eachindex(X) if (@inbounds X.isnull[i]))
           @show i
       end
ERROR: TypeError: non-boolean (Void) used in boolean context
 in start_filter(::##10#12, ::Base.OneTo{Int64}) at ./iterator.jl:196
 in start(::Filter{##10#12,Base.OneTo{Int64}}) at ./iterator.jl:191
 in anonymous at ./<missing>:?

without the use of @inbounds the proposal seems a little slower

julia> using NullableArrays

julia> using BenchmarkTools

julia> import NullableArrays.dropnull!

julia> dropnull!(X::NullableVector) = deleteat!(X, find(X.isnull)).values
WARNING: Method definition dropnull!(NullableArrays.NullableArray{T<:Any, 1}) in module NullableArrays at /Users/Cameron/.julia/v0.5/NullableArrays/src/primitives.jl:202 overwritten in module Main at REPL[4]:1.
dropnull! (generic function with 3 methods)

julia> function dropnull_new!(X::NullableVector)
           deleteat!(X, (i for i in eachindex(X) if X.isnull[i])).values
       end
dropnull_new! (generic function with 1 method)

julia> srand(1)
MersenneTwister(UInt32[0x00000001],Base.dSFMT.DSFMT_state(Int32[1749029653,1072851681,1610647787,1072862326,1841712345,1073426746,-198061126,1073322060,-156153802,1073567984  …  1977574422,1073209915,278919868,1072835605,1290372147,18858467,1815133874,-1716870370,382,0]),[1.49131,1.32188,1.20941,1.67663,1.36239,1.92045,1.21571,1.71665,1.11597,1.15241  …  1.11209,1.02599,1.88475,1.65876,1.57727,1.96827,1.08836,1.32939,1.40152,1.01871],382)

julia> X = NullableArray(rand([1, 2, 3, 4, Nullable()], Int(1e6)));

julia> @benchmark dropnull!(X)
BenchmarkTools.Trial:
  memory estimate:  80.00 bytes
  allocs estimate:  1
  --------------
  minimum time:     1.083 ms (0.00% GC)
  median time:      1.096 ms (0.00% GC)
  mean time:        1.139 ms (0.00% GC)
  maximum time:     2.435 ms (0.00% GC)
  --------------
  samples:          4389
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> X = NullableArray(rand([1, 2, 3, 4, Nullable()], Int(1e6)));

julia> @benchmark dropnull!(X)
BenchmarkTools.Trial:
  memory estimate:  80.00 bytes
  allocs estimate:  1
  --------------
  minimum time:     1.085 ms (0.00% GC)
  median time:      1.097 ms (0.00% GC)
  mean time:        1.150 ms (0.00% GC)
  maximum time:     2.943 ms (0.00% GC)
  --------------
  samples:          4347
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> srand(1)
MersenneTwister(UInt32[0x00000001],Base.dSFMT.DSFMT_state(Int32[1749029653,1072851681,1610647787,1072862326,1841712345,1073426746,-198061126,1073322060,-156153802,1073567984  …  1977574422,1073209915,278919868,1072835605,1290372147,18858467,1815133874,-1716870370,382,0]),[1.2326,1.10967,1.95986,1.77561,1.64916,1.07679,1.76917,1.60431,1.80621,1.10386  …  1.48308,1.69298,1.08006,1.27888,1.24528,1.56128,1.33465,1.55301,1.65853,1.58222],382)

julia> X = NullableArray(rand([1, 2, 3, 4, Nullable()], Int(1e6)));

julia> @benchmark dropnull_new!(X)
BenchmarkTools.Trial:
  memory estimate:  64.00 bytes
  allocs estimate:  3
  --------------
  minimum time:     1.441 ms (0.00% GC)
  median time:      1.464 ms (0.00% GC)
  mean time:        1.514 ms (0.00% GC)
  maximum time:     3.139 ms (0.00% GC)
  --------------
  samples:          3302
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> X = NullableArray(rand([1, 2, 3, 4, Nullable()], Int(1e6)));

julia> @benchmark dropnull_new!(X)
BenchmarkTools.Trial:
  memory estimate:  64.00 bytes
  allocs estimate:  3
  --------------
  minimum time:     1.442 ms (0.00% GC)
  median time:      1.464 ms (0.00% GC)
  mean time:        1.518 ms (0.00% GC)
  maximum time:     3.179 ms (0.00% GC)
  --------------
  samples:          3294
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Any ideas how to get around the @inbounds issue or am I missing something? Or should we hold back on this until your pull request to allow boolean indexing with deleteat! is merged and use that?

cjprybol commented 7 years ago

And while I'm looking at it, could you explain what the _isnull definitions starting at line 153 are for? I wasn't sure if they were necessary so I tried removing them and replacing their calls with isnull and the tests passed. Are they for improved performance or for Julia v0.3 support?

ararslan commented 7 years ago

In Julia 0.5, isnull is only defined for Nullable inputs, so that definition is necessary in case isnull gets called on something else. In 0.6 the signature for isnull has been widened.

cjprybol commented 7 years ago

I added a note that dropnull!(X::NullableVector) can be improved later when @nalimilan's logical indexing for deleteat! is ready for use. I ran some more benchmarks to see if the previous find(X.isnull) vs. (i for i in eachindex(X) if X.isnull[i]) result held up and it looks like it does. Using arrays of various sizes, types, and frequency of nulls is yielding the same thing.

julia> using NullableArrays

julia> using BenchmarkTools

julia> function dropnull_new!(X::NullableVector)
           deleteat!(X, (i for i in eachindex(X) if X.isnull[i])).values
       end
dropnull_new! (generic function with 1 method)

julia> srand(1)
MersenneTwister(UInt32[0x00000001],Base.dSFMT.DSFMT_state(Int32[1749029653,1072851681,1610647787,1072862326,1841712345,1073426746,-198061126,1073322060,-156153802,1073567984  …  1977574422,1073209915,278919868,1072835605,1290372147,18858467,1815133874,-1716870370,382,0]),[1.76601,1.18583,1.12095,1.76528,1.63716,1.95074,1.24557,1.31757,1.77156,1.6003  …  1.3864,1.446,1.81203,1.31359,1.93779,1.56674,1.38483,1.46984,1.75549,1.02138],382)

julia> small = NullableArray(rand([1, 2, 3, 4, Nullable()], Int(1e2)));

julia> medium = NullableArray(rand([1, 2, 3, 4, Nullable()], Int(1e4)));

julia> large = NullableArray(rand([1, 2, 3, 4, Nullable()], Int(1e6)));

julia> mixed_small = NullableArray(rand([1, :b, "c", false, Nullable()], Int(1e2)));

julia> mixed_medium = NullableArray(rand([1, :b, "c", false, Nullable()], Int(1e4)));

julia> mixed_large = NullableArray(rand([1, :b, "c", false, Nullable()], Int(1e6)));

julia> few_nulls = many_nulls = NullableArray(rand(vcat(collect(1:100), Nullable()), Int(1e6)));

julia> many_nulls = NullableArray(rand([1, Nullable()], Int(1e6)));

julia> @benchmark dropnull!(copy(small))
BenchmarkTools.Trial:
  memory estimate:  1.55 kb
  allocs estimate:  7
  --------------
  minimum time:     642.788 ns (0.00% GC)
  median time:      683.497 ns (0.00% GC)
  mean time:        776.364 ns (6.31% GC)
  maximum time:     9.129 μs (82.14% GC)
  --------------
  samples:          10000
  evals/sample:     165
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull_new!(copy(small))
BenchmarkTools.Trial:
  memory estimate:  2.94 kb
  allocs estimate:  61
  --------------
  minimum time:     1.079 μs (0.00% GC)
  median time:      1.178 μs (0.00% GC)
  mean time:        1.447 μs (12.53% GC)
  maximum time:     179.129 μs (97.49% GC)
  --------------
  samples:          10000
  evals/sample:     10
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull!(copy(medium))
BenchmarkTools.Trial:
  memory estimate:  104.23 kb
  allocs estimate:  8
  --------------
  minimum time:     67.861 μs (0.00% GC)
  median time:      74.607 μs (0.00% GC)
  mean time:        94.134 μs (11.43% GC)
  maximum time:     3.151 ms (93.16% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull_new!(copy(medium))
BenchmarkTools.Trial:
  memory estimate:  214.58 kb
  allocs estimate:  4048
  --------------
  minimum time:     123.054 μs (0.00% GC)
  median time:      128.369 μs (0.00% GC)
  mean time:        150.273 μs (8.13% GC)
  maximum time:     2.406 ms (92.09% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull!(copy(large))
BenchmarkTools.Trial:
  memory estimate:  10.11 mb
  allocs estimate:  10
  --------------
  minimum time:     7.744 ms (0.00% GC)
  median time:      9.143 ms (0.00% GC)
  mean time:        9.497 ms (6.86% GC)
  maximum time:     14.874 ms (12.86% GC)
  --------------
  samples:          527
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull_new!(copy(large))
BenchmarkTools.Trial:
  memory estimate:  20.81 mb
  allocs estimate:  400533
  --------------
  minimum time:     13.630 ms (0.00% GC)
  median time:      15.054 ms (5.80% GC)
  mean time:        15.698 ms (6.28% GC)
  maximum time:     30.699 ms (6.70% GC)
  --------------
  samples:          319
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull!(copy(mixed_small))
BenchmarkTools.Trial:
  memory estimate:  1.45 kb
  allocs estimate:  7
  --------------
  minimum time:     812.817 ns (0.00% GC)
  median time:      863.750 ns (0.00% GC)
  mean time:        975.505 ns (4.84% GC)
  maximum time:     13.684 μs (89.38% GC)
  --------------
  samples:          10000
  evals/sample:     82
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull_new!(copy(mixed_small))
BenchmarkTools.Trial:
  memory estimate:  2.63 kb
  allocs estimate:  51
  --------------
  minimum time:     1.155 μs (0.00% GC)
  median time:      1.239 μs (0.00% GC)
  mean time:        1.513 μs (10.85% GC)
  maximum time:     195.910 μs (97.78% GC)
  --------------
  samples:          10000
  evals/sample:     10
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull!(copy(mixed_medium))
BenchmarkTools.Trial:
  memory estimate:  104.17 kb
  allocs estimate:  8
  --------------
  minimum time:     111.944 μs (0.00% GC)
  median time:      123.489 μs (0.00% GC)
  mean time:        150.206 μs (7.48% GC)
  maximum time:     4.378 ms (76.66% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull_new!(copy(mixed_medium))
BenchmarkTools.Trial:
  memory estimate:  214.33 kb
  allocs estimate:  4040
  --------------
  minimum time:     161.621 μs (0.00% GC)
  median time:      173.905 μs (0.00% GC)
  mean time:        201.716 μs (6.91% GC)
  maximum time:     3.736 ms (69.56% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull!(copy(mixed_large))
BenchmarkTools.Trial:
  memory estimate:  10.11 mb
  allocs estimate:  10
  --------------
  minimum time:     11.913 ms (0.00% GC)
  median time:      14.260 ms (0.00% GC)
  mean time:        14.731 ms (8.55% GC)
  maximum time:     20.101 ms (19.81% GC)
  --------------
  samples:          340
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull_new!(copy(mixed_large))
BenchmarkTools.Trial:
  memory estimate:  20.76 mb
  allocs estimate:  399065
  --------------
  minimum time:     17.157 ms (0.00% GC)
  median time:      20.291 ms (11.12% GC)
  mean time:        20.787 ms (11.60% GC)
  maximum time:     28.305 ms (16.95% GC)
  --------------
  samples:          241
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull!(copy(few_nulls))
BenchmarkTools.Trial:
  memory estimate:  8.66 mb
  allocs estimate:  10
  --------------
  minimum time:     2.893 ms (0.00% GC)
  median time:      3.990 ms (0.00% GC)
  mean time:        4.574 ms (15.03% GC)
  maximum time:     9.104 ms (36.32% GC)
  --------------
  samples:          1093
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull_new!(copy(few_nulls))
BenchmarkTools.Trial:
  memory estimate:  9.18 mb
  allocs estimate:  19547
  --------------
  minimum time:     3.542 ms (0.00% GC)
  median time:      4.698 ms (0.00% GC)
  mean time:        5.352 ms (14.03% GC)
  maximum time:     9.884 ms (29.17% GC)
  --------------
  samples:          934
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull!(copy(many_nulls))
BenchmarkTools.Trial:
  memory estimate:  12.40 mb
  allocs estimate:  10
  --------------
  minimum time:     15.073 ms (0.00% GC)
  median time:      17.257 ms (0.00% GC)
  mean time:        17.450 ms (5.63% GC)
  maximum time:     22.696 ms (14.95% GC)
  --------------
  samples:          287
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dropnull_new!(copy(many_nulls))
BenchmarkTools.Trial:
  memory estimate:  39.08 mb
  allocs estimate:  999453
  --------------
  minimum time:     26.177 ms (5.76% GC)
  median time:      29.440 ms (9.93% GC)
  mean time:        29.850 ms (9.74% GC)
  maximum time:     36.725 ms (11.56% GC)
  --------------
  samples:          168
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Hope that isn't going overboard with the benchmarking. Any additional suggestions?

nalimilan commented 7 years ago

Any ideas how to get around the @inbounds issue or am I missing something? Or should we hold back on this until your pull request to allow boolean indexing with deleteat! is merged and use that?

Sorry for the late reply, I had missed your comments. The problem is that @inbounds currently returns nothing instead of the wrapped value. So you'll have to do something like:

@inbounds res = deleteat!(X, (i for i in eachindex(X) if X.isnull[i])).values
return res

Though your benchmarks show lots of allocations which shouldn't be there, which could indicate a deeper issue (maybe with deleteat! and inference). Since this can be made more efficient later by using logical indexing, I wouldn't spend too much time on this and I agree it's OK to use find for now.

cjprybol commented 7 years ago

Thanks so much for your help and feedback on this @nalimilan! @ararslan too!