Closed cjprybol closed 7 years ago
Thanks for doing this. The tests fail because the arrays contain Nullable{Int}
entries, even if they are printed like Int
. You'll need a slightly different definition of dropnull!
which calls convert(Vector{T}, ...)
with Nullable{T}
the element type of the input array. So you'll need a separate no-op definition for non-nullable arrays. You'll also have to check both the returned vector and the original vector, since they are different (though if everything is OK they should share the same storage).
As regards speed, you can improve it by accessing the values
and isnull
fields of the NullableArray
. That should be much more efficient. Actually, instead of calling convert
you can just return the values
field.
Finally, for maximum speed, you can adapt the algorithm used by deleteat!
so that you don't need to store the indices (which can take some space if there are many nulls), but instead remove the nulls on the fly. Shouldn't be hard.
Thanks for the feedback! I tried implementing what you suggested but backtracked. Generating the same return values as dropnull
requires the dropnull!
function to make a copy of the input vector, and those changes wouldn't be reflected inplace as the !
convention suggests. So I just limited the function to the simple deleteat!
use.
Is there an inplace array conversion function like convert!
that I've missed? I'm assuming reassignment of the input variable to a new array could be done using macros, eval, or maybe pointers, but I'm not sure if any of those are recommended. Happy to change tactics if there are any suggestions for how to inplace convert an Array from something like Array{Nullable{Int64}, 1}
to Array{Int64,1}
, I just couldn't find it.
Limitations/differences from dropnull
are reflected in the help message and I've updated the tests to reflect the differences. I also updated one of the dropnull
functions. It seemed that the res
array was unneccessary and the first array copy, Y
, could be modified instead. There was a slight performance improvement with the change and the test cases produced the expected output.
Thanks for the feedback! I tried implementing what you suggested but backtracked. Generating the same return values as dropnull requires the dropnull! function to make a copy of the input vector, and those changes wouldn't be reflected inplace as the ! convention suggests. So I just limited the function to the simple deleteat! use.
You don't need to make a copy, just call deleteat!
on both the values
and isnull
fields, and return the former. The original array will remain a NullableArray
, but without its missing values; the type cannot be changed in place but that's fine.
Is there an inplace array conversion function like convert! that I've missed? I'm assuming reassignment of the input variable to a new array could be done using macros, eval, or maybe pointers, but I'm not sure if any of those are recommended. Happy to change tactics if there are any suggestions for how to inplace convert an Array from something like Array{Nullable{Int64}, 1} to Array{Int64,1}, I just couldn't find it.
Limitations/differences from dropnull are reflected in the help message and I've updated the tests to reflect the differences. I also updated one of the dropnull functions. It seemed that the res array was unneccessary and the first array copy, Y, could be modified instead. There was a slight performance improvement with the change and the test cases produced the expected output.
Makes sense. Another optimization would be to check that Nullable <: eltype(Y)
at the top of the function, and if not return copy(X)
. Indeed, in most cases the array will not be able to hold Nullable
entries at all, in which case the filter + loop is not needed.
Finally, for maximum speed, you can adapt the algorithm used by deleteat! so that you don't need to store the indices (which can take some space if there are many nulls), but instead remove the nulls on the fly. Shouldn't be hard.
Actually, this suggestion I made is overkill for the NullableArray
method. You just need to do:
hasvalue = !X.isnull
deleteat!(X.values, hasvalue)
resize!(X.isnull, length(X.values))
fill!(X.isnull, false)
return X.values
Since the plan is to replace the isnull
field with
hasvalue(with
hasvalue == !isnull`), this will require no allocation at all at that point. We will even be a able to create views over the non-null entries without any copy.
Merging #179 into master will increase coverage by
0.26%
. The diff coverage is100%
.
@@ Coverage Diff @@
## master #179 +/- ##
=========================================
+ Coverage 63.43% 63.7% +0.26%
=========================================
Files 13 13
Lines 681 686 +5
=========================================
+ Hits 432 437 +5
Misses 249 249
Impacted Files | Coverage Δ | |
---|---|---|
src/primitives.jl | 100% <100%> (ø) |
:white_check_mark: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 53e5e46...affbb61. Read the comment docs.
Actually, this suggestion I made is overkill for the NullableArray method. You just need to do:
hasvalue = !X.isnull deleteat!(X.values, hasvalue) resize!(X.isnull, length(X.values)) fill!(X.isnull, false) return X.values
Since the plan is to replace the isnull field withhasvalue(withhasvalue == !isnull`), this will require no allocation at all at that point. We will even be a able to create views over the non-null entries without any copy.
This appears equivalent to deleteat!(X, find(X.isnull)).values
. deleteat!
requires indices, so using the boolean values of X.isnull
doesn't work and find()
is necessary. X.isnull
is updated during deleteat!
so the resize!(X.isnull, length(X.values))
and fill!(X.isnull, false)
don't appear to be necessary.
julia> hasvalue = !X.isnull
20-element Array{Bool,1}:
true
true
true
false
true
false
true
false
false
true
true
true
true
false
true
false
true
true
true
true
julia> deleteat!(X.values, X.isnull)
ERROR: ArgumentError: indices must be unique and sorted
in deleteat!(::Array{Int64,1}, ::Array{Bool,1}) at ./array.jl:614
julia> deleteat!(X, find(X.isnull))
14-element NullableArrays.NullableArray{Int64,1}:
1
0
1
1
0
1
1
1
0
1
1
0
1
1
julia> X.values
14-element Array{Int64,1}:
1
0
1
1
0
1
1
1
0
1
1
0
1
1
julia> X.isnull
14-element Array{Bool,1}:
false
false
false
false
false
false
false
false
false
false
false
false
false
false
I think that covers everything except for
Another optimization would be to check that Nullable <: eltype(Y) at the top of the function, and if not return copy(X). Indeed, in most cases the array will not be able to hold Nullable entries at all, in which case the filter + loop is not needed.
I'll have a go at that now
Yes, I just realized logical indexing wasn't supported (https://github.com/JuliaLang/julia/pull/20465). But you can work around that by doing deleteat!(X, (i for i in eachindex(X) if @inbounds X.isnull[i]))
, which does not allocate a new vector at all. I know my proposal was more complex than the shorter form you suggested, but I figured it might be faster since filling X.isnull
with false
should be more efficient than removing entries one by one (which calling deleteat!(X, ...)
does under the hood). Would need to check that.
Thanks for explaining the allocation logic behind your suggestion (and for the other comments as well), that makes perfect sense! I hit some trouble with the @inbounds
. It seems like the @inbounds
isn't parsed correctly with that inline usage because it works fine with the @inbounds
removed. Putting it in parenthesis didn't help either.
julia> X = NullableArray([1, 2, 3, Nullable(), 5, Nullable(), 7, Nullable()])
8-element NullableArrays.NullableArray{Int64,1}:
1
2
3
#NULL
5
#NULL
7
#NULL
julia> for i in (i for i in eachindex(X) if X.isnull[i])
@show i
end
i = 4
i = 6
i = 8
julia> for i in (i for i in eachindex(X) if @inbounds X.isnull[i])
@show i
end
ERROR: TypeError: non-boolean (Void) used in boolean context
in start_filter(::##6#8, ::Base.OneTo{Int64}) at ./iterator.jl:196
in start(::Filter{##6#8,Base.OneTo{Int64}}) at ./iterator.jl:191
in anonymous at ./<missing>:?
julia> for i in (i for i in eachindex(X) if (@inbounds X.isnull[i]))
@show i
end
ERROR: TypeError: non-boolean (Void) used in boolean context
in start_filter(::##10#12, ::Base.OneTo{Int64}) at ./iterator.jl:196
in start(::Filter{##10#12,Base.OneTo{Int64}}) at ./iterator.jl:191
in anonymous at ./<missing>:?
without the use of @inbounds
the proposal seems a little slower
julia> using NullableArrays
julia> using BenchmarkTools
julia> import NullableArrays.dropnull!
julia> dropnull!(X::NullableVector) = deleteat!(X, find(X.isnull)).values
WARNING: Method definition dropnull!(NullableArrays.NullableArray{T<:Any, 1}) in module NullableArrays at /Users/Cameron/.julia/v0.5/NullableArrays/src/primitives.jl:202 overwritten in module Main at REPL[4]:1.
dropnull! (generic function with 3 methods)
julia> function dropnull_new!(X::NullableVector)
deleteat!(X, (i for i in eachindex(X) if X.isnull[i])).values
end
dropnull_new! (generic function with 1 method)
julia> srand(1)
MersenneTwister(UInt32[0x00000001],Base.dSFMT.DSFMT_state(Int32[1749029653,1072851681,1610647787,1072862326,1841712345,1073426746,-198061126,1073322060,-156153802,1073567984 … 1977574422,1073209915,278919868,1072835605,1290372147,18858467,1815133874,-1716870370,382,0]),[1.49131,1.32188,1.20941,1.67663,1.36239,1.92045,1.21571,1.71665,1.11597,1.15241 … 1.11209,1.02599,1.88475,1.65876,1.57727,1.96827,1.08836,1.32939,1.40152,1.01871],382)
julia> X = NullableArray(rand([1, 2, 3, 4, Nullable()], Int(1e6)));
julia> @benchmark dropnull!(X)
BenchmarkTools.Trial:
memory estimate: 80.00 bytes
allocs estimate: 1
--------------
minimum time: 1.083 ms (0.00% GC)
median time: 1.096 ms (0.00% GC)
mean time: 1.139 ms (0.00% GC)
maximum time: 2.435 ms (0.00% GC)
--------------
samples: 4389
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> X = NullableArray(rand([1, 2, 3, 4, Nullable()], Int(1e6)));
julia> @benchmark dropnull!(X)
BenchmarkTools.Trial:
memory estimate: 80.00 bytes
allocs estimate: 1
--------------
minimum time: 1.085 ms (0.00% GC)
median time: 1.097 ms (0.00% GC)
mean time: 1.150 ms (0.00% GC)
maximum time: 2.943 ms (0.00% GC)
--------------
samples: 4347
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> srand(1)
MersenneTwister(UInt32[0x00000001],Base.dSFMT.DSFMT_state(Int32[1749029653,1072851681,1610647787,1072862326,1841712345,1073426746,-198061126,1073322060,-156153802,1073567984 … 1977574422,1073209915,278919868,1072835605,1290372147,18858467,1815133874,-1716870370,382,0]),[1.2326,1.10967,1.95986,1.77561,1.64916,1.07679,1.76917,1.60431,1.80621,1.10386 … 1.48308,1.69298,1.08006,1.27888,1.24528,1.56128,1.33465,1.55301,1.65853,1.58222],382)
julia> X = NullableArray(rand([1, 2, 3, 4, Nullable()], Int(1e6)));
julia> @benchmark dropnull_new!(X)
BenchmarkTools.Trial:
memory estimate: 64.00 bytes
allocs estimate: 3
--------------
minimum time: 1.441 ms (0.00% GC)
median time: 1.464 ms (0.00% GC)
mean time: 1.514 ms (0.00% GC)
maximum time: 3.139 ms (0.00% GC)
--------------
samples: 3302
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> X = NullableArray(rand([1, 2, 3, 4, Nullable()], Int(1e6)));
julia> @benchmark dropnull_new!(X)
BenchmarkTools.Trial:
memory estimate: 64.00 bytes
allocs estimate: 3
--------------
minimum time: 1.442 ms (0.00% GC)
median time: 1.464 ms (0.00% GC)
mean time: 1.518 ms (0.00% GC)
maximum time: 3.179 ms (0.00% GC)
--------------
samples: 3294
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
Any ideas how to get around the @inbounds
issue or am I missing something? Or should we hold back on this until your pull request to allow boolean indexing with deleteat!
is merged and use that?
And while I'm looking at it, could you explain what the _isnull
definitions starting at line 153 are for? I wasn't sure if they were necessary so I tried removing them and replacing their calls with isnull
and the tests passed. Are they for improved performance or for Julia v0.3 support?
In Julia 0.5, isnull
is only defined for Nullable
inputs, so that definition is necessary in case isnull
gets called on something else. In 0.6 the signature for isnull
has been widened.
I added a note that dropnull!(X::NullableVector)
can be improved later when @nalimilan's logical indexing for deleteat!
is ready for use. I ran some more benchmarks to see if the previous find(X.isnull)
vs. (i for i in eachindex(X) if X.isnull[i])
result held up and it looks like it does. Using arrays of various sizes, types, and frequency of nulls is yielding the same thing.
julia> using NullableArrays
julia> using BenchmarkTools
julia> function dropnull_new!(X::NullableVector)
deleteat!(X, (i for i in eachindex(X) if X.isnull[i])).values
end
dropnull_new! (generic function with 1 method)
julia> srand(1)
MersenneTwister(UInt32[0x00000001],Base.dSFMT.DSFMT_state(Int32[1749029653,1072851681,1610647787,1072862326,1841712345,1073426746,-198061126,1073322060,-156153802,1073567984 … 1977574422,1073209915,278919868,1072835605,1290372147,18858467,1815133874,-1716870370,382,0]),[1.76601,1.18583,1.12095,1.76528,1.63716,1.95074,1.24557,1.31757,1.77156,1.6003 … 1.3864,1.446,1.81203,1.31359,1.93779,1.56674,1.38483,1.46984,1.75549,1.02138],382)
julia> small = NullableArray(rand([1, 2, 3, 4, Nullable()], Int(1e2)));
julia> medium = NullableArray(rand([1, 2, 3, 4, Nullable()], Int(1e4)));
julia> large = NullableArray(rand([1, 2, 3, 4, Nullable()], Int(1e6)));
julia> mixed_small = NullableArray(rand([1, :b, "c", false, Nullable()], Int(1e2)));
julia> mixed_medium = NullableArray(rand([1, :b, "c", false, Nullable()], Int(1e4)));
julia> mixed_large = NullableArray(rand([1, :b, "c", false, Nullable()], Int(1e6)));
julia> few_nulls = many_nulls = NullableArray(rand(vcat(collect(1:100), Nullable()), Int(1e6)));
julia> many_nulls = NullableArray(rand([1, Nullable()], Int(1e6)));
julia> @benchmark dropnull!(copy(small))
BenchmarkTools.Trial:
memory estimate: 1.55 kb
allocs estimate: 7
--------------
minimum time: 642.788 ns (0.00% GC)
median time: 683.497 ns (0.00% GC)
mean time: 776.364 ns (6.31% GC)
maximum time: 9.129 μs (82.14% GC)
--------------
samples: 10000
evals/sample: 165
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dropnull_new!(copy(small))
BenchmarkTools.Trial:
memory estimate: 2.94 kb
allocs estimate: 61
--------------
minimum time: 1.079 μs (0.00% GC)
median time: 1.178 μs (0.00% GC)
mean time: 1.447 μs (12.53% GC)
maximum time: 179.129 μs (97.49% GC)
--------------
samples: 10000
evals/sample: 10
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dropnull!(copy(medium))
BenchmarkTools.Trial:
memory estimate: 104.23 kb
allocs estimate: 8
--------------
minimum time: 67.861 μs (0.00% GC)
median time: 74.607 μs (0.00% GC)
mean time: 94.134 μs (11.43% GC)
maximum time: 3.151 ms (93.16% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dropnull_new!(copy(medium))
BenchmarkTools.Trial:
memory estimate: 214.58 kb
allocs estimate: 4048
--------------
minimum time: 123.054 μs (0.00% GC)
median time: 128.369 μs (0.00% GC)
mean time: 150.273 μs (8.13% GC)
maximum time: 2.406 ms (92.09% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dropnull!(copy(large))
BenchmarkTools.Trial:
memory estimate: 10.11 mb
allocs estimate: 10
--------------
minimum time: 7.744 ms (0.00% GC)
median time: 9.143 ms (0.00% GC)
mean time: 9.497 ms (6.86% GC)
maximum time: 14.874 ms (12.86% GC)
--------------
samples: 527
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dropnull_new!(copy(large))
BenchmarkTools.Trial:
memory estimate: 20.81 mb
allocs estimate: 400533
--------------
minimum time: 13.630 ms (0.00% GC)
median time: 15.054 ms (5.80% GC)
mean time: 15.698 ms (6.28% GC)
maximum time: 30.699 ms (6.70% GC)
--------------
samples: 319
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dropnull!(copy(mixed_small))
BenchmarkTools.Trial:
memory estimate: 1.45 kb
allocs estimate: 7
--------------
minimum time: 812.817 ns (0.00% GC)
median time: 863.750 ns (0.00% GC)
mean time: 975.505 ns (4.84% GC)
maximum time: 13.684 μs (89.38% GC)
--------------
samples: 10000
evals/sample: 82
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dropnull_new!(copy(mixed_small))
BenchmarkTools.Trial:
memory estimate: 2.63 kb
allocs estimate: 51
--------------
minimum time: 1.155 μs (0.00% GC)
median time: 1.239 μs (0.00% GC)
mean time: 1.513 μs (10.85% GC)
maximum time: 195.910 μs (97.78% GC)
--------------
samples: 10000
evals/sample: 10
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dropnull!(copy(mixed_medium))
BenchmarkTools.Trial:
memory estimate: 104.17 kb
allocs estimate: 8
--------------
minimum time: 111.944 μs (0.00% GC)
median time: 123.489 μs (0.00% GC)
mean time: 150.206 μs (7.48% GC)
maximum time: 4.378 ms (76.66% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dropnull_new!(copy(mixed_medium))
BenchmarkTools.Trial:
memory estimate: 214.33 kb
allocs estimate: 4040
--------------
minimum time: 161.621 μs (0.00% GC)
median time: 173.905 μs (0.00% GC)
mean time: 201.716 μs (6.91% GC)
maximum time: 3.736 ms (69.56% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dropnull!(copy(mixed_large))
BenchmarkTools.Trial:
memory estimate: 10.11 mb
allocs estimate: 10
--------------
minimum time: 11.913 ms (0.00% GC)
median time: 14.260 ms (0.00% GC)
mean time: 14.731 ms (8.55% GC)
maximum time: 20.101 ms (19.81% GC)
--------------
samples: 340
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dropnull_new!(copy(mixed_large))
BenchmarkTools.Trial:
memory estimate: 20.76 mb
allocs estimate: 399065
--------------
minimum time: 17.157 ms (0.00% GC)
median time: 20.291 ms (11.12% GC)
mean time: 20.787 ms (11.60% GC)
maximum time: 28.305 ms (16.95% GC)
--------------
samples: 241
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dropnull!(copy(few_nulls))
BenchmarkTools.Trial:
memory estimate: 8.66 mb
allocs estimate: 10
--------------
minimum time: 2.893 ms (0.00% GC)
median time: 3.990 ms (0.00% GC)
mean time: 4.574 ms (15.03% GC)
maximum time: 9.104 ms (36.32% GC)
--------------
samples: 1093
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dropnull_new!(copy(few_nulls))
BenchmarkTools.Trial:
memory estimate: 9.18 mb
allocs estimate: 19547
--------------
minimum time: 3.542 ms (0.00% GC)
median time: 4.698 ms (0.00% GC)
mean time: 5.352 ms (14.03% GC)
maximum time: 9.884 ms (29.17% GC)
--------------
samples: 934
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dropnull!(copy(many_nulls))
BenchmarkTools.Trial:
memory estimate: 12.40 mb
allocs estimate: 10
--------------
minimum time: 15.073 ms (0.00% GC)
median time: 17.257 ms (0.00% GC)
mean time: 17.450 ms (5.63% GC)
maximum time: 22.696 ms (14.95% GC)
--------------
samples: 287
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dropnull_new!(copy(many_nulls))
BenchmarkTools.Trial:
memory estimate: 39.08 mb
allocs estimate: 999453
--------------
minimum time: 26.177 ms (5.76% GC)
median time: 29.440 ms (9.93% GC)
mean time: 29.850 ms (9.74% GC)
maximum time: 36.725 ms (11.56% GC)
--------------
samples: 168
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
Hope that isn't going overboard with the benchmarking. Any additional suggestions?
Any ideas how to get around the @inbounds issue or am I missing something? Or should we hold back on this until your pull request to allow boolean indexing with deleteat! is merged and use that?
Sorry for the late reply, I had missed your comments. The problem is that @inbounds
currently returns nothing
instead of the wrapped value. So you'll have to do something like:
@inbounds res = deleteat!(X, (i for i in eachindex(X) if X.isnull[i])).values
return res
Though your benchmarks show lots of allocations which shouldn't be there, which could indicate a deeper issue (maybe with deleteat!
and inference). Since this can be made more efficient later by using logical indexing, I wouldn't spend too much time on this and I agree it's OK to use find
for now.
Thanks so much for your help and feedback on this @nalimilan! @ararslan too!
I wanted to add a
dropnull!
function here before extending bothdropnull
anddropnull!
in DataFrames (see https://github.com/JuliaStats/DataFrames.jl/issues/602 https://github.com/JuliaStats/DataFrames.jl/issues/1156)This pull request currently does not pass the test cases that have been established for the current
dropnull
function, which I have used for testingdropnull!
output
It's unclear to me why in the second test case
dropnull
passes whiledropnull!
doesn't even though the outputs are bothArray{Any,1}
. Usingisequal()
does not change the test outcome.I'm also interested in feedback on the third test case, where
dropnull
producesArray{Any,1}
anddropnull!
producesArray{Nullable,1}
. Is there a reason to prefer one over the other?The benchmarks look ok except for
dropnull!(C)
, where the inplace version is slower thandropnull(C)
for reasons I don't understand.dropnull!(D)
makes more allocations thandropnull(D)
but is still fasterbenchmark
results
I think that's enough questions for one post. Thanks in advance for any advice on how to proceed.