Mapreduce performance - Githubissues

davidagold commented 9 years ago

Here are my results from running the profiling methods included in https://github.com/johnmyleswhite/NullableArrays.jl/blob/master/perf/mapreduce.jl:

julia> profile_all()
Method: mapreduce(f, op, A)
  for Array{Float64}:           246.070 milliseconds (15008 k allocations: 229 MB, 9.91% gc time)
  for NullableArray{Float64}:   417.401 milliseconds (15008 k allocations: 458 MB, 15.67% gc time)
  for DataArray{Float64}:       278.743 milliseconds (15008 k allocations: 229 MB, 9.67% gc time)
Method: reduce(op, A)
  for Array{Float64}:             2.629 milliseconds (1 allocation: 16 bytes)
  for NullableArray{Float64}:     5.286 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         2.851 milliseconds (2 allocations: 96 bytes)
Method: sum(A)
  for Array{Float64}:             2.810 milliseconds
  for NullableArray{Float64}:     6.432 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         5.209 milliseconds (2 allocations: 96 bytes)
Method: sum(f, A)
  for Array{Float64}:           252.220 milliseconds (15008 k allocations: 229 MB, 10.83% gc time)
  for NullableArray{Float64}:   424.200 milliseconds (15008 k allocations: 458 MB, 15.71% gc time)
  for DataArray{Float64}:       269.271 milliseconds (15008 k allocations: 229 MB, 9.69% gc time)
Method: prod(A)
  for Array{Float64}:             8.409 milliseconds
  for NullableArray{Float64}:    23.911 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         8.767 milliseconds (2 allocations: 96 bytes)
Method: prod(f, A)
  for Array{Float64}:           255.930 milliseconds (15008 k allocations: 229 MB, 10.65% gc time)
  for NullableArray{Float64}:       410 milliseconds (15008 k allocations: 458 MB, 16.58% gc time)
  for DataArray{Float64}:       272.405 milliseconds (15008 k allocations: 229 MB, 9.38% gc time)
Method: minimum(A)
  for Array{Float64}:             3.702 milliseconds
  for NullableArray{Float64}:    33.607 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         6.642 milliseconds (2 allocations: 96 bytes)
Method: minimum(f, A)
  for Array{Float64}:           238.662 milliseconds (10000 k allocations: 153 MB, 7.49% gc time)
  for NullableArray{Float64}:   517.422 milliseconds (15000 k allocations: 381 MB, 13.49% gc time)
  for DataArray{Float64}:       254.674 milliseconds (10000 k allocations: 153 MB, 7.48% gc time)
Method: maximum(A)
  for Array{Float64}:             7.461 milliseconds
  for NullableArray{Float64}:    24.961 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         6.270 milliseconds (2 allocations: 96 bytes)
Method: maximum(f, A)
  for Array{Float64}:           239.670 milliseconds (10000 k allocations: 153 MB, 8.19% gc time)
  for NullableArray{Float64}:   511.292 milliseconds (15000 k allocations: 381 MB, 14.05% gc time)
  for DataArray{Float64}:       245.280 milliseconds (10000 k allocations: 153 MB, 8.63% gc time)
Method: sumabs(A)
  for Array{Float64}:             2.756 milliseconds
  for NullableArray{Float64}:    28.615 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         2.654 milliseconds (2 allocations: 96 bytes)
Method: sumabs2(A)
  for Array{Float64}:             3.294 milliseconds
  for NullableArray{Float64}:    28.927 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         3.470 milliseconds (2 allocations: 96 bytes)

julia> profile_skip(true)
Comparison of skipnull/skipNA methods

f := IdFun(), op := AddFun()
mapreduce(f, op, X; skipnull/skipNA=true) (0 missing entries)
  for NullableArray{Float64}:     8.569 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         2.741 milliseconds (2 allocations: 96 bytes)

reduce(op, X; skipnull/skipNA=true) (0 missing entries)
  for NullableArray{Float64}:     9.563 milliseconds (3 allocations: 192 bytes)
  for DataArray{Float64}:         2.736 milliseconds (3 allocations: 176 bytes)

mapreduce(f, op, X; skipnull/skipNA=true) (~half missing entries)
  for NullableArray{Float64}:    52.518 milliseconds (2 allocations: 96 bytes)
  for DataArray{Float64}:        53.641 milliseconds (2 allocations: 96 bytes)

reduce(op, X; skipnull/skipNA=true) (~half missing entries)
  for NullableArray{Float64}:    54.764 milliseconds (3 allocations: 176 bytes)
  for DataArray{Float64}:        51.647 milliseconds (3 allocations: 176 bytes)

julia> profile_skip(false)
Comparison of skipnull/skipNA methods

f := IdFun(), op := AddFun()
mapreduce(f, op, X; skipnull/skipNA=false) (0 missing entries)
  for NullableArray{Float64}:     5.855 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         2.911 milliseconds (2 allocations: 96 bytes)

reduce(op, X; skipnull/skipNA=false) (0 missing entries)
  for NullableArray{Float64}:     9.874 milliseconds (3 allocations: 192 bytes)
  for DataArray{Float64}:         5.123 milliseconds (3 allocations: 176 bytes)

mapreduce(f, op, X; skipnull/skipNA=false) (~half missing entries)
  for NullableArray{Float64}:     5.486 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:           753 nanoseconds  (1 allocation: 80 bytes)

reduce(op, X; skipnull/skipNA=false) (~half missing entries)
  for NullableArray{Float64}:     6.782 milliseconds (3 allocations: 192 bytes)
  for DataArray{Float64}:        50.259 milliseconds (3 allocations: 176 bytes)

julia> profile_skip_impl()
Comparison of internal skip methods:
mapreduce_impl_skipnull(f, op, X) (0 missing entries)
  for NullableArray{Float64}:     7.470 milliseconds (1 allocation: 16 bytes)
  for DataArray{Float64}:         3.644 milliseconds (1 allocation: 16 bytes)

mapreduce_impl_skipnull(f, op, X) (~half missing entries)
  for NullableArray{Float64}:    29.196 milliseconds (1 allocation: 16 bytes)
  for DataArray{Float64}:        51.622 milliseconds (1 allocation: 16 bytes)

Though more than for broadcast, there is still relatively little specialized code for mapreduce other specialized methods for skipping over null entries and hooking them into the general mapreduce interface. A few noteworthy items:

sumabs and sumabs2 are 10x faster than they were last week without my having touched them. I suspect this is due to improvements in Base Julia, possibly to do with codegen? Now I'm beginning to understand the importance of rigorously tracking performance against environmental variables, since it would have been interesting to see what caused the speedup.
The internal skipnull method is twice as fast for NullableArrays as it is for DataArrays, but that speedup is lost by the time get to the exposed interface.
Again, we see higher allocations in general for NullableArrays.

davidagold commented 9 years ago

The tweaks in https://github.com/johnmyleswhite/NullableArrays.jl/commit/76309a08b09b57e47cbfb23bb1a6243c18abfe9b#diff-88161f25d2abf2976952b3a2c500b755 appear to have closed some of the performance gaps that can be seen above. The remaining gaps seem restricted to cases in which a mapreduce-related method is called on a NullableArray{T} with empty values with skipnull=false. These gaps could be closed if we decide that, in such cases, it is okay simply to return Nullable{T}() instead of letting the type of the resultant empty Nullable object be determined by mapreducing without skipping.

IMO both approaches to determining the type of the empty Nullable to be returned are somewhat arbitrary. One the one hand, and for instance, applying such a method to an X::NullableArray{Int} might almost certainly return a Nullable{Float} if there are no null entries/if null entries are skipped. In this case, returning Nullable{Int}() may not seem appropriate. However, one could also imagine a case in which present values of X are so constrained that applying a different mapreduce method would always return a Nullable{Int} if there are no null entries/if null entries are skipped. In this case, letting the parameter of the returned Nullable object be determined by whatever values to which "empty" entries in the X.values field happen to be initialized seems entirely whimsical.

I'm slightly more of the mind that such cases ought to return Nullable{T}() and avoid determining the type via lifted computation. I think this approach has the virtues of predictability and performance, at least with respect to actually returning the empty Nullable result from the mapreduce method. If users require another behavior, they can annotate the returned object.

Most recent run:

julia> profile_reduce_methods()
Method: mapreduce(f, op, A) (0 missing entries)
  for Array{Float64}:           297.994 milliseconds (15008 k allocations: 229 MB, 19.00% gc time)
  for NullableArray{Float64}:   302.860 milliseconds (15008 k allocations: 229 MB, 18.29% gc time)
  for DataArray{Float64}:       327.860 milliseconds (15008 k allocations: 229 MB, 17.56% gc time)

Method: mapreduce(f, op, A) (~half missing entries, skip=false)
  for NullableArray{Float64}:   498.262 milliseconds (15008 k allocations: 458 MB, 28.21% gc time)
  for DataArray{Float64}:       277.049 milliseconds (5009 k allocations: 78259 KB, 7.22% gc time)

Method: mapreduce(f, op, A) (~half missing entries, skip=true)
  for NullableArray{Float64}:   237.232 milliseconds (7542 k allocations: 115 MB, 11.15% gc time)
  for DataArray{Float64}:       224.135 milliseconds (7517 k allocations: 115 MB, 14.12% gc time)

Method: reduce(f, op, A) (0 missing entries)
  for Array{Float64}:             2.716 milliseconds (1 allocation: 16 bytes)
  for NullableArray{Float64}:     5.723 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         2.728 milliseconds (2 allocations: 96 bytes)

Method: reduce(f, op, A) (~half missing entries, skip=false)
  for NullableArray{Float64}:     6.121 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         4.340 microseconds (1 allocation: 80 bytes)

Method: reduce(f, op, A) (~half missing entries, skip=true)
  for NullableArray{Float64}:    52.310 milliseconds (3 allocations: 192 bytes)
  for DataArray{Float64}:        52.839 milliseconds (3 allocations: 176 bytes)

Method: sum(A) (0 missing entries)
  for Array{Float64}:             3.348 milliseconds (1 allocation: 16 bytes)
  for NullableArray{Float64}:     7.068 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         3.102 milliseconds (2 allocations: 96 bytes)

Method: sum(f, A) (0 missing entries)
  for Array{Float64}:           317.245 milliseconds (15008 k allocations: 229 MB, 18.68% gc time)
  for NullableArray{Float64}:   295.854 milliseconds (15008 k allocations: 229 MB, 18.23% gc time)
  for DataArray{Float64}:       327.310 milliseconds (15008 k allocations: 229 MB, 17.98% gc time)

Method: prod(A) (0 missing entries)
  for Array{Float64}:             8.554 milliseconds (1 allocation: 16 bytes)
  for NullableArray{Float64}:    12.285 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:        11.440 milliseconds (2 allocations: 96 bytes)

Method: prod(f, A) (0 missing entries)
  for Array{Float64}:           302.056 milliseconds (15008 k allocations: 229 MB, 16.97% gc time)
  for NullableArray{Float64}:   320.901 milliseconds (15008 k allocations: 229 MB, 18.20% gc time)
  for DataArray{Float64}:       331.193 milliseconds (15008 k allocations: 229 MB, 15.79% gc time)

Method: minimum(A) (0 missing entries)
  for Array{Float64}:             3.910 milliseconds (1 allocation: 16 bytes)
  for NullableArray{Float64}:     6.827 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         4.421 milliseconds (2 allocations: 96 bytes)

Method: minimum(f, A) (0 missing entries)
  for Array{Float64}:           278.363 milliseconds (10000 k allocations: 153 MB, 13.22% gc time)
  for NullableArray{Float64}:   282.217 milliseconds (10000 k allocations: 153 MB, 12.66% gc time)
  for DataArray{Float64}:       283.047 milliseconds (10000 k allocations: 153 MB, 12.57% gc time)

Method: maximum(A) (0 missing entries)
  for Array{Float64}:             4.610 milliseconds (1 allocation: 16 bytes)
  for NullableArray{Float64}:     6.200 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         4.316 milliseconds (2 allocations: 96 bytes)

Method: maximum(f, A) (0 missing entries)
  for Array{Float64}:           300.411 milliseconds (10000 k allocations: 153 MB, 13.01% gc time)
  for NullableArray{Float64}:   309.668 milliseconds (10000 k allocations: 153 MB, 12.73% gc time)
  for DataArray{Float64}:       341.655 milliseconds (10000 k allocations: 153 MB, 12.69% gc time)

Method: sum(A) (~half missing entries, skip=false)
  for NullableArray{Float64}:     6.230 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         4.670 microseconds (1 allocation: 80 bytes)

Method: sum(f, A) (~half missing entries, skip=false)
  for NullableArray{Float64}:   504.233 milliseconds (15008 k allocations: 458 MB, 28.99% gc time)
  for DataArray{Float64}:       286.009 milliseconds (5009 k allocations: 78259 KB, 7.37% gc time)

Method: prod(A) (~half missing entries, skip=false)
  for NullableArray{Float64}:    24.736 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:        24.600 microseconds (1 allocation: 80 bytes)

Method: prod(f, A) (~half missing entries, skip=false)
  for NullableArray{Float64}:   485.244 milliseconds (15008 k allocations: 458 MB, 30.08% gc time)
  for DataArray{Float64}:       287.764 milliseconds (5009 k allocations: 78259 KB, 7.06% gc time)

Method: minimum(A) (~half missing entries, skip=false)
  for NullableArray{Float64}:    13.752 microseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         3.793 microseconds (1 allocation: 80 bytes)

Method: minimum(f, A) (~half missing entries, skip=false)
  for NullableArray{Float64}:   334.809 milliseconds (7501 k allocations: 191 MB, 22.37% gc time)
  minimum(f, A::DataArray) currently incurs error

Method: maximum(A) (~half missing entries, skip=false)
  for NullableArray{Float64}:    39.744 microseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         5.082 microseconds (1 allocation: 80 bytes)

Method: maximum(f, A) (~half missing entries, skip=false)
  for NullableArray{Float64}:   317.843 milliseconds (7501 k allocations: 191 MB, 22.94% gc time)
  maximum(f, A::DataArray) currently incurs error

Method: sum(A) (~half missing entries, skip=true)
  for NullableArray{Float64}:        49 milliseconds (3 allocations: 192 bytes)
  for DataArray{Float64}:        53.485 milliseconds (3 allocations: 176 bytes)

Method: sum(f, A) (~half missing entries, skip=true)
  for NullableArray{Float64}:   236.735 milliseconds (7542 k allocations: 115 MB, 11.03% gc time)
  for DataArray{Float64}:       208.877 milliseconds (7517 k allocations: 115 MB, 14.66% gc time)

Method: prod(A) (~half missing entries, skip=true)
  for NullableArray{Float64}:    53.514 milliseconds (3 allocations: 192 bytes)
  for DataArray{Float64}:        52.948 milliseconds (3 allocations: 176 bytes)

Method: prod(f, A) (~half missing entries, skip=true)
  for NullableArray{Float64}:   213.132 milliseconds (7501 k allocations: 114 MB, 12.32% gc time)
  for DataArray{Float64}:       205.867 milliseconds (7501 k allocations: 114 MB, 16.28% gc time)

Method: minimum(A) (~half missing entries, skip=true)
  for NullableArray{Float64}:    55.691 milliseconds (3 allocations: 192 bytes)
  for DataArray{Float64}:        49.199 milliseconds (3 allocations: 176 bytes)

Method: minimum(f, A) (~half missing entries, skip=true)
  for NullableArray{Float64}:   220.994 milliseconds (7501 k allocations: 114 MB, 13.32% gc time)
  for DataArray{Float64}:       215.215 milliseconds (7501 k allocations: 114 MB, 14.37% gc time)

Method: maximum(A) (~half missing entries, skip=true)
  for NullableArray{Float64}:    59.045 milliseconds (3 allocations: 192 bytes)
  for DataArray{Float64}:        52.982 milliseconds (3 allocations: 176 bytes)

Method: maximum(f, A) (~half missing entries, skip=true)
  for NullableArray{Float64}:   209.707 milliseconds (7501 k allocations: 114 MB, 13.96% gc time)
  for DataArray{Float64}:       225.224 milliseconds (7501 k allocations: 114 MB, 12.86% gc time)

Method: sumabs(A) (0 missing entries)
  for Array{Float64}:             2.486 milliseconds (1 allocation: 16 bytes)
  for NullableArray{Float64}:     4.855 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         2.606 milliseconds (2 allocations: 96 bytes)

Method: sumabs2(A) (0 missing entries)
  for Array{Float64}:             8.069 milliseconds (1 allocation: 16 bytes)
  for NullableArray{Float64}:     5.849 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         5.690 milliseconds (2 allocations: 96 bytes)

Method: sumabs(A) (~half missing entries, skip=false)
  for NullableArray{Float64}:    31.779 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         4.673 microseconds (1 allocation: 80 bytes)

Method: sumabs2(A) (~half missing entries, skip=false)
  for NullableArray{Float64}:    28.866 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         4.015 microseconds (1 allocation: 80 bytes)

Method: sumabs(A) (~half missing entries, skip=true)
  for NullableArray{Float64}:    50.623 milliseconds (3 allocations: 192 bytes)
  for DataArray{Float64}:        56.097 milliseconds (3 allocations: 176 bytes)

Method: sumabs2(A) (~half missing entries, skip=true)
  for NullableArray{Float64}:    57.878 milliseconds (3 allocations: 192 bytes)
  for DataArray{Float64}:        56.935 milliseconds (3 allocations: 176 bytes)

nalimilan commented 9 years ago

I'm not sure I completely understand the problem, but this sounds similar to the question of the type to return from a comprehension on an empty iterator input. One of the solutions discussed there is to return an Array{Union()}: https://github.com/JuliaLang/julia/issues/7258#issuecomment-82602572 https://github.com/JuliaLang/julia/pull/11034

Would it make sense to return a Nullable{Union()} here?

davidagold commented 9 years ago

That issue does seem quite similar, insofar as it too concerns what to do with the eltype of an empty container returned from some computation. But the present issue is somewhat different. Let me try to make it more explicit.

Currently, lifted binary operators such as +(x::Nullable{S1}, y::Nullable{S2}) will return Nullable(x.value + y.value, x.isnull | y.isnull) if S1 and S2 are isbits types, and will throw an error otherwise. Thus we obtain the following behavior:

julia> Y = NullableArray(Int, 5)
5-element NullableArrays.NullableArray{Int64,1}:
 Nullable{Int64}()
 Nullable{Int64}()
 Nullable{Int64}()
 Nullable{Int64}()
 Nullable{Int64}()

julia> Y[2:5] = [2:5...]
4-element Array{Int64,1}:
 2
 3
 4
 5

julia> Y
5-element NullableArrays.NullableArray{Int64,1}:
 Nullable{Int64}()
 Nullable(2)      
 Nullable(3)      
 Nullable(4)      
 Nullable(5)  

julia> reduce(+, Y)
Nullable{Int64}()

julia> ans.value
12884901905

This is fine, since you oughtn't to expect a meaningful (or even safe) answer from retrieving the value field of a null Nullable. However, it does mean that, depending on the nature of the reducing operation op, the "value" of a null entry can determine the type of the empty Nullable object returned from reducing op over a NullableArray:

julia> op(x::Nullable, y::Nullable) = Nullable(x.value + y.value < 10 ? 5 : 5.0, x.isnull | y.isnull)
op (generic function with 1 method)

julia> X = NullableArray([1:5...], [fill(false, 4); true])
5-element NullableArrays.NullableArray{Int64,1}:
 Nullable(1)      
 Nullable(2)      
 Nullable(3)      
 Nullable(4)      
 Nullable{Int64}()

julia> reduce(op, X)
Nullable{Float64}()

julia> X.values[5] = 1
1

julia> X[5]
Nullable{Int64}()

julia> reduce(op, X)
Nullable{Int64}()

Arguably, any user who writes such a whimsical function probably deserves equally whimsical behavior when reducing with that function. That being said, I think this behavior is less than desirable, since it is reasonable to expect that computing the return type of an empty Nullable oughtn't to depend on the values of null Nullables involved in the computation.

Note that this issue only concerns instances where one calls a reducing method with skipnull=false, which is the default. If skipnull is set to true, then the "values" of null entries don't affect any aspect of the resultant Nullable object.

One alternative is to forego the reducing computations entirely in the presence of null entries and simply return Nullable{T}(), where T is computed in a way that does not involve the values of null entries. Obvious candidates are to set T equal to

eltype(X)
Base.return_types(f, (eltype(X), eltype(X)))
eltype(reduce(op, X, skipnull=true)) (i.e. the eltype of the non-null Nullable that would have been returned had reduce been called with skipnull=true)
Union{}

Thoughts about these options:

eltype(X) totally ignores the fact that the elements of X are being reduced.
Relying on type inference to dictate the return type of empty Nullables incurs all of the problems mentioned in the issue that @nalimilan referenced.
Probably the most accurate, definitely the most expensive of these four options.
The thing about Union{} is that it may introduce performance problems down the line and it also doesn't capitalize on the information that is available about X and op. The latter point is also the reason for which the issue concerning comprehensions over empty arrays isn't entirely analogous.

Some more thoughts:

This issue also comes up for mapreduce, since one can imagine equally pathological functions passed for the map part.
Note that DataArrays does not experience this problem at all; in cases such as the above, reduce methods simply return NA. That's why the current performance discrepancies show up in methods like:

Method: prod(A) (~half missing entries, skip=false)
  for NullableArray{Float64}:    24.736 milliseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:        24.600 microseconds (1 allocation: 80 bytes)

A limited subset of reducing operations, such as comparison operators, can just use eltype(X). I've implemented this where possible, hence

Method: minimum(A) (~half missing entries, skip=false)
  for NullableArray{Float64}:    13.752 microseconds (2 allocations: 112 bytes)
  for DataArray{Float64}:         3.793 microseconds (1 allocation: 80 bytes)

Comparing

Method: sum(f, A) (~half missing entries, skip=false)
  for NullableArray{Float64}:   504.233 milliseconds (15008 k allocations: 458 MB, 28.99% gc time)
  for DataArray{Float64}:       286.009 milliseconds (5009 k allocations: 78259 KB, 7.37% gc time)

and

Method: sum(f, A) (~half missing entries, skip=true)
  for NullableArray{Float64}:   236.735 milliseconds (7542 k allocations: 115 MB, 11.03% gc time)
  for DataArray{Float64}:       208.877 milliseconds (7517 k allocations: 115 MB, 14.66% gc time)

seems to provide good reason at least to adopt option 3 for now, since, at least based on these numbers, this ought to perform much better than the current implementation and possibly on a par with the DataArrays implementation.

JuliaStats / NullableArrays.jl

Mapreduce performance #35