JuliaStats / NullableArrays.jl

DEPRECATED Prototype of the new JuliaStats NullableArrays package

Remove the `isnull(A::NullableArray,I::Int...)` method? #22

Closed: quinnj closed this issue 9 years ago

quinnj commented 9 years ago

This would be useful for cases where I want to traverse a NullableArray.

davidagold commented 9 years ago

Thanks for filing an issue!

What is the difference between what you're requesting and the following method? https://github.com/johnmyleswhite/NullableArrays.jl/blob/master/src/03_primitives.jl#L118

Also, what precisely do you mean by "traverse"? Would a

isnull{T}(X::NullableArray, r::UnitRange{T})

method be relevant?

quinnj commented 9 years ago

Ah, perfect; yeah, I didn't think to look in the primitives file for this. I'm not sure a range-indexed isnull would really be that useful. The use case I have in mind is being able to SIMD-vectorize over a NullableArray:

using NullableArrays

# Sum the array, treating null entries as 0.0 so there is no branch on
# nullness inside the @simd loop body.
function vectorized2(n)
    A = NullableArrays.NullableArray(rand(n))
    s = 0.0
    @simd for i = 1:length(A)
        @inbounds s += ifelse(isnull(A, i), 0.0, NullableArrays.unsafe_getvalue_notnull(A, i))
    end
    return s
end
@code_llvm vectorized2(100)

The interface we may want to consider here would be a method definition like

getindex{T}(A::NullableArray{T}, i, x::T) = ifelse(isnull(A,i),x,NullableArrays.unsafe_getvalue_notnull(A,i))

This is similar to the Base `get(n, x)` method for Nullable.
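For reference, Base's two-argument get falls back to the supplied default when the Nullable is null (a quick illustrative example, not from the original thread):

get(Nullable(3), 0)      # 3: the value is present
get(Nullable{Int}(), 0)  # 0: the entry is null, so the default is returned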

davidagold commented 9 years ago

That does seem like it'd be useful. I'd like to separate indexing methods that don't return a Nullable object from getindex, so I would call that method getvalue instead.

quinnj commented 9 years ago

+1

johnmyleswhite commented 9 years ago

Do we know what the perf gains are from exposing these methods relative to returning an immutable? Seems like it could be large for non-isbits types, but small for isbits types.

davidagold commented 9 years ago

@johnmyleswhite Returning an immutable from where?

My understanding is that Jacob is using this method because @simd doesn't kick in if the indexing method employed in the inner loop returns a Nullable. Are you asking about comparing the performance of such an @simd enabled loop and a loop employing a different indexing method?

johnmyleswhite commented 9 years ago

I think I'm confused. I thought the reason that SIMD didn't work was because of branching, but I would think you could work around that with some unsafe indexing into a scalar Nullable. Is that not right?

quinnj commented 9 years ago

In some preliminary testing on Wednesday, we couldn't get SIMD to work when returning a Nullable. Using my code above, however, allows the vectorization.

davidagold commented 9 years ago

For what it's worth

using NullableArrays

A = rand(Float64, 5_000)
X = NullableArray(A)

# Sum by indexing into the NullableArray and reading each returned Nullable's value field.
function f(X::NullableArray)
    res = 0.0
    for i in 1:length(X)
        res += X[i].value
    end
    return res
end

@code_llvm(f(X))

returns

define double @julia_f_22418(%jl_value_t*) {
top:
  %1 = alloca [3 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [3 x %jl_value_t*]* %1, i64 0, i64 0
  %2 = getelementptr [3 x %jl_value_t*]* %1, i64 0, i64 2
  store %jl_value_t* inttoptr (i64 2 to %jl_value_t*), %jl_value_t** %.sub, align 8
  %3 = load %jl_value_t*** @jl_pgcstack, align 8
  %4 = getelementptr [3 x %jl_value_t*]* %1, i64 0, i64 1
  %.c = bitcast %jl_value_t** %3 to %jl_value_t*
  store %jl_value_t* %.c, %jl_value_t** %4, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t* null, %jl_value_t** %2, align 8
  %5 = getelementptr inbounds %jl_value_t* %0, i64 0, i32 0
  %6 = load %jl_value_t** %5, align 8
  %7 = getelementptr inbounds %jl_value_t* %6, i64 1
  %8 = bitcast %jl_value_t* %7 to i64*
  %9 = load i64* %8, align 8
  %10 = icmp sgt i64 %9, 0
  %11 = select i1 %10, i64 %9, i64 0
  %12 = icmp eq i64 %11, 0
  br i1 %12, label %L11, label %L.preheader

L.preheader:                                      ; preds = %top
  %13 = getelementptr inbounds %jl_value_t* %0, i64 1, i32 0
  %14 = load %jl_value_t** %13, align 8
  %15 = getelementptr inbounds %jl_value_t* %14, i64 1
  %16 = bitcast %jl_value_t* %15 to i64*
  %17 = load i64* %16, align 8
  %18 = bitcast %jl_value_t* %14 to i8**
  br label %L

L:                                                ; preds = %L8, %L.preheader
  %"#s16.0" = phi i64 [ %39, %L8 ], [ 1, %L.preheader ]
  %res.0 = phi double [ %40, %L8 ], [ 0.000000e+00, %L.preheader ]
  %19 = add i64 %"#s16.0", -1
  %20 = icmp ult i64 %19, %17
  br i1 %20, label %idxend, label %oob

oob:                                              ; preds = %L
  %21 = alloca i64, align 8
  store i64 %"#s16.0", i64* %21, align 8
  call void @jl_bounds_error_ints(%jl_value_t* %14, i64* %21, i64 1)
  unreachable

idxend:                                           ; preds = %L
  %22 = load i8** %18, align 8
  %23 = getelementptr i8* %22, i64 %19
  %24 = load i8* %23, align 1
  %25 = and i8 %24, 1
  %26 = icmp eq i8 %25, 0
  br i1 %26, label %L3, label %L8

L3:                                               ; preds = %idxend
  %27 = load %jl_value_t** %5, align 8
  %28 = getelementptr inbounds %jl_value_t* %27, i64 1
  %29 = bitcast %jl_value_t* %28 to i64*
  %30 = load i64* %29, align 8
  %31 = icmp ult i64 %19, %30
  br i1 %31, label %idxend5, label %oob4

oob4:                                             ; preds = %L3
  %32 = alloca i64, align 8
  store i64 %"#s16.0", i64* %32, align 8
  call void @jl_bounds_error_ints(%jl_value_t* %27, i64* %32, i64 1)
  unreachable

idxend5:                                          ; preds = %L3
  %33 = bitcast %jl_value_t* %27 to i8**
  %34 = load i8** %33, align 8
  %35 = bitcast i8* %34 to double*
  %36 = getelementptr double* %35, i64 %19
  %37 = load double* %36, align 8
  %38 = insertvalue %Nullable.9 { i8 0, double undef }, double %37, 1
  br label %L8

L8:                                               ; preds = %idxend5, %idxend
  %storemerge = phi %Nullable.9 [ %38, %idxend5 ], [ { i8 1, double undef }, %idxend ]
  %39 = add i64 %"#s16.0", 1
  %storemerge13 = extractvalue %Nullable.9 %storemerge, 1
  %40 = fadd double %res.0, %storemerge13
  %41 = icmp eq i64 %"#s16.0", %11
  br i1 %41, label %L11, label %L

L11:                                              ; preds = %L8, %top
  %res.1 = phi double [ 0.000000e+00, %top ], [ %40, %L8 ]
  %42 = load %jl_value_t** %4, align 8
  %43 = getelementptr inbounds %jl_value_t* %42, i64 0, i32 0
  store %jl_value_t** %43, %jl_value_t*** @jl_pgcstack, align 8
  ret double %res.1
}

Jacob speculated that it's the need to allocate when constructing the Nullable object returned from getindex that precludes vectorization.

davidagold commented 9 years ago

Oh wait, I didn't call @simd. That could also do it.

davidagold commented 9 years ago
julia> function f(X::NullableArray)
           res = 0.0
           @simd for i in 1:length(X)
               @inbounds res += X[i].value
           end
           return res
       end
f (generic function with 1 method)

julia> @code_llvm(f(X))

define double @julia_f_22419(%jl_value_t*) {
L:
  %1 = alloca [3 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [3 x %jl_value_t*]* %1, i64 0, i64 0
  %2 = getelementptr [3 x %jl_value_t*]* %1, i64 0, i64 2
  store %jl_value_t* inttoptr (i64 2 to %jl_value_t*), %jl_value_t** %.sub, align 8
  %3 = load %jl_value_t*** @jl_pgcstack, align 8
  %4 = getelementptr [3 x %jl_value_t*]* %1, i64 0, i64 1
  %.c = bitcast %jl_value_t** %3 to %jl_value_t*
  store %jl_value_t* %.c, %jl_value_t** %4, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t* null, %jl_value_t** %2, align 8
  %5 = getelementptr inbounds %jl_value_t* %0, i64 0, i32 0
  %6 = load %jl_value_t** %5, align 8
  %7 = getelementptr inbounds %jl_value_t* %6, i64 1
  %8 = bitcast %jl_value_t* %7 to i64*
  %9 = load i64* %8, align 8
  %10 = icmp sgt i64 %9, 0
  %11 = select i1 %10, i64 %9, i64 0
  %12 = getelementptr inbounds %jl_value_t* %0, i64 1, i32 0
  %13 = call { i64, i1 } @llvm.ssub.with.overflow.i64(i64 %11, i64 1)
  %14 = extractvalue { i64, i1 } %13, 1
  br i1 %14, label %fail, label %pass

fail:                                             ; preds = %L
  %15 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %15, i32 67)
  unreachable

pass:                                             ; preds = %L
  %16 = extractvalue { i64, i1 } %13, 0
  %17 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %16, i64 1)
  %18 = extractvalue { i64, i1 } %17, 1
  br i1 %18, label %fail1, label %pass2

fail1:                                            ; preds = %pass
  %19 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %19, i32 67)
  unreachable

pass2:                                            ; preds = %pass
  %20 = extractvalue { i64, i1 } %17, 0
  %21 = icmp slt i64 %20, 1
  br i1 %21, label %L17, label %if3

if3:                                              ; preds = %pass2
  %22 = load %jl_value_t** %12, align 8
  %23 = bitcast %jl_value_t* %22 to i8**
  %24 = load i8** %23, align 8
  br label %L5

L5:                                               ; preds = %L11, %if3
  %"##i#10662.0" = phi i64 [ 0, %if3 ], [ %37, %L11 ]
  %res.1 = phi double [ 0.000000e+00, %if3 ], [ %36, %L11 ]
  %25 = getelementptr i8* %24, i64 %"##i#10662.0"
  %26 = load i8* %25, align 1
  %27 = and i8 %26, 1
  %28 = icmp eq i8 %27, 0
  br i1 %28, label %L8, label %L11

L8:                                               ; preds = %L5
  %29 = load %jl_value_t** %5, align 8
  %30 = bitcast %jl_value_t* %29 to i8**
  %31 = load i8** %30, align 8
  %32 = bitcast i8* %31 to double*
  %33 = getelementptr double* %32, i64 %"##i#10662.0"
  %34 = load double* %33, align 8
  %35 = insertvalue %Nullable.9 { i8 0, double undef }, double %34, 1
  br label %L11

L11:                                              ; preds = %L8, %L5
  %storemerge = phi %Nullable.9 [ %35, %L8 ], [ { i8 1, double undef }, %L5 ]
  %storemerge19 = extractvalue %Nullable.9 %storemerge, 1
  %36 = fadd fast double %res.1, %storemerge19
  %37 = add i64 %"##i#10662.0", 1
  %exitcond = icmp eq i64 %37, %20
  br i1 %exitcond, label %L17, label %L5

L17:                                              ; preds = %L11, %pass2
  %res.3 = phi double [ 0.000000e+00, %pass2 ], [ %36, %L11 ]
  %38 = load %jl_value_t** %4, align 8
  %39 = getelementptr inbounds %jl_value_t* %38, i64 0, i32 0
  store %jl_value_t** %39, %jl_value_t*** @jl_pgcstack, align 8
  ret double %res.3
}

still doesn't vectorize (right?).

johnmyleswhite commented 9 years ago
julia> X = rand(Float64, 5_000)
5000-element Array{Float64,1}:
 0.219656 
 0.841695 
 0.521038 
 0.0601669
 0.542906 
 0.136611 
 0.63456  
 0.0106102
 0.149608 
 0.460036 
 ⋮        
 0.804701 
 0.804945 
 0.58259  
 0.475501 
 0.708776 
 0.836339 
 0.374256 
 0.298452 
 0.230938 

julia> function f(X::Array)
           res = 0.0
           @simd for i in 1:length(X)
               @inbounds res += X[i]
           end
           return res
       end
f (generic function with 1 method)

julia> @code_llvm(f(X))

define double @julia_f_21563(%jl_value_t*) {
L:
  %1 = getelementptr inbounds %jl_value_t* %0, i64 1
  %2 = bitcast %jl_value_t* %1 to i64*
  %3 = load i64* %2, align 8
  %4 = icmp sgt i64 %3, 0
  %5 = select i1 %4, i64 %3, i64 0
  %6 = bitcast %jl_value_t* %0 to i8**
  %7 = call { i64, i1 } @llvm.ssub.with.overflow.i64(i64 %5, i64 1)
  %8 = extractvalue { i64, i1 } %7, 1
  br i1 %8, label %fail, label %pass

fail:                                             ; preds = %L
  %9 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %9, i32 67)
  unreachable

pass:                                             ; preds = %L
  %10 = extractvalue { i64, i1 } %7, 0
  %11 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %10, i64 1)
  %12 = extractvalue { i64, i1 } %11, 1
  br i1 %12, label %fail1, label %pass2

fail1:                                            ; preds = %pass
  %13 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %13, i32 67)
  unreachable

pass2:                                            ; preds = %pass
  %14 = extractvalue { i64, i1 } %11, 0
  %15 = icmp slt i64 %14, 1
  br i1 %15, label %L11, label %if3

if3:                                              ; preds = %pass2
  %16 = load i8** %6, align 8
  %17 = bitcast i8* %16 to double*
  %n.vec = and i64 %14, -16
  %cmp.zero = icmp eq i64 %n.vec, 0
  br i1 %cmp.zero, label %middle.block, label %vector.body

vector.body:                                      ; preds = %vector.body, %if3
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %if3 ]
  %vec.phi = phi <4 x double> [ %26, %vector.body ], [ zeroinitializer, %if3 ]
  %vec.phi15 = phi <4 x double> [ %27, %vector.body ], [ zeroinitializer, %if3 ]
  %vec.phi16 = phi <4 x double> [ %28, %vector.body ], [ zeroinitializer, %if3 ]
  %vec.phi17 = phi <4 x double> [ %29, %vector.body ], [ zeroinitializer, %if3 ]
  %18 = getelementptr double* %17, i64 %index
  %19 = bitcast double* %18 to <4 x double>*
  %wide.load = load <4 x double>* %19, align 8
  %.sum31 = or i64 %index, 4
  %20 = getelementptr double* %17, i64 %.sum31
  %21 = bitcast double* %20 to <4 x double>*
  %wide.load18 = load <4 x double>* %21, align 8
  %.sum32 = or i64 %index, 8
  %22 = getelementptr double* %17, i64 %.sum32
  %23 = bitcast double* %22 to <4 x double>*
  %wide.load19 = load <4 x double>* %23, align 8
  %.sum33 = or i64 %index, 12
  %24 = getelementptr double* %17, i64 %.sum33
  %25 = bitcast double* %24 to <4 x double>*
  %wide.load20 = load <4 x double>* %25, align 8
  %26 = fadd <4 x double> %vec.phi, %wide.load
  %27 = fadd <4 x double> %vec.phi15, %wide.load18
  %28 = fadd <4 x double> %vec.phi16, %wide.load19
  %29 = fadd <4 x double> %vec.phi17, %wide.load20
  %index.next = add i64 %index, 16
  %30 = icmp eq i64 %index.next, %n.vec
  br i1 %30, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body, %if3
  %resume.val = phi i64 [ 0, %if3 ], [ %n.vec, %vector.body ]
  %rdx.vec.exit.phi = phi <4 x double> [ zeroinitializer, %if3 ], [ %26, %vector.body ]
  %rdx.vec.exit.phi23 = phi <4 x double> [ zeroinitializer, %if3 ], [ %27, %vector.body ]
  %rdx.vec.exit.phi24 = phi <4 x double> [ zeroinitializer, %if3 ], [ %28, %vector.body ]
  %rdx.vec.exit.phi25 = phi <4 x double> [ zeroinitializer, %if3 ], [ %29, %vector.body ]
  %bin.rdx = fadd <4 x double> %rdx.vec.exit.phi23, %rdx.vec.exit.phi
  %bin.rdx26 = fadd <4 x double> %rdx.vec.exit.phi24, %bin.rdx
  %bin.rdx27 = fadd <4 x double> %rdx.vec.exit.phi25, %bin.rdx26
  %rdx.shuf = shufflevector <4 x double> %bin.rdx27, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx28 = fadd <4 x double> %bin.rdx27, %rdx.shuf
  %rdx.shuf29 = shufflevector <4 x double> %bin.rdx28, <4 x double> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx30 = fadd <4 x double> %bin.rdx28, %rdx.shuf29
  %31 = extractelement <4 x double> %bin.rdx30, i32 0
  %cmp.n = icmp eq i64 %14, %resume.val
  br i1 %cmp.n, label %L11, label %L5

L5:                                               ; preds = %L5, %middle.block
  %"##i#8487.0" = phi i64 [ %35, %L5 ], [ %resume.val, %middle.block ]
  %res.1 = phi double [ %34, %L5 ], [ %31, %middle.block ]
  %32 = getelementptr double* %17, i64 %"##i#8487.0"
  %33 = load double* %32, align 8
  %34 = fadd fast double %res.1, %33
  %35 = add i64 %"##i#8487.0", 1
  %exitcond = icmp eq i64 %35, %14
  br i1 %exitcond, label %L11, label %L5

L11:                                              ; preds = %L5, %middle.block, %pass2
  %res.3 = phi double [ 0.000000e+00, %pass2 ], [ %34, %L5 ], [ %31, %middle.block ]
  ret double %res.3
}

johnmyleswhite commented 9 years ago

I wonder if this is something that Base could help us with. I don't really understand why Nullable{Float64} needs any memory allocation.
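For context, the Base definition at the time was roughly a plain immutable wrapping a flag and a value, which is why it seems like a Nullable{Float64} should be stack-allocatable:

# Roughly the Base 0.4 definition (constructors omitted).
immutable Nullable{T}
    isnull::Bool
    value::T
end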

davidagold commented 9 years ago

Indeed, if that's even the reason (I may be misrepresenting Jacob's thought).

quinnj commented 9 years ago

It's not a crazy allocation, but there is a cost to constructing a Nullable and carrying around an extra Bool. The reasoning behind my code pattern above is that there are cases where you'll be working over the values of a NullableArray and have a good "default" value to substitute for nulls.

johnmyleswhite commented 9 years ago

Yeah, I'm on board with this interface. But it still seems really weird that there are any allocations, since I would think you should always be able to keep the current Nullable on the stack if it's wrapping a bitstype.

johnmyleswhite commented 9 years ago

My current take-away: what's wrong is actually that getindex has a branch in it.
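A minimal sketch of what a branch-free getindex could look like for isbits element types, assuming the internal fields are named values and isnull (illustrative only, not the package's actual definition):

# Both Nullables are constructed unconditionally; ifelse then selects between
# them without emitting a conditional jump in the loop body.
branchless_getindex{T}(X::NullableArray{T}, i::Int) =
    ifelse(X.isnull[i], Nullable{T}(), Nullable{T}(X.values[i]))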

johnmyleswhite commented 9 years ago

Given some improvements in this package and a tentative change in Base, we might not need this method for speed.

quinnj commented 9 years ago

+1 to removing the method

johnmyleswhite commented 9 years ago

I'm becoming more in favor of removing it.

davidagold commented 9 years ago

If its only conceivable use is in non-standard indexing methods, then +1. I can't really think of any other use for it ATM.

johnmyleswhite commented 9 years ago

The one conceivable use is making it easy to pull out the mask without even accessing the values array, but that exposes enough implementation details that it seems reasonable to just tell people to work with the internals if they need that kind of detail.
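For anyone who does need that, a minimal sketch of working with the internals directly, assuming the fields are named values and isnull:

mask  = X.isnull        # Vector{Bool}: true where an entry is null
nnull = sum(X.isnull)   # count of nulls without touching X.values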

davidagold commented 9 years ago

Done. Easy enough to revert if we change our minds.