This would be useful for cases where I want to traverse a NullableArray.
Thanks for filing an issue!
What is the difference between what you're requesting and the following method? https://github.com/johnmyleswhite/NullableArrays.jl/blob/master/src/03_primitives.jl#L118
Also, what precisely do you mean by "traverse"? Would an
isnull{T}(X::NullableArray, r::UnitRange{T})
method be relevant?
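For concreteness, that could be something as simple as the sketch below. This is just a sketch built on the existing per-element isnull(X, i); returning a Vector{Bool} is an assumption about the intended semantics, not settled API.

using NullableArrays
import Base: isnull   # add a method to the existing generic

# Hypothetical range-indexed isnull: one Bool per index in the range,
# built on the package's existing per-element isnull(X, i).
isnull{T}(X::NullableArray, r::UnitRange{T}) = Bool[isnull(X, i) for i in r]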
Ah, perfect; yeah, I didn't think to look in the primitives file for this. I'm not sure a range-indexed isnull would really be that useful; the use case I have in mind is being able to SIMD-vectorize over a NullableArray:
using NullableArrays
function vectorized2(n)
    A = NullableArrays.NullableArray(rand(n))
    s = 0.0
    @simd for i = 1:length(A)
        @inbounds s += ifelse(isnull(A,i), 0.0, NullableArrays.unsafe_getvalue_notnull(A,i))
    end
    return s
end
@code_llvm vectorized2(100)
The interface we may want to consider here would be a method definition like
getindex{T}(A::NullableArray{T}, i, x::T) = ifelse(isnull(A,i), x, NullableArrays.unsafe_getvalue_notnull(A,i))
similar to the Base Nullable get(n, x) method.
That does seem like it'd be useful. I'd like to separate indexing methods that don't return a Nullable object from getindex, so I would call that method getvalue instead.
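Something along these lines, say. This is just a sketch (the name and exact signature aren't settled); it mirrors the proposal above and uses the existing internal accessor unsafe_getvalue_notnull that Jacob used.

using NullableArrays

# Hypothetical getvalue: return the stored value at index i, or the
# caller-supplied default x when the entry is null. Mirrors Base.get(n, x).
getvalue{T}(A::NullableArray{T}, i, x::T) =
    ifelse(isnull(A, i), x, NullableArrays.unsafe_getvalue_notnull(A, i))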
+1
Do we know what the perf gains are from exposing these methods relative to returning an immutable? It seems like the gains could be large for non-isbits types, but small for isbits types.
@johnmyleswhite Returning an immutable from where?
My understanding is that Jacob is using this method because @simd doesn't kick in if the indexing method employed in the inner loop returns a Nullable. Are you asking about comparing the performance of such an @simd-enabled loop and a loop employing a different indexing method?
I think I'm confused. I thought the reason that SIMD didn't work was because of branching, but I would think you could work around that with some unsafe indexing into a scalar Nullable. Is that not right?
In some preliminary testing on Wednesday, we couldn't get SIMD to work when returning a Nullable. Using my code above, however, allows the vectorization.
For what it's worth:
using NullableArrays
A = rand(Float64, 5_000)
X = NullableArray(A)
function f(X::NullableArray)
    res = 0.0
    for i in 1:length(X)
        res += X[i].value
    end
    return res
end
@code_llvm(f(X))
returns
define double @julia_f_22418(%jl_value_t*) {
top:
%1 = alloca [3 x %jl_value_t*], align 8
%.sub = getelementptr inbounds [3 x %jl_value_t*]* %1, i64 0, i64 0
%2 = getelementptr [3 x %jl_value_t*]* %1, i64 0, i64 2
store %jl_value_t* inttoptr (i64 2 to %jl_value_t*), %jl_value_t** %.sub, align 8
%3 = load %jl_value_t*** @jl_pgcstack, align 8
%4 = getelementptr [3 x %jl_value_t*]* %1, i64 0, i64 1
%.c = bitcast %jl_value_t** %3 to %jl_value_t*
store %jl_value_t* %.c, %jl_value_t** %4, align 8
store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
store %jl_value_t* null, %jl_value_t** %2, align 8
%5 = getelementptr inbounds %jl_value_t* %0, i64 0, i32 0
%6 = load %jl_value_t** %5, align 8
%7 = getelementptr inbounds %jl_value_t* %6, i64 1
%8 = bitcast %jl_value_t* %7 to i64*
%9 = load i64* %8, align 8
%10 = icmp sgt i64 %9, 0
%11 = select i1 %10, i64 %9, i64 0
%12 = icmp eq i64 %11, 0
br i1 %12, label %L11, label %L.preheader
L.preheader: ; preds = %top
%13 = getelementptr inbounds %jl_value_t* %0, i64 1, i32 0
%14 = load %jl_value_t** %13, align 8
%15 = getelementptr inbounds %jl_value_t* %14, i64 1
%16 = bitcast %jl_value_t* %15 to i64*
%17 = load i64* %16, align 8
%18 = bitcast %jl_value_t* %14 to i8**
br label %L
L: ; preds = %L8, %L.preheader
%"#s16.0" = phi i64 [ %39, %L8 ], [ 1, %L.preheader ]
%res.0 = phi double [ %40, %L8 ], [ 0.000000e+00, %L.preheader ]
%19 = add i64 %"#s16.0", -1
%20 = icmp ult i64 %19, %17
br i1 %20, label %idxend, label %oob
oob: ; preds = %L
%21 = alloca i64, align 8
store i64 %"#s16.0", i64* %21, align 8
call void @jl_bounds_error_ints(%jl_value_t* %14, i64* %21, i64 1)
unreachable
idxend: ; preds = %L
%22 = load i8** %18, align 8
%23 = getelementptr i8* %22, i64 %19
%24 = load i8* %23, align 1
%25 = and i8 %24, 1
%26 = icmp eq i8 %25, 0
br i1 %26, label %L3, label %L8
L3: ; preds = %idxend
%27 = load %jl_value_t** %5, align 8
%28 = getelementptr inbounds %jl_value_t* %27, i64 1
%29 = bitcast %jl_value_t* %28 to i64*
%30 = load i64* %29, align 8
%31 = icmp ult i64 %19, %30
br i1 %31, label %idxend5, label %oob4
oob4: ; preds = %L3
%32 = alloca i64, align 8
store i64 %"#s16.0", i64* %32, align 8
call void @jl_bounds_error_ints(%jl_value_t* %27, i64* %32, i64 1)
unreachable
idxend5: ; preds = %L3
%33 = bitcast %jl_value_t* %27 to i8**
%34 = load i8** %33, align 8
%35 = bitcast i8* %34 to double*
%36 = getelementptr double* %35, i64 %19
%37 = load double* %36, align 8
%38 = insertvalue %Nullable.9 { i8 0, double undef }, double %37, 1
br label %L8
L8: ; preds = %idxend5, %idxend
%storemerge = phi %Nullable.9 [ %38, %idxend5 ], [ { i8 1, double undef }, %idxend ]
%39 = add i64 %"#s16.0", 1
%storemerge13 = extractvalue %Nullable.9 %storemerge, 1
%40 = fadd double %res.0, %storemerge13
%41 = icmp eq i64 %"#s16.0", %11
br i1 %41, label %L11, label %L
L11: ; preds = %L8, %top
%res.1 = phi double [ 0.000000e+00, %top ], [ %40, %L8 ]
%42 = load %jl_value_t** %4, align 8
%43 = getelementptr inbounds %jl_value_t* %42, i64 0, i32 0
store %jl_value_t** %43, %jl_value_t*** @jl_pgcstack, align 8
ret double %res.1
}
Jacob speculated that it was the need to allocate when creating the Nullable object to return from getindex that precludes vectorization.
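One way to sanity-check the allocation theory would be something along these lines. This is only a rough sketch; the exact figures depend on the Julia version, and the helper name sum_getindex is made up for illustration.

using NullableArrays

# Does returning Nullable{Float64} from getindex actually allocate?
function sum_getindex(X::NullableArray{Float64})
    s = 0.0
    for i in 1:length(X)
        s += get(X[i], 0.0)   # Base.get(n::Nullable, default)
    end
    return s
end

X = NullableArray(rand(1_000))
sum_getindex(X)               # compile first
@allocated sum_getindex(X)    # a near-zero result would point away from allocation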
Oh wait, I didn't call @simd. That could also do it.
julia> function f(X::NullableArray)
           res = 0.0
           @simd for i in 1:length(X)
               @inbounds res += X[i].value
           end
           return res
       end
f (generic function with 1 method)
julia> @code_llvm(f(X))
define double @julia_f_22419(%jl_value_t*) {
L:
%1 = alloca [3 x %jl_value_t*], align 8
%.sub = getelementptr inbounds [3 x %jl_value_t*]* %1, i64 0, i64 0
%2 = getelementptr [3 x %jl_value_t*]* %1, i64 0, i64 2
store %jl_value_t* inttoptr (i64 2 to %jl_value_t*), %jl_value_t** %.sub, align 8
%3 = load %jl_value_t*** @jl_pgcstack, align 8
%4 = getelementptr [3 x %jl_value_t*]* %1, i64 0, i64 1
%.c = bitcast %jl_value_t** %3 to %jl_value_t*
store %jl_value_t* %.c, %jl_value_t** %4, align 8
store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
store %jl_value_t* null, %jl_value_t** %2, align 8
%5 = getelementptr inbounds %jl_value_t* %0, i64 0, i32 0
%6 = load %jl_value_t** %5, align 8
%7 = getelementptr inbounds %jl_value_t* %6, i64 1
%8 = bitcast %jl_value_t* %7 to i64*
%9 = load i64* %8, align 8
%10 = icmp sgt i64 %9, 0
%11 = select i1 %10, i64 %9, i64 0
%12 = getelementptr inbounds %jl_value_t* %0, i64 1, i32 0
%13 = call { i64, i1 } @llvm.ssub.with.overflow.i64(i64 %11, i64 1)
%14 = extractvalue { i64, i1 } %13, 1
br i1 %14, label %fail, label %pass
fail: ; preds = %L
%15 = load %jl_value_t** @jl_overflow_exception, align 8
call void @jl_throw_with_superfluous_argument(%jl_value_t* %15, i32 67)
unreachable
pass: ; preds = %L
%16 = extractvalue { i64, i1 } %13, 0
%17 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %16, i64 1)
%18 = extractvalue { i64, i1 } %17, 1
br i1 %18, label %fail1, label %pass2
fail1: ; preds = %pass
%19 = load %jl_value_t** @jl_overflow_exception, align 8
call void @jl_throw_with_superfluous_argument(%jl_value_t* %19, i32 67)
unreachable
pass2: ; preds = %pass
%20 = extractvalue { i64, i1 } %17, 0
%21 = icmp slt i64 %20, 1
br i1 %21, label %L17, label %if3
if3: ; preds = %pass2
%22 = load %jl_value_t** %12, align 8
%23 = bitcast %jl_value_t* %22 to i8**
%24 = load i8** %23, align 8
br label %L5
L5: ; preds = %L11, %if3
%"##i#10662.0" = phi i64 [ 0, %if3 ], [ %37, %L11 ]
%res.1 = phi double [ 0.000000e+00, %if3 ], [ %36, %L11 ]
%25 = getelementptr i8* %24, i64 %"##i#10662.0"
%26 = load i8* %25, align 1
%27 = and i8 %26, 1
%28 = icmp eq i8 %27, 0
br i1 %28, label %L8, label %L11
L8: ; preds = %L5
%29 = load %jl_value_t** %5, align 8
%30 = bitcast %jl_value_t* %29 to i8**
%31 = load i8** %30, align 8
%32 = bitcast i8* %31 to double*
%33 = getelementptr double* %32, i64 %"##i#10662.0"
%34 = load double* %33, align 8
%35 = insertvalue %Nullable.9 { i8 0, double undef }, double %34, 1
br label %L11
L11: ; preds = %L8, %L5
%storemerge = phi %Nullable.9 [ %35, %L8 ], [ { i8 1, double undef }, %L5 ]
%storemerge19 = extractvalue %Nullable.9 %storemerge, 1
%36 = fadd fast double %res.1, %storemerge19
%37 = add i64 %"##i#10662.0", 1
%exitcond = icmp eq i64 %37, %20
br i1 %exitcond, label %L17, label %L5
L17: ; preds = %L11, %pass2
%res.3 = phi double [ 0.000000e+00, %pass2 ], [ %36, %L11 ]
%38 = load %jl_value_t** %4, align 8
%39 = getelementptr inbounds %jl_value_t* %38, i64 0, i32 0
store %jl_value_t** %39, %jl_value_t*** @jl_pgcstack, align 8
ret double %res.3
}
still doesn't vectorize (right?).
julia> X = rand(Float64, 5_000)
5000-element Array{Float64,1}:
0.219656
0.841695
0.521038
0.0601669
0.542906
0.136611
0.63456
0.0106102
0.149608
0.460036
⋮
0.804701
0.804945
0.58259
0.475501
0.708776
0.836339
0.374256
0.298452
0.230938
julia> function f(X::Array)
           res = 0.0
           @simd for i in 1:length(X)
               @inbounds res += X[i]
           end
           return res
       end
f (generic function with 1 method)
julia> @code_llvm(f(X))
define double @julia_f_21563(%jl_value_t*) {
L:
%1 = getelementptr inbounds %jl_value_t* %0, i64 1
%2 = bitcast %jl_value_t* %1 to i64*
%3 = load i64* %2, align 8
%4 = icmp sgt i64 %3, 0
%5 = select i1 %4, i64 %3, i64 0
%6 = bitcast %jl_value_t* %0 to i8**
%7 = call { i64, i1 } @llvm.ssub.with.overflow.i64(i64 %5, i64 1)
%8 = extractvalue { i64, i1 } %7, 1
br i1 %8, label %fail, label %pass
fail: ; preds = %L
%9 = load %jl_value_t** @jl_overflow_exception, align 8
call void @jl_throw_with_superfluous_argument(%jl_value_t* %9, i32 67)
unreachable
pass: ; preds = %L
%10 = extractvalue { i64, i1 } %7, 0
%11 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %10, i64 1)
%12 = extractvalue { i64, i1 } %11, 1
br i1 %12, label %fail1, label %pass2
fail1: ; preds = %pass
%13 = load %jl_value_t** @jl_overflow_exception, align 8
call void @jl_throw_with_superfluous_argument(%jl_value_t* %13, i32 67)
unreachable
pass2: ; preds = %pass
%14 = extractvalue { i64, i1 } %11, 0
%15 = icmp slt i64 %14, 1
br i1 %15, label %L11, label %if3
if3: ; preds = %pass2
%16 = load i8** %6, align 8
%17 = bitcast i8* %16 to double*
%n.vec = and i64 %14, -16
%cmp.zero = icmp eq i64 %n.vec, 0
br i1 %cmp.zero, label %middle.block, label %vector.body
vector.body: ; preds = %vector.body, %if3
%index = phi i64 [ %index.next, %vector.body ], [ 0, %if3 ]
%vec.phi = phi <4 x double> [ %26, %vector.body ], [ zeroinitializer, %if3 ]
%vec.phi15 = phi <4 x double> [ %27, %vector.body ], [ zeroinitializer, %if3 ]
%vec.phi16 = phi <4 x double> [ %28, %vector.body ], [ zeroinitializer, %if3 ]
%vec.phi17 = phi <4 x double> [ %29, %vector.body ], [ zeroinitializer, %if3 ]
%18 = getelementptr double* %17, i64 %index
%19 = bitcast double* %18 to <4 x double>*
%wide.load = load <4 x double>* %19, align 8
%.sum31 = or i64 %index, 4
%20 = getelementptr double* %17, i64 %.sum31
%21 = bitcast double* %20 to <4 x double>*
%wide.load18 = load <4 x double>* %21, align 8
%.sum32 = or i64 %index, 8
%22 = getelementptr double* %17, i64 %.sum32
%23 = bitcast double* %22 to <4 x double>*
%wide.load19 = load <4 x double>* %23, align 8
%.sum33 = or i64 %index, 12
%24 = getelementptr double* %17, i64 %.sum33
%25 = bitcast double* %24 to <4 x double>*
%wide.load20 = load <4 x double>* %25, align 8
%26 = fadd <4 x double> %vec.phi, %wide.load
%27 = fadd <4 x double> %vec.phi15, %wide.load18
%28 = fadd <4 x double> %vec.phi16, %wide.load19
%29 = fadd <4 x double> %vec.phi17, %wide.load20
%index.next = add i64 %index, 16
%30 = icmp eq i64 %index.next, %n.vec
br i1 %30, label %middle.block, label %vector.body
middle.block: ; preds = %vector.body, %if3
%resume.val = phi i64 [ 0, %if3 ], [ %n.vec, %vector.body ]
%rdx.vec.exit.phi = phi <4 x double> [ zeroinitializer, %if3 ], [ %26, %vector.body ]
%rdx.vec.exit.phi23 = phi <4 x double> [ zeroinitializer, %if3 ], [ %27, %vector.body ]
%rdx.vec.exit.phi24 = phi <4 x double> [ zeroinitializer, %if3 ], [ %28, %vector.body ]
%rdx.vec.exit.phi25 = phi <4 x double> [ zeroinitializer, %if3 ], [ %29, %vector.body ]
%bin.rdx = fadd <4 x double> %rdx.vec.exit.phi23, %rdx.vec.exit.phi
%bin.rdx26 = fadd <4 x double> %rdx.vec.exit.phi24, %bin.rdx
%bin.rdx27 = fadd <4 x double> %rdx.vec.exit.phi25, %bin.rdx26
%rdx.shuf = shufflevector <4 x double> %bin.rdx27, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
%bin.rdx28 = fadd <4 x double> %bin.rdx27, %rdx.shuf
%rdx.shuf29 = shufflevector <4 x double> %bin.rdx28, <4 x double> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
%bin.rdx30 = fadd <4 x double> %bin.rdx28, %rdx.shuf29
%31 = extractelement <4 x double> %bin.rdx30, i32 0
%cmp.n = icmp eq i64 %14, %resume.val
br i1 %cmp.n, label %L11, label %L5
L5: ; preds = %L5, %middle.block
%"##i#8487.0" = phi i64 [ %35, %L5 ], [ %resume.val, %middle.block ]
%res.1 = phi double [ %34, %L5 ], [ %31, %middle.block ]
%32 = getelementptr double* %17, i64 %"##i#8487.0"
%33 = load double* %32, align 8
%34 = fadd fast double %res.1, %33
%35 = add i64 %"##i#8487.0", 1
%exitcond = icmp eq i64 %35, %14
br i1 %exitcond, label %L11, label %L5
L11: ; preds = %L5, %middle.block, %pass2
%res.3 = phi double [ 0.000000e+00, %pass2 ], [ %34, %L5 ], [ %31, %middle.block ]
ret double %res.3
}
I wonder if this is something that Base could help us with. I don't really understand why Nullable{Float64} needs any memory allocation.
Indeed, if that's even the reason (I may be misrepresenting Jacob's thought).
It's not a crazy allocation, but there is a cost to constructing a Nullable type and grabbing an extra Bool. The reasoning behind my code pattern above is that it covers cases where you'll be working over the values of a NullableArray and have a good "default" value for the null entries.
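For instance, something along these lines. It's a sketch built on the package's existing per-element methods; the name sum_default and the choice of 0.0 as the default are just for illustration.

using NullableArrays

# The "default value" pattern: a sum that substitutes `default` for nulls,
# written with the branch-free ifelse form so @simd has a chance to kick in.
function sum_default(A::NullableArray{Float64}, default::Float64)
    s = 0.0
    @simd for i in 1:length(A)
        @inbounds s += ifelse(isnull(A, i), default,
                              NullableArrays.unsafe_getvalue_notnull(A, i))
    end
    return s
end

sum_default(NullableArray(rand(100)), 0.0)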
Yeah, I'm on board with this interface. But it still seems really weird that there are any allocations, since I would think you should always be able to keep the current Nullable on the stack if it's wrapping a bitstype.
My current take-away: what's wrong is actually that getindex has a branch in it.
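Roughly speaking, the per-element getindex has the shape below (this is a paraphrase, not the package's actual source, and the field names isnull and values are assumptions about the internal layout). That data-dependent branch on the null mask is what shows up as the br i1 splitting the loop body in the LLVM output above.

using NullableArrays

# Paraphrased sketch of a branchy per-element getindex; the ?: on the
# null mask is the branch in question.
function getindex_sketch{T}(X::NullableArray{T}, i::Int)
    return X.isnull[i] ? Nullable{T}() : Nullable{T}(X.values[i])
end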
Given some improvements in this package and a tentative change in Base, we might not need this method for speed.
+1 to removing the method
I'm becoming more in favor of removing it.
If its only conceivable use is in non-standard indexing methods, then +1. I can't really think of any other use for it ATM.
The one conceivable use is making it easy to pull out the mask without even accessing the values array, but that exposes enough implementation details that it seems reasonable to just tell people to work with the internals if they need that kind of detail.
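That is, anyone who really needs the mask can reach into the internals directly, e.g. the sketch below (assuming the field is called isnull, which is an implementation detail that could change).

using NullableArrays

# Grab the null mask without touching the values array.
X = NullableArray([1.0, 2.0, 3.0], [false, true, false])
mask = X.isnull          # Vector{Bool} of null flags
nnulls = sum(mask)       # number of null entries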
Done. Easy enough to revert if we change our minds.