I think all that's needed is to add a trapping flag (which doesn't exist yet, but should be possible to add) to indicate that the div can't be folded away unless the second argument is proven to be nonzero. The constant-folding passes would of course have to be updated. I think that approach would have an easier time going through upstream.
The trapping flag approach sounds right.
The flag part tells you half of it; I'm describing what the flag would do.
IIRC, what we determined at the LLVM meetup back in July was that you need a flag to tell LLVM that it is not free to optimize away integer division exceptions unless it can prove that they won't happen. In other words, you want a div instruction such that division by zero is defined and always results in an integer division exception.
Another problem is that div is just slow. Agner Fog's instruction tables say there is a latency of 40-103 cycles and a reciprocal throughput of 25-84 cycles for 64-bit IDIV on Sandy Bridge.
I tried implementing integer division of arrays by a scalar using this cute trick for division by multiplication and bit shifting, which is also what LLVM does when it encounters integer division by a constant. Code here. It is indeed much faster, even though we are slightly handicapped because we also emit an extra test when performing bit shifts. The test results there are for a large array, but the cutoff where it is faster to compute and apply the magic numbers than it is to use the division intrinsics is around 32 elements with Int64/Uint64. This trick could presumably also be used for faster linear indexing of array views.
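To make this concrete, here is a minimal, self-contained sketch of the unsigned "round-up" scheme (this is not the gist's code, and MagicU64/divmagic are made-up names): for a divisor d, precompute the 65-bit magic m = ceil(2^(64+l)/d) with l = ceil(log2(d)), store only its low 64 bits, and then divide with one high multiply, one overflow-avoiding add, and a shift.

struct MagicU64
    mlo::UInt64   # low 64 bits of the 65-bit magic m = ceil(2^(64+l)/d)
    l::Int        # l = ceil(log2(d)); the full shift amount is 64 + l
end

function MagicU64(d::UInt64)
    d == 0 && throw(DivideError())
    l = 64 - leading_zeros(d - one(d))            # ceil(log2(d)); 0 when d == 1
    m = div(typemax(UInt128) >> (64 - l), d) + 1  # ceil(2^(64+l)/d), a 65-bit value
    return MagicU64(m % UInt64, l)                # drop the implicit leading bit
end

function divmagic(n::UInt64, md::MagicU64)
    md.l == 0 && return n                         # dividing by 1
    t = ((UInt128(n) * md.mlo) >> 64) % UInt64    # mulhi(n, low 64 bits of magic)
    # floor((n + t) / 2^l), with the add split up so it cannot overflow 64 bits:
    return (((n - t) >> 1) + t) >> (md.l - 1)
end

For example, divmagic(UInt64(123_456_789), MagicU64(UInt64(1000))) == UInt64(123_456). The last line of divmagic is one way of working around the fact that the true magic number has 65 bits, which is the same issue the distributive and saturating-add variants discussed below address.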
I updated the gist with numbers for using the LLVM sdiv/udiv intrinsics directly and also fixed a bug in unsigned division. There is a cost to the checks of 15-25% in this benchmark, but part of the reason may be that the check appears to prevent both vectorization and hoisting loads of the array pointers. Still, the magic numbers are 2-3x faster.
That's really cool, @simonster! It's amazing how many lines of code can run faster than a single intrinsic.
Less surprising when it's the x86_64 CHKEML instruction that checks your email or the VCHKEML vector instruction which checks multiple email accounts at once.
@simonster, have you seen this on the front page of HN today:
http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html
Is this the same as the algorithm you're using?
I submitted that :smile:, so I've seen it, but I haven't implemented the modification yet. Should give a slight boost for the unsigned case.
Fish's method is faster, but not by much and only if we don't care about being able to divide typemax(T). It looks like LLVM can vectorize the classic "round-up" approach for Uint64, but not the distributive approach nor the saturating add approach from Fish's post. Non-saturating add is possible but would give the wrong results for typemax.
This is really cool but won't this eventually work its way into LLVM? It seems like something that should be handled by the LLVM compiler and not in Julia.
In the near term, we could see if Julia knows that the denominator is a constant != 0, and skip emitting the check.
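For example (a rough illustration with made-up function names; exact results depend on the Julia/LLVM version): LLVM can already fold the checks when the divisor is a literal, and the suggestion here would extend that to any denominator Julia can prove is a nonzero constant.

# Compare with @code_llvm / @code_native: the literal-divisor version should
# lose the DivideError branches (and the sdiv itself should become a
# multiply-and-shift), while the runtime-divisor version keeps the checks.
div_by_7(x::Int) = x ÷ 7
div_by_n(x::Int, n::Int) = x ÷ n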
@jakebolewski, there are situations where you want to precompute magic numbers but you can't do code gen. One example is multidimensional index calculations – you'd want to store a bunch of these magic numbers in a data structure.
Although in those cases it might be good to have as uniform an algorithm as possible at the expense of generality. You almost never have to divide by a very large number with arrays and can afford to sacrifice some bits.
@jakebolewski LLVM already performs this optimization if the divisor is known at compile time. To perform the optimization for loops with division by a loop-invariant divisor that is not known at compile time, it would have to know that the optimization is profitable (for a loop with unknown trip count, this would require metadata or a branch). This doesn't seem completely unreasonable, since LLVM already has a lot of loop metadata for controlling the vectorizer. But the magic number generating functions need to live somewhere, and I'm not sure how that's possible without getting them standardized as part of C22 or whatever version is next. This strategy also isn't really optimal for linear indexing of array views since we'd still be recomputing the magic numbers every time we iterate over the view, instead of computing them once at view construction (and maybe using an LRU cache).
@StefanKarpinski Very true. We could do with 48 bits since at present that's the largest array size possible on extant x86-64 systems. I need to think more about whether that allows any optimizations besides avoiding the saturating increment with Fish's approach.
My impression is that you can uniformly use the round-up method and the shift amount k can be fixed. IIRC, that means that the only thing you need to remember is the magic multiplier.
@simonster, being able to precompute the magic numbers, and then use them multiple times, would be pretty amazing. That should basically wipe out the overhead of linear indexing, no?
Would we need an LRU cache? Or if we use @inline will the compiler hoist them for us?
I'm not sure it would wipe out the overhead of linear indexing vs. Cartesian indexing. For Cartesian iteration over a matrix we need:
firstidx + (i1-1)*stride1 + (i2-1)*stride2
I think this can be implemented as written with 2 decs, 2 muls, and 2 adds, although a sufficiently smart compiler can leave only an add in the innermost loop. If @StefanKarpinski is correct we can implement linear indexing as:
x = mulhi((i - 1), magic) >> k
firstidx + ((i - 1) - x*sz1)*stride1 + x*stride2
which is 1 dec, 1 single-operand mul, 1 shr, 3 ordinary muls, 1 sub, 2 adds, and maybe some movs. That's not many more instructions, but I think the compiler can only remove the dec from the loop. I think this will still be fewer cycles than an idiv alone, but I don't think it eliminates the overhead of linear indexing entirely. And the performance cost of linear indexing relative to Cartesian iteration will increase with the number of dimensions of the array.
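Spelled out in Julia for concreteness (reusing the illustrative MagicU64/divmagic sketch from earlier in the thread; none of these names are real APIs), the linear-index computation above would look roughly like:

# x = (i - 1) ÷ sz1 via mulhi + shift, then the two strided offsets.
function linear_offset(firstidx::Int, i::Int, md::MagicU64, sz1::Int,
                       stride1::Int, stride2::Int)
    n = (i - 1) % UInt64            # 0-based linear index
    x = divmagic(n, md)             # magic-number division by sz1
    return firstidx + (Int(n) - Int(x)*sz1)*stride1 + Int(x)*stride2
end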
I guess we need to decide if this is worthwhile at all given that caveat. I think it's worth seeing how much overhead there is in practice, but in the long term we will probably still want to push people toward Cartesian indexing (or maybe map if we make it fast).
I was thinking that we'd use an LRU cache for array view construction to avoid having to recompute the magic number when multiple views of the same size are constructed in succession, since you might be doing something like:
for i = 1:size(X, 3)
    f(view(X, a:b, a:b, i))
end
The indexing operations themselves can certainly be inlined, but the magic number computation is too much code for that to be reasonable.
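A tiny sketch of that caching idea (purely illustrative, reusing the hypothetical MagicU64 from above; a real version would bound the cache size):

# Memoize magic numbers by divisor so repeated view construction with the
# same sizes only pays for the computation once.
const MAGIC_CACHE = Dict{UInt64,MagicU64}()

cached_magic(d::UInt64) = get!(() -> MagicU64(d), MAGIC_CACHE, d)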
I read through this issue, but it covers a wide range of topics, so I'm not sure what the focus is.
Does this also cover the potential optimization of eliding the divide-by-zero check in cases where the compiler can prove the denominator isn't zero, such as in the simple function below? Or does it only cover cases where the denominator is known at compile time?
Here's the small example where I as a human can tell that y will never be 0:
julia> function safe_div(x::Int, y::Int)
           if y === 0
               return 0
           end
           return x ÷ y
       end
safe_div (generic function with 1 method)
julia> @code_llvm safe_div(4, 0)
; @ REPL[43]:1 within `safe_div'
define i64 @julia_safe_div_561(i64 signext %0, i64 signext %1) {
top:
; @ REPL[43]:2 within `safe_div'
%.not = icmp eq i64 %1, 0
br i1 %.not, label %L3, label %L4
L3: ; preds = %top
; @ REPL[43]:3 within `safe_div'
ret i64 0
L4: ; preds = %top
; @ REPL[43]:5 within `safe_div'
; ┌ @ int.jl:261 within `div'
%2 = icmp ne i64 %1, -1
%3 = icmp ne i64 %0, -9223372036854775808
%4 = or i1 %3, %2
br i1 %4, label %pass, label %fail
fail: ; preds = %L4
call void @jl_throw({}* inttoptr (i64 4570615184 to {}*))
unreachable
pass: ; preds = %L4
%5 = sdiv i64 %0, %1
; └
ret i64 %5
}
You can see that it's retained the check in L4 and the throw DivideError code in the fail: block.
My guess is that this is all very obsolete and should be closed. @vtjnash, can you confirm?
Probably should be closed.
@NHDaly The LLVM optimization is peculiar: it removed the null check but replaced it with true and not false, and thus didn't fold the rest...
julia> @code_llvm optimize=false safe_div(0, 1)
; @ REPL[4]:1 within `safe_div`
define i64 @julia_safe_div_323(i64 signext %0, i64 signext %1) #0 {
top:
%2 = call {}*** @julia.get_pgcstack()
%3 = bitcast {}*** %2 to {}**
%current_task = getelementptr inbounds {}*, {}** %3, i64 2305843009213693940
%4 = bitcast {}** %current_task to i64*
%world_age = getelementptr inbounds i64, i64* %4, i64 13
; @ REPL[4]:2 within `safe_div`
%5 = icmp eq i64 %1, 0
%6 = zext i1 %5 to i8
%7 = trunc i8 %6 to i1
%8 = xor i1 %7, true
br i1 %8, label %L4, label %L3
L3: ; preds = %top
; @ REPL[4]:3 within `safe_div`
ret i64 0
L4: ; preds = %top
; @ REPL[4]:5 within `safe_div`
; ┌ @ int.jl:278 within `div`
%9 = icmp ne i64 %0, -9223372036854775808
%10 = icmp ne i64 %1, -1
%11 = or i1 %10, %9
%12 = icmp ne i64 %1, 0
%13 = and i1 %12, %11
br i1 %13, label %pass, label %fail
fail: ; preds = %L4
call void @jl_throw({}* inttoptr (i64 140192384492224 to {}*))
unreachable
pass: ; preds = %L4
%14 = sdiv i64 %0, %1
; └
ret i64 %14
}
In the optimized version, %11 (the or of the two overflow checks) is the branch input, and the:
%12 = icmp ne i64 %1, 0
%13 = and i1 %12, %11
pair is gone.
Ah silly me this is:
julia> div(typemin(Int), -1)
ERROR: DivideError: integer division error
Stacktrace:
[1] div(x::Int64, y::Int64)
@ Base ./int.jl:278
[2] top-level scope
@ REPL[1]:1
Oh, very interesting! thanks @oscardssmith and @vchuravy.
🤔 even if I had a manual check for that case too, it doesn't compile away, and I end up with a redundant check:
julia> function safe_div(x,y)
           if y === 0 || (x === typemin(Int) && y === -1)
               return 0
           end
           return x ÷ y
       end
safe_div (generic function with 1 method)
julia> @code_llvm safe_div(4, 0)
; @ REPL[53]:1 within `safe_div'
define i64 @julia_safe_div_580(i64 signext %0, i64 signext %1) {
top:
; @ REPL[53]:2 within `safe_div'
%2 = icmp eq i64 %1, 0
br i1 %2, label %L12, label %L10
L10: ; preds = %top
%3 = icmp eq i64 %0, -9223372036854775808
%4 = icmp eq i64 %1, -1
%value_phi1 = and i1 %3, %4
br i1 %value_phi1, label %L12, label %L13
L12: ; preds = %L10, %top
; @ REPL[53]:3 within `safe_div'
ret i64 0
L13: ; preds = %L10
; @ REPL[53]:5 within `safe_div'
; ┌ @ int.jl:261 within `div'
%5 = icmp ne i64 %1, -1
%6 = icmp ne i64 %0, -9223372036854775808
%7 = or i1 %6, %5
br i1 %7, label %pass, label %fail
fail: ; preds = %L13
call void @jl_throw({}* inttoptr (i64 4570615184 to {}*))
unreachable
pass: ; preds = %L13
%8 = sdiv i64 %0, %1
; └
ret i64 %8
}
Notice how L10 and L13 are basically the same, but L13 was injected by LLVM?
Some null check elision may be having trouble with our monotonic annotations, so we may need to convince LLVM it is still a valid optimization.
It does quite well on unsigned numbers:
julia> function safe_div(x, y)
           if iszero(y)
               return zero(x)
           end
           return x ÷ y
       end
safe_div (generic function with 2 methods)
julia> @code_llvm optimize=true safe_div(UInt(1), UInt(0))
; @ REPL[10]:1 within `safe_div`
define i64 @julia_safe_div_1146(i64 zeroext %0, i64 zeroext %1) #0 {
top:
; @ REPL[10]:2 within `safe_div`
; ┌ @ number.jl:42 within `iszero`
; │┌ @ promotion.jl:429 within `==`
%.not = icmp eq i64 %1, 0
; └└
br i1 %.not, label %L3, label %pass
L3: ; preds = %top
; @ REPL[10]:3 within `safe_div`
ret i64 0
pass: ; preds = %top
; @ REPL[10]:5 within `safe_div`
; ┌ @ int.jl:280 within `div`
%2 = udiv i64 %0, %1
; └
ret i64 %2
}
julia> function safe_div(x, y)
           if iszero(y)
               x, y = zero(x), one(y)
           end
           return x ÷ y
       end
safe_div (generic function with 2 methods)
julia> @code_llvm optimize=true safe_div(UInt(1), UInt(0))
; @ REPL[14]:1 within `safe_div`
define i64 @julia_safe_div_1150(i64 zeroext %0, i64 zeroext %1) #0 {
top:
; @ REPL[14]:2 within `safe_div`
; ┌ @ number.jl:42 within `iszero`
; │┌ @ promotion.jl:429 within `==`
%.not = icmp eq i64 %1, 0
; └└
%.2 = select i1 %.not, i64 1, i64 %1
%. = select i1 %.not, i64 0, i64 %0
; @ REPL[14]:5 within `safe_div`
; ┌ @ int.jl:280 within `div`
%2 = udiv i64 %., %.2
; └
ret i64 %2
}
But I am not sure LLVM can currently track multiple properties of a number, to see that the compare is dead.
function safe_div(x,y)
    if y === 0 || !(x !== typemin(Int) || y !== -1)
        return 0
    end
    return x ÷ y
end
This does optimize, but I needed to precisely match the form. Is this something that you actually care about?
If so, we can look into teaching LLVM that optimization (or just use Core.Intrinsics.sdiv_int in safe_div).
These also work for me:
julia> function safe_div(x, y)
           if iszero(y) || (x == typemin(x) && y == -1)
               x, y = zero(x), one(y)
           end
           return x ÷ y
       end
safe_div (generic function with 1 method)
julia> @code_llvm optimize=true safe_div(Int(1), Int(0))
; @ REPL[3]:1 within `safe_div`
define i64 @julia_safe_div_1008(i64 signext %0, i64 signext %1) #0 {
top:
; @ REPL[3]:2 within `safe_div`
; ┌ @ number.jl:42 within `iszero`
; │┌ @ promotion.jl:429 within `==`
%.not = icmp eq i64 %1, 0
; └└
br i1 %.not, label %pass, label %L4
L4: ; preds = %top
; ┌ @ promotion.jl:429 within `==`
%.not3 = icmp ne i64 %0, -9223372036854775808
; └
%2 = icmp ne i64 %1, -1
%3 = or i1 %.not3, %2
%spec.select = select i1 %3, i64 %1, i64 1
%spec.select8 = select i1 %3, i64 %0, i64 0
br label %pass
pass: ; preds = %L4, %top
%value_phi17 = phi i64 [ 1, %top ], [ %spec.select, %L4 ]
%value_phi6 = phi i64 [ 0, %top ], [ %spec.select8, %L4 ]
; @ REPL[3]:5 within `safe_div`
; ┌ @ int.jl:278 within `div`
%4 = sdiv i64 %value_phi6, %value_phi17
; └
ret i64 %4
}
julia> function safe_div(x, y)
           if iszero(y) || (x == typemin(x) && y == -1)
               return zero(x)
           end
           return x ÷ y
       end
safe_div (generic function with 1 method)
julia> @code_llvm optimize=true safe_div(Int(1), Int(0))
; @ REPL[5]:1 within `safe_div`
define i64 @julia_safe_div_1038(i64 signext %0, i64 signext %1) #0 {
top:
; @ REPL[5]:2 within `safe_div`
; ┌ @ number.jl:42 within `iszero`
; │┌ @ promotion.jl:429 within `==`
%.not = icmp eq i64 %1, 0
; └└
br i1 %.not, label %L14, label %L4
L4: ; preds = %top
; ┌ @ promotion.jl:429 within `==`
%.not1 = icmp ne i64 %0, -9223372036854775808
; └
%2 = icmp ne i64 %1, -1
%3 = or i1 %.not1, %2
br i1 %3, label %pass, label %L14
L14: ; preds = %L4, %top
; @ REPL[5]:3 within `safe_div`
ret i64 0
pass: ; preds = %L4
; @ REPL[5]:5 within `safe_div`
; ┌ @ int.jl:278 within `div`
%4 = sdiv i64 %0, %1
; └
ret i64 %4
}
oh, I see, LLVM v12 (aka Julia master) seems to optimize this, but LLVM v11 did not.
This may be a good change to make for ranges.jl, since it would be beneficial for LLVM to be able to prove that computing length is error-free.
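As a sketch of what that could look like (not the actual ranges.jl code, just the safe_div pattern applied to a length computation): if both failure cases of the checked division are excluded up front, the ÷ below can never throw, so on LLVM 12 the redundant check should fold away and the computation is provably error-free.

function range_length_sketch(start::Int, stop::Int, step::Int)
    d = stop - start
    # what to return for step == 0 doesn't matter for the point being made here
    (step == 0 || (d == typemin(Int) && step == -1)) && return 0
    return d ÷ step + 1
end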
Oh wow, thanks everyone! Very interesting.
Also, 🤔 I think looking at LLVM might not be the right thing here, since it seems that there's (another?) pass that occurs during native codegen, where sdiv is expanded into a bounds check?:
julia> function safe_div(x,y)
           return Core.Intrinsics.sdiv_int(x,y)
       end
safe_div (generic function with 1 method)
julia> @code_llvm safe_div(4, 0)
; @ REPL[9]:1 within `safe_div`
define i64 @julia_safe_div_257(i64 signext %0, i64 signext %1) #0 {
top:
; @ REPL[9]:2 within `safe_div`
%2 = sdiv i64 %0, %1
ret i64 %2
}
julia> @code_native safe_div(4, 0)
.section __TEXT,__text,regular,pure_instructions
; ┌ @ REPL[9]:1 within `safe_div`
movq %rdi, %rax
; │ @ REPL[9]:2 within `safe_div`
movq %rdi, %rcx
orq %rsi, %rcx
shrq $32, %rcx
je L21
cqto
idivq %rsi
retq
L21:
xorl %edx, %edx
divl %esi
retq
nopw (%rax,%rax)
; └
julia> safe_div(4, 0)
ERROR: DivideError: integer division error
Stacktrace:
[1] safe_div(x::Int64, y::Int64)
@ Main ./REPL[9]:2
And from what I can tell, even on master, even though the LLVM IR now doesn't have the redundant checks, the emitted native code still does?:
julia> function safe_div(x, y)
           if iszero(y) || (x == typemin(x) && y == -1)
               return zero(x)
           end
           return Core.Intrinsics.sdiv_int(x, y)
       end
safe_div (generic function with 1 method)
julia> @code_llvm optimize=true safe_div(Int(1), Int(0))
; @ REPL[21]:1 within `safe_div`
define i64 @julia_safe_div_285(i64 signext %0, i64 signext %1) #0 {
top:
; @ REPL[21]:2 within `safe_div`
; ┌ @ number.jl:42 within `iszero`
; │┌ @ promotion.jl:429 within `==`
%.not = icmp eq i64 %1, 0
; └└
br i1 %.not, label %L14, label %L4
L4: ; preds = %top
; ┌ @ promotion.jl:429 within `==`
%.not1 = icmp ne i64 %0, -9223372036854775808
; └
%2 = icmp ne i64 %1, -1
%3 = or i1 %.not1, %2
br i1 %3, label %L12, label %L14
L12: ; preds = %L4
; @ REPL[21]:5 within `safe_div`
%4 = sdiv i64 %0, %1
ret i64 %4
L14: ; preds = %L4, %top
; @ REPL[21]:3 within `safe_div`
ret i64 0
}
julia> @code_native safe_div(Int(1), Int(0))
.section __TEXT,__text,regular,pure_instructions
; ┌ @ REPL[21]:2 within `safe_div`
; │┌ @ number.jl:42 within `iszero`
; ││┌ @ promotion.jl:429 within `==`
testq %rsi, %rsi
; │└└
je L26
; │ @ REPL[21] within `safe_div`
movabsq $-9223372036854775808, %rax ## imm = 0x8000000000000000
; │ @ REPL[21]:2 within `safe_div`
; │┌ @ promotion.jl:429 within `==`
cmpq %rax, %rdi
; │└
jne L29
cmpq $-1, %rsi
jne L29
; │ @ REPL[21]:3 within `safe_div`
L26:
xorl %eax, %eax
retq
; │ @ REPL[21]:5 within `safe_div`
L29:
movq %rdi, %rax
orq %rsi, %rax
shrq $32, %rax
je L50
movq %rdi, %rax
cqto
idivq %rsi
retq
L50:
movl %edi, %eax
xorl %edx, %edx
divl %esi
retq
nopl (%rax)
; └
sdiv_int is not checked_sdiv_int
Oh oops, good point. @vchuravy, I've edited the above example. :) Thanks
Is this something that you actually care about?
Probably not right now, yeah. Everything in our system is still WAY too slow for this to actually matter at all. :)
BUT we are building our own divide-by-zero checks into our query evaluator, since we want to handle them differently (not with a Julia exception), and it seemed unfortunate to me to then still have the compiled code checking for zero a second time even after we've already checked it...
I wanted to open this discussion now, because one day I could imagine us actually caring about this? :)
I wanted to open this discussion now, because one day I could imagine us actually caring about this? :)
Let's cross that bridge when we get there xD
div/rem/mod are slow because we have to emit an extra test to avoid certain LLVM optimizations. To fix this, I think we need the following:

- New llvm::Constant subtypes llvm::DivZeroUndef <: llvm::UndefValue and llvm::DivZeroConstant <: llvm::Constant --> this allows future optimization passes for undef to still constant fold as before, without needing any changes. Optimization passes for constant values would need to re-wrap the result back into the DivZero type after any constant folding pass; include/llvm/Support/PatternMatch.h might need to be updated to delegate various tests correctly.
- Codegen support for the idiv 0,0 operation (adding a case for our new types in SelectionDAGBuilder::getValueImpl in lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp should be sufficient here, but still might need to verify that downstream passes won't try to further legalize the operation – e.g. if the result is unused).
- SimplifyDiv and SimplifyRem in lib/Analysis/InstructionSimplify.cpp and llvm::ConstantFoldBinaryInstruction in lib/IR/ConstantFold.cpp would need to be updated to emit the new llvm::Constant types (when appropriate).

(I can't find the existing issues, so I opened a new one)
@Keno does this sound right?