JuliaLang / julia


faster div/mod/rem thoughts [llvm] #8188

Closed: vtjnash closed this issue 3 years ago

vtjnash commented 10 years ago

div/rem/mod are slow because we have to emit an extra test to avoid certain LLVM optimizations. To fix this, I think we need the following:

(I can't find the existing issues, so I opened a new one)

@Keno does this sound right?

Keno commented 10 years ago

I think all that's needed is to add a trapping flag (which doesn't exist yet, but should be possible to add) to indicate that the div can't be folded away unless the second argument is proved to be not zero. The constant-folding passes would of course have to be updated. I think that approach would have an easier time going through upstream.

JeffBezanson commented 10 years ago

The trapping flag approach sounds right.

vtjnash commented 10 years ago

The flag part tells you half of it; I'm describing what the flag would do.

StefanKarpinski commented 10 years ago

IIRC, what we determined at the LLVM meetup back in July was that you need a flag to tell LLVM that it is not free to optimize away integer division exceptions unless it can prove that they won't happen. In other words, you want a div instruction such that division by zero is defined and always results in an integer division exception.

simonster commented 10 years ago

Another problem is that div is just slow. Agner Fog's instruction tables say there is a latency of 40-103 cycles and a reciprocal throughput of 25-84 cycles for 64-bit IDIV on Sandy Bridge.

I tried implementing integer division of arrays by a scalar using this cute trick for division by multiplication and bit shifting, which is also what LLVM does when it encounters integer division by a constant. Code here. It is indeed much faster, even though we are slightly handicapped because we also emit an extra test when performing bit shifts. The test results there are for a large array, but the cutoff where it is faster to compute and apply the magic numbers than it is to use the division intrinsics is around 32 elements with Int64/Uint64. This trick could presumably also be used for faster linear indexing of array views.
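
For reference, here is a minimal sketch of the round-up magic-number method in the spirit of Granlund–Montgomery and Hacker's Delight; this is not the code from the gist, and magic/magicdiv are made-up names. It searches for a multiplier m and shift s with 2^(64+s) <= m*d <= 2^(64+s) + 2^s, which guarantees x ÷ d == mulhi(x, m) >> s for all UInt64 x:

function magic(d::UInt64)
    d > 1 || throw(ArgumentError("d must be > 1"))
    for s in 0:63
        m = cld(UInt128(1) << (64 + s), UInt128(d))  # ceil(2^(64+s)/d)
        # correctness condition: the rounding error is at most 2^s,
        # and the multiplier must fit in 64 bits
        if m * d - (UInt128(1) << (64 + s)) <= UInt128(1) << s && m <= typemax(UInt64)
            return UInt64(m), s
        end
    end
    error("d = $d has no 64-bit multiplier; it needs the widened add/shift fixup")
end

# mulhi(x, m) >> s: the high 64 bits of the 128-bit product, then the shift
magicdiv(x::UInt64, m::UInt64, s::Int) = UInt64((UInt128(x) * m) >> 64) >> s

m, s = magic(UInt64(10))  # (0xcccccccccccccccd, 3)
@assert magicdiv(UInt64(12345), m, s) == UInt64(12345) ÷ UInt64(10)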

simonster commented 10 years ago

I updated the gist with numbers for using the LLVM sdiv/udiv intrinsics directly and also fixed a bug in unsigned division. There is a cost to the checks of 15-25% in this benchmark, but part of the reason may be that the check appears to prevent both vectorization and hoisting loads of the array pointers. Still, the magic numbers are 2-3x faster.

timholy commented 10 years ago

That's really cool, @simonster! It's amazing how many lines of code can run faster than a single intrinsic.

StefanKarpinski commented 10 years ago

Less surprising when it's the x86_64 CHKEML instruction that checks your email or the VCHKEML vector instruction which checks multiple email accounts at once.

StefanKarpinski commented 10 years ago

@simonster, have you seen this on the front page of HN today:

http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html

Is this the same as the algorithm you're using?

simonster commented 10 years ago

I submitted that :smile:, so I've seen it, but I haven't implemented the modification yet. Should give a slight boost for the unsigned case.

simonster commented 10 years ago

Fish's method is faster, but not by much and only if we don't care about being able to divide typemax(T). It looks like LLVM can vectorize the classic "round-up" approach for Uint64, but not the distributive approach nor the saturating add approach from Fish's post. Non-saturating add is possible but would give the wrong results for typemax.

jakebolewski commented 10 years ago

This is really cool but won't this eventually work its way into LLVM? It seems like something that should be handled by the LLVM compiler and not in Julia.

vtjnash commented 10 years ago

In the near term, we could see if Julia knows that the denominator is a constant != 0, and skip emitting the check.
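
(Hypothetically, the constant-divisor case can already be inspected from the REPL; div7 is a made-up name and the exact IR varies by Julia/LLVM version:)

julia> div7(x::Int) = x ÷ 7;

julia> @code_llvm div7(42)  # the zero and typemin/-1 tests fold away for the
                            # constant divisor, leaving a multiply-high + shift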

StefanKarpinski commented 10 years ago

@jakebolewski, there are situations where you want to precompute magic numbers but you can't do code gen. One example is multidimensional index calculations – you'd want to store a bunch of these magic numbers in a data structure.
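
A hypothetical sketch of such a data structure, reusing the magic/magicdiv sketch from above: compute the pair once at construction, after which every division is a multiply-high and a shift.

struct MagicDivisor
    d::UInt64
    m::UInt64
    s::Int
    MagicDivisor(d::UInt64) = new(d, magic(d)...)
end

# division by a precomputed divisor: no idiv, no divide-by-zero check
Base.div(x::UInt64, md::MagicDivisor) = magicdiv(x, md.m, md.s)

md = MagicDivisor(UInt64(10))
@assert div(UInt64(12345), md) == UInt64(1234)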

StefanKarpinski commented 10 years ago

Although in those cases it might be good to have as uniform an algorithm as possible at the expense of generality. You almost never have to divide by a very large number with arrays and can afford to sacrifice some bits.

simonster commented 10 years ago

@jakebolewski LLVM already performs this optimization if the divisor is known at compile time. To perform the optimization for loops with division by a loop-invariant divisor that is not known at compile time, it would have to know that the optimization is profitable (for a loop with unknown trip count, this would require metadata or a branch). This doesn't seem completely unreasonable, since LLVM already has a lot of loop metadata for controlling the vectorizer. But the magic number generating functions need to live somewhere, and I'm not sure how that's possible without getting them standardized as part of C22 or whatever version is next. This strategy also isn't really optimal for linear indexing of array views since we'd still be recomputing the magic numbers every time we iterate over the view, instead of computing them once at view construction (and maybe using an LRU cache).

@StefanKarpinski Very true. We could do with 48 bits since at present that's the largest array size possible on extant x86-64 systems. I need to think more about whether that allows any optimizations besides avoiding the saturating increment with Fish's approach.

StefanKarpinski commented 10 years ago

My impression is that you can uniformly use the round up method and the shift amount k can be fixed. IIRC, that means that the only thing you need to remember is the magic multiplier.

timholy commented 10 years ago

@simonster, being able to precompute the magic numbers, and then use them multiple times, would be pretty amazing. That should basically wipe out the overhead of linear indexing, no?

Would we need an LRU cache? Or if we use @inline will the compiler hoist them for us?

simonster commented 10 years ago

I'm not sure it would wipe out the overhead of linear indexing vs. Cartesian indexing. For Cartesian iteration over a matrix we need:

firstidx + (i1-1)*stride1 + (i2-1)*stride2

I think this can be implemented as written with 2 decs, 2 muls, and 2 adds, although a sufficiently smart compiler can leave only an add in the innermost loop. If @StefanKarpinski is correct we can implement linear indexing as:

x = mulhi((i - 1), magic) >> k
firstidx + ((i - 1) - x*sz1)*stride1 + x*stride2

which is 1 dec, 1 single-operand mul, 1 shr, 3 ordinary muls, 1 sub, 2 adds, and maybe some movs. That's not many more instructions, but I think the compiler can only remove the dec from the loop. I think this will still be fewer cycles than an idiv alone, but I don't think it eliminates the overhead of linear indexing entirely. And the performance cost of linear indexing relative to Cartesian iteration will increase with the number of dimensions of the array.
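
A concrete (hypothetical) version of that arithmetic for a matrix with sz1 rows, reusing magic/magicdiv from the earlier sketch in place of idiv:

function linear_to_cartesian(i::UInt64, sz1::UInt64, m::UInt64, s::Int)
    z = i - 1                        # the dec
    x = magicdiv(z, m, s)            # z ÷ sz1 via mulhi and shr
    return (z - x*sz1) + 1, x + 1    # 1-based (i1, i2)
end

m, s = magic(UInt64(5))
@assert linear_to_cartesian(UInt64(7), UInt64(5), m, s) == (2, 2)  # 5×n matrix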

I guess we need to decide if this is worthwhile at all given that caveat. I think it's worth seeing how much overhead there is in practice, but in the long term we will probably still want to push people toward Cartesian indexing (or maybe map if we make it fast).

I was thinking that we'd use an LRU cache for array view construction to avoid having to recompute the magic number when multiple views of the same size are constructed in succession, since you might be doing something like:

for i = 1:size(X, 2)
    f(view(X, a:b, a:b, i))
end

The indexing operations themselves can certainly be inlined, but the magic number computation is too much code for that to be reasonable.
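
A minimal sketch of that idea, assuming the magic function from above (a plain Dict memo rather than a true LRU; a real cache would bound its size):

const MAGIC_CACHE = Dict{UInt64,Tuple{UInt64,Int}}()

# look up the magic pair for d, computing and storing it on first use
cached_magic(d::UInt64) = get!(() -> magic(d), MAGIC_CACHE, d)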

NHDaly commented 3 years ago

I read through this issue, but it seems to cover many different topics, so I'm not sure what the focus is.

Does this also cover the potential optimization of eliding the divide-by-zero check in cases where the compiler can prove the denominator isn't zero, such as in the simple function below? Or does it only cover cases where the denominator is known at compile time?

Here's the small example where I as a human can tell that y will never be 0:

julia> function safe_div(x::Int, y::Int)
           if y === 0
               return 0
           end
           return x ÷ y
       end
safe_div (generic function with 1 method)

julia> @code_llvm safe_div(4, 0)
;  @ REPL[43]:1 within `safe_div'
define i64 @julia_safe_div_561(i64 signext %0, i64 signext %1) {
top:
;  @ REPL[43]:2 within `safe_div'
  %.not = icmp eq i64 %1, 0
  br i1 %.not, label %L3, label %L4

L3:                                               ; preds = %top
;  @ REPL[43]:3 within `safe_div'
  ret i64 0

L4:                                               ; preds = %top
;  @ REPL[43]:5 within `safe_div'
; ┌ @ int.jl:261 within `div'
   %2 = icmp ne i64 %1, -1
   %3 = icmp ne i64 %0, -9223372036854775808
   %4 = or i1 %3, %2
   br i1 %4, label %pass, label %fail

fail:                                             ; preds = %L4
   call void @jl_throw({}* inttoptr (i64 4570615184 to {}*))
   unreachable

pass:                                             ; preds = %L4
   %5 = sdiv i64 %0, %1
; └
  ret i64 %5
}

You can see that it's retained the check in L4 and the throw DivideError code in the fail: block.

oscardssmith commented 3 years ago

My guess is that this is all very obsolete and should be closed. @vtjnash, can you confirm?

vchuravy commented 3 years ago

Probably should be closed.

@NHDaly The LLVM optimization is peculiar: it removed the null check, replacing it with true rather than false, and thus didn't fold the rest...

julia> @code_llvm optimize=false safe_div(0, 1)
;  @ REPL[4]:1 within `safe_div`
define i64 @julia_safe_div_323(i64 signext %0, i64 signext %1) #0 {
top:
  %2 = call {}*** @julia.get_pgcstack()
  %3 = bitcast {}*** %2 to {}**
  %current_task = getelementptr inbounds {}*, {}** %3, i64 2305843009213693940
  %4 = bitcast {}** %current_task to i64*
  %world_age = getelementptr inbounds i64, i64* %4, i64 13
;  @ REPL[4]:2 within `safe_div`
  %5 = icmp eq i64 %1, 0
  %6 = zext i1 %5 to i8
  %7 = trunc i8 %6 to i1
  %8 = xor i1 %7, true
  br i1 %8, label %L4, label %L3

L3:                                               ; preds = %top
;  @ REPL[4]:3 within `safe_div`
  ret i64 0

L4:                                               ; preds = %top
;  @ REPL[4]:5 within `safe_div`
; ┌ @ int.jl:278 within `div`
   %9 = icmp ne i64 %0, -9223372036854775808
   %10 = icmp ne i64 %1, -1
   %11 = or i1 %10, %9
   %12 = icmp ne i64 %1, 0
   %13 = and i1 %12, %11
   br i1 %13, label %pass, label %fail

fail:                                             ; preds = %L4
   call void @jl_throw({}* inttoptr (i64 140192384492224 to {}*))
   unreachable

pass:                                             ; preds = %L4
   %14 = sdiv i64 %0, %1
; └
  ret i64 %14
}

is the input; after optimization, the:

   %12 = icmp ne i64 %1, 0
   %13 = and i1 %12, %11

is gone.

vchuravy commented 3 years ago

Ah, silly me, this is:

julia> div(typemin(Int), -1)
ERROR: DivideError: integer division error
Stacktrace:
 [1] div(x::Int64, y::Int64)
   @ Base ./int.jl:278
 [2] top-level scope
   @ REPL[1]:1

NHDaly commented 3 years ago

Oh, very interesting! Thanks @oscardssmith and @vchuravy.

🤔 Even if I had a manual check for that case too, it doesn't compile away, and I end up with a redundant check:

julia> function safe_div(x,y)
           if y === 0 || (x === typemin(Int) && y === -1)
               return 0
           end
           return x ÷ y
       end
safe_div (generic function with 1 method)

julia> @code_llvm safe_div(4, 0)
;  @ REPL[53]:1 within `safe_div'
define i64 @julia_safe_div_580(i64 signext %0, i64 signext %1) {
top:
;  @ REPL[53]:2 within `safe_div'
  %2 = icmp eq i64 %1, 0
  br i1 %2, label %L12, label %L10

L10:                                              ; preds = %top
  %3 = icmp eq i64 %0, -9223372036854775808
  %4 = icmp eq i64 %1, -1
  %value_phi1 = and i1 %3, %4
  br i1 %value_phi1, label %L12, label %L13

L12:                                              ; preds = %L10, %top
;  @ REPL[53]:3 within `safe_div'
  ret i64 0

L13:                                              ; preds = %L10
;  @ REPL[53]:5 within `safe_div'
; ┌ @ int.jl:261 within `div'
   %5 = icmp ne i64 %1, -1
   %6 = icmp ne i64 %0, -9223372036854775808
   %7 = or i1 %6, %5
   br i1 %7, label %pass, label %fail

fail:                                             ; preds = %L13
   call void @jl_throw({}* inttoptr (i64 4570615184 to {}*))
   unreachable

pass:                                             ; preds = %L13
   %8 = sdiv i64 %0, %1
; └
  ret i64 %8
}

Notice how L10 and L13 are basically the same, but L13 was injected by LLVM?

vtjnash commented 3 years ago

Some null check elision may be having trouble with our monotonic annotations, so we may need to convince LLVM it is still a valid optimization.

It does quite well on unsigned numbers:

julia> function safe_div(x, y)
           if iszero(y)
               return zero(x)
           end
           return x ÷ y
       end
safe_div (generic function with 2 methods)

julia> @code_llvm optimize=true safe_div(UInt(1), UInt(0))
;  @ REPL[10]:1 within `safe_div`
define i64 @julia_safe_div_1146(i64 zeroext %0, i64 zeroext %1) #0 {
top:
;  @ REPL[10]:2 within `safe_div`
; ┌ @ number.jl:42 within `iszero`
; │┌ @ promotion.jl:429 within `==`
    %.not = icmp eq i64 %1, 0
; └└
  br i1 %.not, label %L3, label %pass

L3:                                               ; preds = %top
;  @ REPL[10]:3 within `safe_div`
  ret i64 0

pass:                                             ; preds = %top
;  @ REPL[10]:5 within `safe_div`
; ┌ @ int.jl:280 within `div`
   %2 = udiv i64 %0, %1
; └
  ret i64 %2
}

julia> function safe_div(x, y)
           if iszero(y)
               x, y = zero(x), one(y)
           end
           return x ÷ y
       end
safe_div (generic function with 2 methods)

julia> @code_llvm optimize=true safe_div(UInt(1), UInt(0))
;  @ REPL[14]:1 within `safe_div`
define i64 @julia_safe_div_1150(i64 zeroext %0, i64 zeroext %1) #0 {
top:
;  @ REPL[14]:2 within `safe_div`
; ┌ @ number.jl:42 within `iszero`
; │┌ @ promotion.jl:429 within `==`
    %.not = icmp eq i64 %1, 0
; └└
  %.2 = select i1 %.not, i64 1, i64 %1
  %. = select i1 %.not, i64 0, i64 %0
;  @ REPL[14]:5 within `safe_div`
; ┌ @ int.jl:280 within `div`
   %2 = udiv i64 %., %.2
; └
  ret i64 %2
}

But I am not sure LLVM can currently track multiple properties of a number, to see that the compare is dead.

vchuravy commented 3 years ago

function safe_div(x,y)
                  if y === 0 || !(x !== typemin(Int) || y !== -1)
                      return 0
                  end
                  return x ÷ y
              end

This does optimize, but I needed to match the form precisely. Is this something that you actually care about? If so, we can look into teaching LLVM that optimization (or just use Core.Intrinsics.sdiv_int in safe_div).

vtjnash commented 3 years ago

These also work for me:

julia> function safe_div(x, y)
           if iszero(y) || (x == typemin(x) && y == -1)
               x, y = zero(x), one(y)
           end
           return x ÷ y
       end
safe_div (generic function with 1 method)

julia> @code_llvm optimize=true safe_div(Int(1), Int(0))
;  @ REPL[3]:1 within `safe_div`
define i64 @julia_safe_div_1008(i64 signext %0, i64 signext %1) #0 {
top:
;  @ REPL[3]:2 within `safe_div`
; ┌ @ number.jl:42 within `iszero`
; │┌ @ promotion.jl:429 within `==`
    %.not = icmp eq i64 %1, 0
; └└
  br i1 %.not, label %pass, label %L4

L4:                                               ; preds = %top
; ┌ @ promotion.jl:429 within `==`
   %.not3 = icmp ne i64 %0, -9223372036854775808
; └
  %2 = icmp ne i64 %1, -1
  %3 = or i1 %.not3, %2
  %spec.select = select i1 %3, i64 %1, i64 1
  %spec.select8 = select i1 %3, i64 %0, i64 0
  br label %pass

pass:                                             ; preds = %L4, %top
  %value_phi17 = phi i64 [ 1, %top ], [ %spec.select, %L4 ]
  %value_phi6 = phi i64 [ 0, %top ], [ %spec.select8, %L4 ]
;  @ REPL[3]:5 within `safe_div`
; ┌ @ int.jl:278 within `div`
   %4 = sdiv i64 %value_phi6, %value_phi17
; └
  ret i64 %4
}

julia> function safe_div(x, y)
           if iszero(y) || (x == typemin(x) && y == -1)
               return zero(x)
           end
           return x ÷ y
       end
safe_div (generic function with 1 method)

julia> @code_llvm optimize=true safe_div(Int(1), Int(0))
;  @ REPL[5]:1 within `safe_div`
define i64 @julia_safe_div_1038(i64 signext %0, i64 signext %1) #0 {
top:
;  @ REPL[5]:2 within `safe_div`
; ┌ @ number.jl:42 within `iszero`
; │┌ @ promotion.jl:429 within `==`
    %.not = icmp eq i64 %1, 0
; └└
  br i1 %.not, label %L14, label %L4

L4:                                               ; preds = %top
; ┌ @ promotion.jl:429 within `==`
   %.not1 = icmp ne i64 %0, -9223372036854775808
; └
  %2 = icmp ne i64 %1, -1
  %3 = or i1 %.not1, %2
  br i1 %3, label %pass, label %L14

L14:                                              ; preds = %L4, %top
;  @ REPL[5]:3 within `safe_div`
  ret i64 0

pass:                                             ; preds = %L4
;  @ REPL[5]:5 within `safe_div`
; ┌ @ int.jl:278 within `div`
   %4 = sdiv i64 %0, %1
; └
  ret i64 %4
}

vtjnash commented 3 years ago

oh, I see, LLVM v12 (aka Julia master) seems to optimize this, but LLVM v11 did not.

This may be a good change to make for ranges.jl, since it would be beneficial for LLVM to be able to prove that computing length is error-free.
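
For instance (a hypothetical simplification, not the actual ranges.jl code), the length of a step range is essentially one integer division, so a div that provably cannot throw would make the whole computation branch-free and hoistable:

# roughly what length(start:step:stop) has to compute
range_length(start::Int, step::Int, stop::Int) = div(stop - start, step) + 1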

NHDaly commented 3 years ago

Oh wow, thanks everyone! Very interesting.

Also, 🤔 I think looking at LLVM might not be the right thing here, since it seems that there's (another?) pass that occurs during native codegen, where sdiv is expanded into a bounds check?:

julia> function safe_div(x,y)
           return Core.Intrinsics.sdiv_int(x,y)
       end
safe_div (generic function with 1 method)

julia> @code_llvm safe_div(4, 0)
;  @ REPL[9]:1 within `safe_div`
define i64 @julia_safe_div_257(i64 signext %0, i64 signext %1) #0 {
top:
;  @ REPL[9]:2 within `safe_div`
  %2 = sdiv i64 %0, %1
  ret i64 %2
}

julia> @code_native safe_div(4, 0)
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ REPL[9]:1 within `safe_div`
        movq    %rdi, %rax
; │ @ REPL[9]:2 within `safe_div`
        movq    %rdi, %rcx
        orq     %rsi, %rcx
        shrq    $32, %rcx
        je      L21
        cqto
        idivq   %rsi
        retq
L21:
        xorl    %edx, %edx
        divl    %esi
        retq
        nopw    (%rax,%rax)
; └

julia> safe_div(4, 0)
ERROR: DivideError: integer division error
Stacktrace:
 [1] safe_div(x::Int64, y::Int64)
   @ Main ./REPL[9]:2

And from what I can tell, even on master, even though the LLVM IR no longer has the redundant checks, the emitted native code still does?:

julia> function safe_div(x, y)
           if iszero(y) || (x == typemin(x) && y == -1)
               return zero(x)
           end
           return Core.Intrinsics.sdiv_int(x, y)
       end
safe_div (generic function with 1 method)

julia> @code_llvm optimize=true safe_div(Int(1), Int(0))
;  @ REPL[21]:1 within `safe_div`
define i64 @julia_safe_div_285(i64 signext %0, i64 signext %1) #0 {
top:
;  @ REPL[21]:2 within `safe_div`
; ┌ @ number.jl:42 within `iszero`
; │┌ @ promotion.jl:429 within `==`
    %.not = icmp eq i64 %1, 0
; └└
  br i1 %.not, label %L14, label %L4

L4:                                               ; preds = %top
; ┌ @ promotion.jl:429 within `==`
   %.not1 = icmp ne i64 %0, -9223372036854775808
; └
  %2 = icmp ne i64 %1, -1
  %3 = or i1 %.not1, %2
  br i1 %3, label %L12, label %L14

L12:                                              ; preds = %L4
;  @ REPL[21]:5 within `safe_div`
  %4 = sdiv i64 %0, %1
  ret i64 %4

L14:                                              ; preds = %L4, %top
;  @ REPL[21]:3 within `safe_div`
  ret i64 0
}

julia> @code_native safe_div(Int(1), Int(0))
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ REPL[21]:2 within `safe_div`
; │┌ @ number.jl:42 within `iszero`
; ││┌ @ promotion.jl:429 within `==`
        testq   %rsi, %rsi
; │└└
        je      L26
; │ @ REPL[21] within `safe_div`
        movabsq $-9223372036854775808, %rax     ## imm = 0x8000000000000000
; │ @ REPL[21]:2 within `safe_div`
; │┌ @ promotion.jl:429 within `==`
        cmpq    %rax, %rdi
; │└
        jne     L29
        cmpq    $-1, %rsi
        jne     L29
; │ @ REPL[21]:3 within `safe_div`
L26:
        xorl    %eax, %eax
        retq
; │ @ REPL[21]:5 within `safe_div`
L29:
        movq    %rdi, %rax
        orq     %rsi, %rax
        shrq    $32, %rax
        je      L50
        movq    %rdi, %rax
        cqto
        idivq   %rsi
        retq
L50:
        movl    %edi, %eax
        xorl    %edx, %edx
        divl    %esi
        retq
        nopl    (%rax)
; └

vchuravy commented 3 years ago

sdiv_int is not checked_sdiv_int
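
For reference, x ÷ y on machine integers lowers to the checked intrinsic, which carries the zero/overflow test, while the raw intrinsic does not:

julia> Core.Intrinsics.sdiv_int(7, 2)          # raw: no Julia-level check
3

julia> Core.Intrinsics.checked_sdiv_int(7, 0)  # checked: throws
ERROR: DivideError: integer division error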

NHDaly commented 3 years ago

Oh oops, good point. @vchuravy I've edited the above example. :) Thanks

NHDaly commented 3 years ago

Is this something that you actually care about?

Probably not right now, yeah. Everything in our system is still WAY too slow for this to actually matter at all. :)

BUT we are building our own divide-by-zero checks into our query evaluator, since we want to handle them differently (not with a Julia exception), and it seemed unfortunate to me to then still have the compiled code checking for zero a second time even after we've already checked it...

I wanted to open this discussion now, because one day I could imagine us actually caring about this? :)

vchuravy commented 3 years ago

I wanted to open this discussion now, because one day I could imagine us actually caring about this? :)

Let's cross that bridge when we get there xD