Open · Jutho opened this issue 8 years ago
FWIW, adding `@simd` in front of `for i in 1:j-1` seems to give a benefit for larger matrices, although not a very big one. [Plot not captured: y-axis is timing relative to LAPACK.]
I could easily imagine the advantage over LAPACK for small matrices, but for larger ones the difference is quite odd.
FWIW, it appears that LAPACK uses forward iteration.
I guess you mean the disadvantage of LAPACK for small matrices?
My motivation for looking into this was indeed small'ish matrices (e.g. what you need for GMRES, something in the range 10-50). But I was also surprised to find such a difference for large matrices. Maybe somebody with MKL installed can try to benchmark `A_ldiv_B!` in that case?
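(For anyone trying this, a minimal sketch of such a benchmark, using the Julia 0.5-era name `A_ldiv_B!`; with MKL one would rebuild Julia against MKL and rerun the same script.)

using BenchmarkTools
for n in (10, 20, 50)                              # small'ish, GMRES-sized systems
    A = UpperTriangular(eye(n) + randn(n,n)/(2n))  # well-conditioned upper triangular matrix
    b = rand(n)
    bench = @benchmark A_ldiv_B!($A, $(copy(b)))   # in-place, LAPACK-backed solve
    println("n = $n: ", median(bench))
end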
Yes, I meant to write over, not of (I fixed the comment above).
I also saw an extra small improvement using `muladd` on the inner loop (when using a Haswell-compiled sysimg).
I guess we need to do some benchmarking to figure out where LAPACK is slower. Unlike BLAS operations, where you can optimize blocking and cache sizes, in this case you can't really do anything clever, so with some proper `@simd` and `@inbounds` you should have basically the optimal code.
IIRC: My mental model suggested that the performance difference should go the other way. Specifically, the `for i in j-1:-1:1` inner loop order should yield sequential (if 'backwards') access of `A` and `b`, whereas the `for i in 1:j-1` inner loop order should involve size-`j` access jumps of `A` and `b` on each outer loop iteration. (Though that may well be incorrect.) Then on the particular problem / hardware I tested, the counterintuitive performance difference between the inner loop orderings seemed small. So I conjectured the performance difference was a mystery of the hardware/compiler and volatile, left the inner loop in the intuitive order, and moved on. Ref. #14471 and #14475 for background on that commit. Will have a closer look at this later today and report back. Best!
I observe a performance difference similar to that shown above (on the machine I tested with previously, and through the problem size I tested previously, and with tests of that problem size completing in roughly the same time they did previously). Present benchmark:
function fwdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    @inbounds for j in n:-1:1
        A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        xj = x[j] = A.data[j,j] \ b[j]
        for i in 1:(j-1) # counterintuitively 1:j-1 performs slightly better
            b[i] -= A.data[i,j] * xj
        end
    end
    x
end
function bwdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    @inbounds for j in n:-1:1
        A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        xj = x[j] = A.data[j,j] \ b[j]
        for i in j-1:-1:1 # the 'intuitive' sequential-access order
            b[i] -= A.data[i,j] * xj
        end
    end
    x
end
using BenchmarkTools
for n in (10, 100, 1000, 5000)
    A = UpperTriangular(eye(n) + randn(n,n)/(2n))
    bo = rand(n)
    b = similar(bo)
    x = similar(bo)
    copy!(b, bo); fwdipbench = @benchmark fwdnaivesub!($A, $b)
    copy!(b, bo); bwdipbench = @benchmark bwdnaivesub!($A, $b)
    copy!(b, bo); fwdopbench = @benchmark fwdnaivesub!($A, $b, $x)
    copy!(b, bo); bwdopbench = @benchmark bwdnaivesub!($A, $b, $x)
    print("$n x $n UpperTriangular solve, in place, bwd/fwd, ")
    println(judge(median(bwdipbench), median(fwdipbench)))
    print("$n x $n UpperTriangular solve, out of place, bwd/fwd, ")
    println(judge(median(bwdopbench), median(fwdopbench)))
    print("$n x $n UpperTriangular solve, bwd, in place / out of place, ")
    println(judge(median(bwdipbench), median(bwdopbench)))
    println()
end
Results:
10 x 10 UpperTriangular solve, in place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +96.59% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10 x 10 UpperTriangular solve, out of place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +59.29% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10 x 10 UpperTriangular solve, bwd, in place / out of place, BenchmarkTools.TrialJudgement:
  time:   -3.89% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

100 x 100 UpperTriangular solve, in place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +147.21% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
100 x 100 UpperTriangular solve, out of place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +143.52% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
100 x 100 UpperTriangular solve, bwd, in place / out of place, BenchmarkTools.TrialJudgement:
  time:   -0.17% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

1000 x 1000 UpperTriangular solve, in place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +69.07% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
1000 x 1000 UpperTriangular solve, out of place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +68.34% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
1000 x 1000 UpperTriangular solve, bwd, in place / out of place, BenchmarkTools.TrialJudgement:
  time:   -0.23% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

5000 x 5000 UpperTriangular solve, in place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +64.42% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
5000 x 5000 UpperTriangular solve, out of place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +64.63% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
5000 x 5000 UpperTriangular solve, bwd, in place / out of place, BenchmarkTools.TrialJudgement:
  time:   -0.20% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
In contrast to https://github.com/JuliaLang/julia/pull/14475#issue-123723874, the LLVM IR / native code for the two inner loop orderings now differ substantially. The LLVM IR suggests that the forward iteration vectorizes more easily.
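For reference, a minimal way to inspect the generated code for the two orderings is something like the following sketch (Julia 0.5-era, matching the code above; on Julia 0.7+ `@code_llvm` additionally requires `using InteractiveUtils`):

n = 100
A = UpperTriangular(eye(n) + randn(n,n)/(2n))
b = rand(n)
@code_llvm fwdnaivesub!(A, b)   # inner loop should show vectorized (<N x double>) operations
@code_llvm bwdnaivesub!(A, b)   # compare with the scalar loop body generated here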
`@simd` decoration improves performance of both inner loop orderings on this machine (Core i7-3520M), though backward iteration still lags significantly behind forward iteration. Loose working theory: the non-packed nature of the underlying storage renders the intuition I expressed above incorrect (access jumps occur with either ordering), and the compiler has an easier time reasoning about (and consequently generates better code for) the forward iteration. Testing a packed storage format would be interesting. `@simd` decoration benchmark:
function fwdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        for i in 1:(j-1) # counterintuitively 1:j-1 performs slightly better
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
function fwdsimdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        @simd for i in 1:(j-1)
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
function bwdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        for i in j-1:-1:1
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
function bwdsimdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        @simd for i in j-1:-1:1
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
using BenchmarkTools
for n in (10, 100, 1000, 5000, 10000, 15000)
    A = UpperTriangular(eye(n) + randn(n,n)/(2n))
    bo = rand(n)
    b = similar(bo)
    x = similar(bo)
    copy!(b, bo); fwdipbench = @benchmark fwdnaivesub!($A, $b)
    copy!(b, bo); bwdipbench = @benchmark bwdnaivesub!($A, $b)
    copy!(b, bo); fwdsimdipbench = @benchmark fwdsimdnaivesub!($A, $b)
    copy!(b, bo); bwdsimdipbench = @benchmark bwdsimdnaivesub!($A, $b)
    print("$n x $n UpperTriangular solve, in place, simdbwd/bwd, ")
    println(judge(median(bwdsimdipbench), median(bwdipbench)))
    print("$n x $n UpperTriangular solve, in place, simdfwd/fwd, ")
    println(judge(median(fwdsimdipbench), median(fwdipbench)))
    print("$n x $n UpperTriangular solve, in place, fwd/bwd, ")
    println(judge(median(fwdipbench), median(bwdipbench)))
    print("$n x $n UpperTriangular solve, in place, simdfwd/simdbwd, ")
    println(judge(median(fwdsimdipbench), median(bwdsimdipbench)))
    println()
end
Results:
10 x 10 UpperTriangular solve, in place, simdbwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -1.90% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10 x 10 UpperTriangular solve, in place, simdfwd/fwd, BenchmarkTools.TrialJudgement:
  time:   -1.00% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10 x 10 UpperTriangular solve, in place, fwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -36.82% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10 x 10 UpperTriangular solve, in place, simdfwd/simdbwd, BenchmarkTools.TrialJudgement:
  time:   -36.25% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

100 x 100 UpperTriangular solve, in place, simdbwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -5.08% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
100 x 100 UpperTriangular solve, in place, simdfwd/fwd, BenchmarkTools.TrialJudgement:
  time:   -20.46% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
100 x 100 UpperTriangular solve, in place, fwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -59.00% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
100 x 100 UpperTriangular solve, in place, simdfwd/simdbwd, BenchmarkTools.TrialJudgement:
  time:   -65.65% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

1000 x 1000 UpperTriangular solve, in place, simdbwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -22.86% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
1000 x 1000 UpperTriangular solve, in place, simdfwd/fwd, BenchmarkTools.TrialJudgement:
  time:   -11.52% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
1000 x 1000 UpperTriangular solve, in place, fwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -40.99% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
1000 x 1000 UpperTriangular solve, in place, simdfwd/simdbwd, BenchmarkTools.TrialJudgement:
  time:   -32.31% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

5000 x 5000 UpperTriangular solve, in place, simdbwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -25.34% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
5000 x 5000 UpperTriangular solve, in place, simdfwd/fwd, BenchmarkTools.TrialJudgement:
  time:   -4.28% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
5000 x 5000 UpperTriangular solve, in place, fwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -38.40% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
5000 x 5000 UpperTriangular solve, in place, simdfwd/simdbwd, BenchmarkTools.TrialJudgement:
  time:   -21.03% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

10000 x 10000 UpperTriangular solve, in place, simdbwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -25.90% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10000 x 10000 UpperTriangular solve, in place, simdfwd/fwd, BenchmarkTools.TrialJudgement:
  time:   -2.34% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10000 x 10000 UpperTriangular solve, in place, fwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -37.49% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10000 x 10000 UpperTriangular solve, in place, simdfwd/simdbwd, BenchmarkTools.TrialJudgement:
  time:   -17.61% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

15000 x 15000 UpperTriangular solve, in place, simdbwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -26.47% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
15000 x 15000 UpperTriangular solve, in place, simdfwd/fwd, BenchmarkTools.TrialJudgement:
  time:   -5.25% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
15000 x 15000 UpperTriangular solve, in place, fwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -37.86% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
15000 x 15000 UpperTriangular solve, in place, simdfwd/simdbwd, BenchmarkTools.TrialJudgement:
  time:   -19.93% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
I also don't understand your initial intuition about the jumps in every outer loop. OK, in forward ordering there is a memory jump from the first two statements in the outer loop to the start of the inner loop, but then going from the end of the inner loop to the first statement of the next outer iteration is not really a jump. In the backward iteration there is no jump from the statements in the outer loop to the inner loop, but then there is a jump from the end of the inner loop to the first statement in the next iteration of the outer loop. So both strategies probably incur a comparable memory jump per outer iteration, but the forward iteration is more amenable to other optimizations.
Thinking more carefully via writing:

Forward inner loop iteration, in place

On entering the outer loop, we read `A[j,j]` and read/write `b[j]`. The last-touched entries were `A[j,j+1]` and `b[j]`, so we incur: (1) no jump in `b`; and (2) a length-`N` backward jump through `A` where the storage underlying `A` is not packed, or a length-`j` backward jump through `A` where the storage underlying `A` is packed.

On entering the inner loop, we then read `A[1,j]` and read/write `b[1]`, incurring length-`j-1` backward jumps through both `A` and `b`. We then iterate sequentially through `A` and `b` till `A[j-1,j]` and `b[j-1]`, incurring no additional jumps.

So for each outer loop iteration, forward inner iteration involves one length-`N` (length-`j` packed) jump backward through `A`, one length-`j-1` jump backward through `A`, and one length-`j-1` jump backward through `b`.

Backward inner loop iteration, in place

On entering the outer loop, we read `A[j,j]` and read/write `b[j]`. The last-touched entries were `A[1,j+1]` and `b[1]`, so we incur: (1) a length-`j-1` forward jump through `b`; and (2) a length-`N-j+1` backward jump through `A` where the storage underlying `A` is not packed, or sequential (backward) access through `A` where the storage underlying `A` is packed.

On entering the inner loop, we then read `A[j-1,j]` and read/write `b[j-1]`, continuing sequential (backward) access through both `A` and `b`. We then iterate sequentially (backward) through `A` and `b` till `A[1,j]` and `b[1]`, incurring no jumps.

So for each outer loop iteration, backward inner iteration involves one length-`N-j+1` backward jump through `A` (none for packed underlying storage) and one length-`j-1` forward jump through `b`.

Overall

Overall the jumps through `b` are comparable, and in any case `b` should stay relatively high in the cache hierarchy, such that those jumps are less costly than jumps through `A`. Forward iteration appears to involve two jumps through `A` per outer loop iteration that the machine may have trouble predicting, whereas backward iteration appears to involve only one.
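To make the stride arithmetic above concrete, here is a small sketch (`lin` is just an illustrative helper introduced for this note) computing the column-major linear-index deltas behind the jump lengths claimed above:

lin(i, j, N) = i + (j - 1)*N   # column-major linear index of A[i,j] with stride N
N = 8; j = 5
# Forward inner ordering: last touch was A[j,j+1]; next outer iteration reads A[j,j].
@show lin(j, j, N) - lin(j, j+1, N)   # -N, i.e. a length-N backward jump through A
# Backward inner ordering: last touch was A[1,j+1]; next outer iteration reads A[j,j].
@show lin(j, j, N) - lin(1, j+1, N)   # -(N-j+1), i.e. a length-N-j+1 backward jump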
I imagine I'm making a mistake somewhere above? Thoughts @jutho? Thanks and best!
This looks OK; certainly more careful and less sloppy than my reasoning (I just assumed the access pattern of `A` would be the same as that of `b`, thus ignoring its matrix and triangular character). Then I guess there is no good explanation for the observed timings, other than that Julia/LLVM is able to optimize the forward iteration better.
The intriguing part, however, is still the time difference with LAPACK.

> The intriguing part, however, is still the time difference with LAPACK.
Agreed, that discrepancy is curious.
Would decorating with `@simd` and switching the inner loop order impact your application's performance enough to warrant a PR? If so, I would be happy to prepare such a PR or review one you submit. Best!
I can prepare a PR, but I think the more general question is whether we want to use `naivesub!` over LAPACK always. Currently, `naivesub!` is only used for those `eltype`s not supported by LAPACK, like `BigFloat`. I have not benchmarked, but I assume there is no `@simd` benefit in that case and that the difference between the two loop orders is less significant.
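(A quick way to check would be something like the following sketch, reusing the `fwdnaivesub!`/`bwdnaivesub!` definitions above; `@simd` should be a no-op for a non-primitive eltype like `BigFloat`.)

using BenchmarkTools
n = 100
Abig = UpperTriangular(map(BigFloat, eye(n) + randn(n,n)/(2n)))
bbig = map(BigFloat, rand(n))
fwdbig = @benchmark fwdnaivesub!($Abig, $(copy(bbig)))
bwdbig = @benchmark bwdnaivesub!($Abig, $(copy(bbig)))
println(judge(median(fwdbig), median(bwdbig)))   # fwd vs bwd loop order for BigFloat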
Using `naivesub!` across the board seems reasonable at first blush. Edge cases of concern? Best!
Howdy! I need to solve a lot of triangular systems (mostly dense arrays) and I'm wondering if these benchmarks you guys did (thank you!) are still accurate for the latest Julia. Can we revive this conversation?
> I'm wondering if these benchmarks you guys did (thank you!) are still accurate for the latest Julia. Can we revive this conversation?
Good question! If you have the requisite spare bandwidth, perhaps trial the benchmarks in https://github.com/JuliaLang/julia/issues/18371#issuecomment-246154253 on master? If you make an attempt and run into any challenges, please don't hesitate to ping me on slack for a second pair of eyes! :) Best!
Following @Sacha0's advice, I updated the benchmark code from https://github.com/JuliaLang/julia/issues/18371#issuecomment-246154253 (no change to the functions being benchmarked) as follows:
using LinearAlgebra
function fwdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        for i in 1:(j-1) # counterintuitively 1:j-1 performs slightly better
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
function fwdsimdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        @simd for i in 1:(j-1)
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
function bwdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        for i in j-1:-1:1
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
function bwdsimdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        @simd for i in j-1:-1:1
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
using BenchmarkTools
using Statistics
for n in (10, 100, 1000, 5000, 10000, 15000)
    println("$n x $n UpperTriangular solve, in place")
    A = UpperTriangular(Matrix{Float64}(I, n, n) + randn(n,n)/(2n))
    bo = rand(n)
    b = similar(bo)
    x = similar(bo)
    copyto!(b, bo); fwdipbench = @benchmark fwdnaivesub!($A, $b)
    copyto!(b, bo); bwdipbench = @benchmark bwdnaivesub!($A, $b)
    copyto!(b, bo); fwdsimdipbench = @benchmark fwdsimdnaivesub!($A, $b)
    copyto!(b, bo); bwdsimdipbench = @benchmark bwdsimdnaivesub!($A, $b)
    println("simdbwd/bwd:")
    println(judge(median(bwdsimdipbench), median(bwdipbench)))
    println("simdfwd/fwd:")
    println(judge(median(fwdsimdipbench), median(fwdipbench)))
    println("fwd/bwd:")
    println(judge(median(fwdipbench), median(bwdipbench)))
    println("simdfwd/simdbwd:")
    println(judge(median(fwdsimdipbench), median(bwdsimdipbench)))
    println()
end
On:
Julia Version 1.1.0-DEV.115
Commit eb8a9333b0 (2018-08-24 19:12 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, sandybridge)
I get the following results (note the simd vs not-simd results for smaller matrices were highly variable - sometimes +50%, sometimes -50%):
10 x 10 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(-25.32% => improvement)
simdfwd/fwd:
TrialJudgement(+1.03% => invariant)
fwd/bwd:
TrialJudgement(-15.76% => improvement)
simdfwd/simdbwd:
TrialJudgement(+13.95% => regression)
100 x 100 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+20.65% => regression)
simdfwd/fwd:
TrialJudgement(+9.67% => regression)
fwd/bwd:
TrialJudgement(-49.51% => improvement)
simdfwd/simdbwd:
TrialJudgement(-54.11% => improvement)
1000 x 1000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+8.28% => regression)
simdfwd/fwd:
TrialJudgement(+0.57% => invariant)
fwd/bwd:
TrialJudgement(-31.61% => improvement)
simdfwd/simdbwd:
TrialJudgement(-36.48% => improvement)
5000 x 5000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+11.16% => regression)
simdfwd/fwd:
TrialJudgement(+0.08% => invariant)
fwd/bwd:
TrialJudgement(-19.51% => improvement)
simdfwd/simdbwd:
TrialJudgement(-27.54% => improvement)
10000 x 10000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+2.42% => invariant)
simdfwd/fwd:
TrialJudgement(-0.37% => invariant)
fwd/bwd:
TrialJudgement(-20.56% => improvement)
simdfwd/simdbwd:
TrialJudgement(-22.73% => improvement)
15000 x 15000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+1.58% => invariant)
simdfwd/fwd:
TrialJudgement(-0.63% => invariant)
fwd/bwd:
TrialJudgement(-20.46% => improvement)
simdfwd/simdbwd:
TrialJudgement(-22.19% => improvement)
I've decided to add `LAPACK.trtrs!` (which is currently used by `ldiv!`) and `BLAS.trsv!` to the list (because of https://discourse.julialang.org/t/why-lapack-trtrs-not-blas-trsv/14239):
for n in (10, 100, 1000, 5000, 10000, 15000)
    println("$n x $n UpperTriangular solve, in place")
    A = UpperTriangular(Matrix{Float64}(I, n, n) + randn(n,n)/(2n))
    bo = rand(n)
    b = similar(bo)
    x = similar(bo)
    fwdipbench = @benchmark fwdnaivesub!($A, $b) setup = copyto!($b, $bo)
    bwdipbench = @benchmark bwdnaivesub!($A, $b) setup = copyto!($b, $bo)
    fwdsimdipbench = @benchmark fwdsimdnaivesub!($A, $b) setup = copyto!($b, $bo)
    bwdsimdipbench = @benchmark bwdsimdnaivesub!($A, $b) setup = copyto!($b, $bo)
    lapackbench = @benchmark LAPACK.trtrs!('U','N','N', $(A.data), $b) setup = copyto!($b, $bo)
    blasbench = @benchmark BLAS.trsv!('U','N','N', $(A.data), $b) setup = copyto!($b, $bo)
    println("simdbwd/bwd:")
    println(judge(median(bwdsimdipbench), median(bwdipbench)))
    println("simdfwd/fwd:")
    println(judge(median(fwdsimdipbench), median(fwdipbench)))
    println("fwd/bwd:")
    println(judge(median(fwdipbench), median(bwdipbench)))
    println("simdfwd/simdbwd:")
    println(judge(median(fwdsimdipbench), median(bwdsimdipbench)))
    println("simdfwd/lapack:")
    println(judge(median(fwdsimdipbench), median(lapackbench)))
    println("simdfwd/blas:")
    println(judge(median(fwdsimdipbench), median(blasbench)))
    println()
end
resulting in:
10 x 10 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+35.92% => regression)
simdfwd/fwd:
TrialJudgement(+4.04% => invariant)
fwd/bwd:
TrialJudgement(-37.87% => improvement)
simdfwd/simdbwd:
TrialJudgement(-52.44% => improvement)
simdfwd/lapack:
TrialJudgement(-60.84% => improvement)
simdfwd/blas:
TrialJudgement(-25.75% => improvement)
100 x 100 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(-25.98% => improvement)
simdfwd/fwd:
TrialJudgement(+6.56% => regression)
fwd/bwd:
TrialJudgement(-67.31% => improvement)
simdfwd/simdbwd:
TrialJudgement(-52.94% => improvement)
simdfwd/lapack:
TrialJudgement(-53.70% => improvement)
simdfwd/blas:
TrialJudgement(+0.57% => invariant)
1000 x 1000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+2.29% => invariant)
simdfwd/fwd:
TrialJudgement(+2.24% => invariant)
fwd/bwd:
TrialJudgement(-32.52% => improvement)
simdfwd/simdbwd:
TrialJudgement(-32.55% => improvement)
simdfwd/lapack:
TrialJudgement(-63.49% => improvement)
simdfwd/blas:
TrialJudgement(+22.04% => regression)
5000 x 5000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(-2.44% => invariant)
simdfwd/fwd:
TrialJudgement(+0.53% => invariant)
fwd/bwd:
TrialJudgement(-22.59% => improvement)
simdfwd/simdbwd:
TrialJudgement(-20.23% => improvement)
simdfwd/lapack:
TrialJudgement(-49.64% => improvement)
simdfwd/blas:
TrialJudgement(+29.63% => regression)
10000 x 10000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+0.62% => invariant)
simdfwd/fwd:
TrialJudgement(-7.18% => improvement)
fwd/bwd:
TrialJudgement(-17.90% => improvement)
simdfwd/simdbwd:
TrialJudgement(-24.27% => improvement)
simdfwd/lapack:
TrialJudgement(-51.71% => improvement)
simdfwd/blas:
TrialJudgement(+12.75% => regression)
15000 x 15000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(-2.38% => invariant)
simdfwd/fwd:
TrialJudgement(-4.20% => invariant)
fwd/bwd:
TrialJudgement(-15.09% => improvement)
simdfwd/simdbwd:
TrialJudgement(-16.67% => improvement)
simdfwd/lapack:
TrialJudgement(-45.85% => improvement)
simdfwd/blas:
TrialJudgement(+24.55% => regression)
So the current `ldiv!` is at least a factor of two slower than the plain-Julia `@simd` forward-iteration version over the whole range of sizes (small and big). I think this really needs to be changed.

Maybe somebody with MKL can run the same simulation (I should have added `versioninfo` to the previous post, but that was using OpenBLAS).
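(On recent Julia the BLAS/LAPACK backend can be swapped at runtime, so a sketch along these lines would do it; this assumes the MKL.jl package and Julia >= 1.7 with libblastrampoline.)

using MKL                    # load first so subsequent BLAS/LAPACK calls hit MKL
using LinearAlgebra, BenchmarkTools
@show BLAS.get_config()      # confirm MKL is the active backend
n = 1000
A = UpperTriangular(Matrix{Float64}(I, n, n) + randn(n,n)/(2n))
bo = rand(n); b = similar(bo)
bench = @benchmark ldiv!($A, $b) setup = copyto!($b, $bo)  # library-backed triangular solve
println(median(bench))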
In Julia's native method for solving upper triangular systems (`naivesub!`), on line 930 in base/linalg/triangular.jl, it is stated: [quoted code comment not captured]

I decided to test this with the following code: [code block not captured]

I think the results are best illustrated on a plot, showing timings divided by `n^2`, where `y1` is LAPACK, `y2` is `naivesub!`, and `y3` is `naivesub2!`. [plot not captured] So it seems "slightly better" means a whole lot better. This raises two questions: 1) Why was the less performant loop order in `naivesub!` chosen? 2) Why even bother using LAPACK? `naivesub2!` seems to be roughly a factor of 2 faster.

PS: my `versioninfo()`: [output not captured]