Open · Jutho opened this issue 8 years ago
FWIW, adding `@simd` in front of `for i in 1:j-1` seems to give a benefit for larger matrices, although not a very big one. [Plot not captured: y-axis is timing relative to LAPACK.]
I could easily imagine the advantage over LAPACK for small matrices, but for larger ones the difference is quite odd.
FWIW, it appears that LAPACK uses forward iteration.
I guess you mean the disadvantage of LAPACK for small matrices?
My motivation for looking into this was indeed small'ish matrices (e.g. what you need for GMRES, something in the range 10-50). But I was also surprised to find such a difference for large matrices. Maybe somebody with MKL installed can try to benchmark `A_ldiv_B!` in that case?
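(For anyone trying this, a minimal sketch of such a benchmark, using the Julia 0.5-era name `A_ldiv_B!`; with MKL one would rebuild Julia against MKL and rerun the same script.)

using BenchmarkTools
for n in (10, 20, 50)                              # small'ish, GMRES-sized systems
    A = UpperTriangular(eye(n) + randn(n,n)/(2n))  # well-conditioned upper triangular matrix
    b = rand(n)
    bench = @benchmark A_ldiv_B!($A, $(copy(b)))   # in-place, LAPACK-backed solve
    println("n = $n: ", median(bench))
end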
Yes, I meant to write over, not of (I fixed the comment above).
I also saw an extra small improvement using `muladd` on the inner loop (when using a Haswell-compiled sysimg).
I guess we need to do some benchmarking to figure out where LAPACK is slower. Unlike BLAS operations, where you can optimize blocking and cache sizes, in this case you can't really do anything clever, so with some proper `@simd` and `@inbounds` you should have basically the optimal code.
IIRC: My mental model suggested that the performance difference should go the other way. Specifically, the `for i in j-1:-1:1` inner loop order should yield sequential (if 'backwards') access of `A` and `b`, whereas the `for i in 1:j-1` inner loop order should involve size-`j` access jumps of `A` and `b` on each outer loop iteration. (Though that may well be incorrect.) Then on the particular problem / hardware I tested, the counterintuitive performance difference between the inner loop orderings seemed small. So I conjectured the performance difference was a mystery of the hardware/compiler and volatile, left the inner loop in the intuitive order, and moved on. Ref. #14471 and #14475 for background on that commit. Will have a closer look at this later today and report back. Best!
I observe a performance difference similar to that shown above (on the machine I tested with previously, and through the problem size I tested previously, and with tests of that problem size completing in roughly the same time they did previously). Present benchmark:
function fwdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    @inbounds for j in n:-1:1
        A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        xj = x[j] = A.data[j,j] \ b[j]
        for i in 1:(j-1) # counterintuitively 1:j-1 performs slightly better
            b[i] -= A.data[i,j] * xj
        end
    end
    x
end
function bwdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    @inbounds for j in n:-1:1
        A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        xj = x[j] = A.data[j,j] \ b[j]
        for i in j-1:-1:1 # the 'intuitive' sequential-access order
            b[i] -= A.data[i,j] * xj
        end
    end
    x
end
using BenchmarkTools
for n in (10, 100, 1000, 5000)
    A = UpperTriangular(eye(n) + randn(n,n)/(2n))
    bo = rand(n)
    b = similar(bo)
    x = similar(bo)
    copy!(b, bo); fwdipbench = @benchmark fwdnaivesub!($A, $b)
    copy!(b, bo); bwdipbench = @benchmark bwdnaivesub!($A, $b)
    copy!(b, bo); fwdopbench = @benchmark fwdnaivesub!($A, $b, $x)
    copy!(b, bo); bwdopbench = @benchmark bwdnaivesub!($A, $b, $x)
    print("$n x $n UpperTriangular solve, in place, bwd/fwd, ")
    println(judge(median(bwdipbench), median(fwdipbench)))
    print("$n x $n UpperTriangular solve, out of place, bwd/fwd, ")
    println(judge(median(bwdopbench), median(fwdopbench)))
    print("$n x $n UpperTriangular solve, bwd, in place / out of place, ")
    println(judge(median(bwdipbench), median(bwdopbench)))
    println()
end
Results:
10 x 10 UpperTriangular solve, in place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +96.59% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10 x 10 UpperTriangular solve, out of place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +59.29% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10 x 10 UpperTriangular solve, bwd, in place / out of place, BenchmarkTools.TrialJudgement:
  time:   -3.89% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

100 x 100 UpperTriangular solve, in place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +147.21% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
100 x 100 UpperTriangular solve, out of place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +143.52% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
100 x 100 UpperTriangular solve, bwd, in place / out of place, BenchmarkTools.TrialJudgement:
  time:   -0.17% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

1000 x 1000 UpperTriangular solve, in place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +69.07% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
1000 x 1000 UpperTriangular solve, out of place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +68.34% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
1000 x 1000 UpperTriangular solve, bwd, in place / out of place, BenchmarkTools.TrialJudgement:
  time:   -0.23% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

5000 x 5000 UpperTriangular solve, in place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +64.42% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
5000 x 5000 UpperTriangular solve, out of place, bwd/fwd, BenchmarkTools.TrialJudgement:
  time:   +64.63% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
5000 x 5000 UpperTriangular solve, bwd, in place / out of place, BenchmarkTools.TrialJudgement:
  time:   -0.20% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
In contrast to https://github.com/JuliaLang/julia/pull/14475#issue-123723874, the LLVM IR / native code for the two inner loop orderings now differ substantially. The LLVM IR suggests that the forward iteration vectorizes more easily.
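For reference, a minimal way to inspect the generated code for the two orderings is something like the following sketch (Julia 0.5-era, matching the code above; on Julia 0.7+ `@code_llvm` additionally requires `using InteractiveUtils`):

n = 100
A = UpperTriangular(eye(n) + randn(n,n)/(2n))
b = rand(n)
@code_llvm fwdnaivesub!(A, b)   # inner loop should show vectorized (<N x double>) operations
@code_llvm bwdnaivesub!(A, b)   # compare with the scalar loop body generated here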
`@simd` decoration improves performance of both inner loop orderings on this machine (Core i7-3520M), though backward iteration still lags significantly behind forward iteration. Loose working theory: the non-packed nature of the underlying storage renders the intuition I expressed above incorrect (access jumps occur with either ordering), and the compiler has an easier time reasoning about (and consequently generates better code for) the forward iteration. Testing a packed storage format would be interesting. `@simd` decoration benchmark:
function fwdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        for i in 1:(j-1) # counterintuitively 1:j-1 performs slightly better
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
function fwdsimdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        @simd for i in 1:(j-1)
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
function bwdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        for i in j-1:-1:1
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
function bwdsimdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        @simd for i in j-1:-1:1
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
using BenchmarkTools
for n in (10, 100, 1000, 5000, 10000, 15000)
    A = UpperTriangular(eye(n) + randn(n,n)/(2n))
    bo = rand(n)
    b = similar(bo)
    x = similar(bo)
    copy!(b, bo); fwdipbench = @benchmark fwdnaivesub!($A, $b)
    copy!(b, bo); bwdipbench = @benchmark bwdnaivesub!($A, $b)
    copy!(b, bo); fwdsimdipbench = @benchmark fwdsimdnaivesub!($A, $b)
    copy!(b, bo); bwdsimdipbench = @benchmark bwdsimdnaivesub!($A, $b)
    print("$n x $n UpperTriangular solve, in place, simdbwd/bwd, ")
    println(judge(median(bwdsimdipbench), median(bwdipbench)))
    print("$n x $n UpperTriangular solve, in place, simdfwd/fwd, ")
    println(judge(median(fwdsimdipbench), median(fwdipbench)))
    print("$n x $n UpperTriangular solve, in place, fwd/bwd, ")
    println(judge(median(fwdipbench), median(bwdipbench)))
    print("$n x $n UpperTriangular solve, in place, simdfwd/simdbwd, ")
    println(judge(median(fwdsimdipbench), median(bwdsimdipbench)))
    println()
end
Results:
10 x 10 UpperTriangular solve, in place, simdbwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -1.90% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10 x 10 UpperTriangular solve, in place, simdfwd/fwd, BenchmarkTools.TrialJudgement:
  time:   -1.00% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10 x 10 UpperTriangular solve, in place, fwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -36.82% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10 x 10 UpperTriangular solve, in place, simdfwd/simdbwd, BenchmarkTools.TrialJudgement:
  time:   -36.25% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

100 x 100 UpperTriangular solve, in place, simdbwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -5.08% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
100 x 100 UpperTriangular solve, in place, simdfwd/fwd, BenchmarkTools.TrialJudgement:
  time:   -20.46% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
100 x 100 UpperTriangular solve, in place, fwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -59.00% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
100 x 100 UpperTriangular solve, in place, simdfwd/simdbwd, BenchmarkTools.TrialJudgement:
  time:   -65.65% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

1000 x 1000 UpperTriangular solve, in place, simdbwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -22.86% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
1000 x 1000 UpperTriangular solve, in place, simdfwd/fwd, BenchmarkTools.TrialJudgement:
  time:   -11.52% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
1000 x 1000 UpperTriangular solve, in place, fwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -40.99% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
1000 x 1000 UpperTriangular solve, in place, simdfwd/simdbwd, BenchmarkTools.TrialJudgement:
  time:   -32.31% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

5000 x 5000 UpperTriangular solve, in place, simdbwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -25.34% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
5000 x 5000 UpperTriangular solve, in place, simdfwd/fwd, BenchmarkTools.TrialJudgement:
  time:   -4.28% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
5000 x 5000 UpperTriangular solve, in place, fwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -38.40% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
5000 x 5000 UpperTriangular solve, in place, simdfwd/simdbwd, BenchmarkTools.TrialJudgement:
  time:   -21.03% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

10000 x 10000 UpperTriangular solve, in place, simdbwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -25.90% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10000 x 10000 UpperTriangular solve, in place, simdfwd/fwd, BenchmarkTools.TrialJudgement:
  time:   -2.34% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10000 x 10000 UpperTriangular solve, in place, fwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -37.49% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
10000 x 10000 UpperTriangular solve, in place, simdfwd/simdbwd, BenchmarkTools.TrialJudgement:
  time:   -17.61% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

15000 x 15000 UpperTriangular solve, in place, simdbwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -26.47% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
15000 x 15000 UpperTriangular solve, in place, simdfwd/fwd, BenchmarkTools.TrialJudgement:
  time:   -5.25% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
15000 x 15000 UpperTriangular solve, in place, fwd/bwd, BenchmarkTools.TrialJudgement:
  time:   -37.86% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
15000 x 15000 UpperTriangular solve, in place, simdfwd/simdbwd, BenchmarkTools.TrialJudgement:
  time:   -19.93% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
I also don't understand your initial intuition about the jumps in every outer loop. OK, in forward ordering there is a memory jump from the first two statements in the outer loop to the start of the inner loop, but then going from the end of the inner loop to the first statement of the next outer iteration is not really a jump. In the backward iteration there is no jump from the statements in the outer loop to the inner loop, but then there is a jump from the end of the inner loop to the first statement in the next iteration of the outer loop. So both strategies probably incur a comparable memory jump per outer iteration, but the forward iteration is more amenable to other optimizations.
Thinking more carefully via writing:

Forward inner loop iteration, in place

On entering the outer loop, we read `A[j,j]` and read/write `b[j]`. The last-touched entries were `A[j,j+1]` and `b[j]`, so we incur: (1) no jump in `b`; and (2) a length-`N` backward jump through `A` where the storage underlying `A` is not packed, or a length-`j` backward jump through `A` where the storage underlying `A` is packed.

On entering the inner loop, we then read `A[1,j]` and read/write `b[1]`, incurring length-`j-1` backward jumps through both `A` and `b`. We then iterate sequentially through `A` and `b` till `A[j-1,j]` and `b[j-1]`, incurring no additional jumps.

So for each outer loop iteration, forward inner iteration involves one length-`N` (length-`j` packed) jump backward through `A`, one length-`j-1` jump backward through `A`, and one length-`j-1` jump backward through `b`.

Backward inner loop iteration, in place

On entering the outer loop, we read `A[j,j]` and read/write `b[j]`. The last-touched entries were `A[1,j+1]` and `b[1]`, so we incur: (1) a length-`j-1` forward jump through `b`; and (2) a length-`N-j+1` backward jump through `A` where the storage underlying `A` is not packed, or sequential (backward) access through `A` where the storage underlying `A` is packed.

On entering the inner loop, we then read `A[j-1,j]` and read/write `b[j-1]`, continuing sequential (backward) access through both `A` and `b`. We then iterate sequentially (backward) through `A` and `b` till `A[1,j]` and `b[1]`, incurring no jumps.

So for each outer loop iteration, backward inner iteration involves one length-`N-j+1` backward jump through `A` (none for packed underlying storage) and one length-`j-1` forward jump through `b`.

Overall

Overall the jumps through `b` are comparable, and in any case `b` should stay relatively high in the cache hierarchy, such that those jumps are less costly than jumps through `A`. Forward iteration appears to involve two jumps through `A` per outer loop iteration that the machine may have trouble predicting, whereas backward iteration appears to involve only one.
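To make the stride arithmetic above concrete, here is a small sketch (`lin` is just an illustrative helper introduced for this note) computing the column-major linear-index deltas behind the jump lengths claimed above:

lin(i, j, N) = i + (j - 1)*N   # column-major linear index of A[i,j] with stride N
N = 8; j = 5
# Forward inner ordering: last touch was A[j,j+1]; next outer iteration reads A[j,j].
@show lin(j, j, N) - lin(j, j+1, N)   # -N, i.e. a length-N backward jump through A
# Backward inner ordering: last touch was A[1,j+1]; next outer iteration reads A[j,j].
@show lin(j, j, N) - lin(1, j+1, N)   # -(N-j+1), i.e. a length-N-j+1 backward jump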
I imagine I'm making a mistake somewhere above? Thoughts @jutho? Thanks and best!
This looks OK; certainly more careful and less sloppy than my reasoning (I just assumed the access pattern of `A` would be the same as that of `b`, thus ignoring its matrix and triangular character). Then I guess there is no good explanation for the observed timings, other than that Julia/LLVM is able to optimize the forward iteration better.
The intriguing part, however, is still the time difference with LAPACK.

> The intriguing part, however, is still the time difference with LAPACK.
Agreed, that discrepancy is curious.
Would decorating with `@simd` and switching the inner loop order impact your application's performance enough to warrant a PR? If so, I would be happy to prepare such a PR or review one you submit. Best!
I can prepare a PR, but I think the more general question is whether we want to use `naivesub!` over LAPACK always. Currently, `naivesub!` is only used for those `eltype`s not supported by LAPACK, like `BigFloat`. I have not benchmarked, but I assume there is no `@simd` benefit in that case and that the difference between the two loop orders is less significant.
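(A quick way to check would be something like the following sketch, reusing the `fwdnaivesub!`/`bwdnaivesub!` definitions above; `@simd` should be a no-op for a non-primitive eltype like `BigFloat`.)

using BenchmarkTools
n = 100
Abig = UpperTriangular(map(BigFloat, eye(n) + randn(n,n)/(2n)))
bbig = map(BigFloat, rand(n))
fwdbig = @benchmark fwdnaivesub!($Abig, $(copy(bbig)))
bwdbig = @benchmark bwdnaivesub!($Abig, $(copy(bbig)))
println(judge(median(fwdbig), median(bwdbig)))   # fwd vs bwd loop order for BigFloat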
Using `naivesub!` across the board seems reasonable at first blush. Edge cases of concern? Best!
Howdy! I need to solve a lot of triangular systems (mostly dense arrays) and I'm wondering if these benchmarks you guys did (thank you!) are still accurate for the latest Julia. Can we revive this conversation?
> I'm wondering if these benchmarks you guys did (thank you!) are still accurate for the latest Julia. Can we revive this conversation?
Good question! If you have the requisite spare bandwidth, perhaps trial the benchmarks in https://github.com/JuliaLang/julia/issues/18371#issuecomment-246154253 on master? If you make an attempt and run into any challenges, please don't hesitate to ping me on slack for a second pair of eyes! :) Best!
Following @Sacha0's advice, I updated the benchmark code from https://github.com/JuliaLang/julia/issues/18371#issuecomment-246154253 (no change to the functions being benchmarked) as follows:
using LinearAlgebra
function fwdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        for i in 1:(j-1) # counterintuitively 1:j-1 performs slightly better
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
function fwdsimdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        @simd for i in 1:(j-1)
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
function bwdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        for i in j-1:-1:1
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
function bwdsimdnaivesub!(A::UpperTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    for j in n:-1:1
        @inbounds A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
        @inbounds xj = x[j] = A.data[j,j] \ b[j]
        @simd for i in j-1:-1:1
            @inbounds b[i] -= A.data[i,j] * xj
        end
    end
    x
end
using BenchmarkTools
using Statistics
for n in (10, 100, 1000, 5000, 10000, 15000)
    println("$n x $n UpperTriangular solve, in place")
    A = UpperTriangular(Matrix{Float64}(I, n, n) + randn(n,n)/(2n))
    bo = rand(n)
    b = similar(bo)
    x = similar(bo)
    copyto!(b, bo); fwdipbench = @benchmark fwdnaivesub!($A, $b)
    copyto!(b, bo); bwdipbench = @benchmark bwdnaivesub!($A, $b)
    copyto!(b, bo); fwdsimdipbench = @benchmark fwdsimdnaivesub!($A, $b)
    copyto!(b, bo); bwdsimdipbench = @benchmark bwdsimdnaivesub!($A, $b)
    println("simdbwd/bwd:")
    println(judge(median(bwdsimdipbench), median(bwdipbench)))
    println("simdfwd/fwd:")
    println(judge(median(fwdsimdipbench), median(fwdipbench)))
    println("fwd/bwd:")
    println(judge(median(fwdipbench), median(bwdipbench)))
    println("simdfwd/simdbwd:")
    println(judge(median(fwdsimdipbench), median(bwdsimdipbench)))
    println()
end
On:
Julia Version 1.1.0-DEV.115
Commit eb8a9333b0 (2018-08-24 19:12 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, sandybridge)
I get the following results (note the simd vs not-simd results for smaller matrices were highly variable - sometimes +50%, sometimes -50%):
10 x 10 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(-25.32% => improvement)
simdfwd/fwd:
TrialJudgement(+1.03% => invariant)
fwd/bwd:
TrialJudgement(-15.76% => improvement)
simdfwd/simdbwd:
TrialJudgement(+13.95% => regression)
100 x 100 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+20.65% => regression)
simdfwd/fwd:
TrialJudgement(+9.67% => regression)
fwd/bwd:
TrialJudgement(-49.51% => improvement)
simdfwd/simdbwd:
TrialJudgement(-54.11% => improvement)
1000 x 1000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+8.28% => regression)
simdfwd/fwd:
TrialJudgement(+0.57% => invariant)
fwd/bwd:
TrialJudgement(-31.61% => improvement)
simdfwd/simdbwd:
TrialJudgement(-36.48% => improvement)
5000 x 5000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+11.16% => regression)
simdfwd/fwd:
TrialJudgement(+0.08% => invariant)
fwd/bwd:
TrialJudgement(-19.51% => improvement)
simdfwd/simdbwd:
TrialJudgement(-27.54% => improvement)
10000 x 10000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+2.42% => invariant)
simdfwd/fwd:
TrialJudgement(-0.37% => invariant)
fwd/bwd:
TrialJudgement(-20.56% => improvement)
simdfwd/simdbwd:
TrialJudgement(-22.73% => improvement)
15000 x 15000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+1.58% => invariant)
simdfwd/fwd:
TrialJudgement(-0.63% => invariant)
fwd/bwd:
TrialJudgement(-20.46% => improvement)
simdfwd/simdbwd:
TrialJudgement(-22.19% => improvement)
I've decided to add `LAPACK.trtrs!` (which is currently used by `ldiv!`) and `BLAS.trsv!` to the list (because of https://discourse.julialang.org/t/why-lapack-trtrs-not-blas-trsv/14239):
for n in (10, 100, 1000, 5000, 10000, 15000)
    println("$n x $n UpperTriangular solve, in place")
    A = UpperTriangular(Matrix{Float64}(I, n, n) + randn(n,n)/(2n))
    bo = rand(n)
    b = similar(bo)
    x = similar(bo)
    fwdipbench = @benchmark fwdnaivesub!($A, $b) setup = copyto!($b, $bo)
    bwdipbench = @benchmark bwdnaivesub!($A, $b) setup = copyto!($b, $bo)
    fwdsimdipbench = @benchmark fwdsimdnaivesub!($A, $b) setup = copyto!($b, $bo)
    bwdsimdipbench = @benchmark bwdsimdnaivesub!($A, $b) setup = copyto!($b, $bo)
    lapackbench = @benchmark LAPACK.trtrs!('U','N','N', $(A.data), $b) setup = copyto!($b, $bo)
    blasbench = @benchmark BLAS.trsv!('U','N','N', $(A.data), $b) setup = copyto!($b, $bo)
    println("simdbwd/bwd:")
    println(judge(median(bwdsimdipbench), median(bwdipbench)))
    println("simdfwd/fwd:")
    println(judge(median(fwdsimdipbench), median(fwdipbench)))
    println("fwd/bwd:")
    println(judge(median(fwdipbench), median(bwdipbench)))
    println("simdfwd/simdbwd:")
    println(judge(median(fwdsimdipbench), median(bwdsimdipbench)))
    println("simdfwd/lapack:")
    println(judge(median(fwdsimdipbench), median(lapackbench)))
    println("simdfwd/blas:")
    println(judge(median(fwdsimdipbench), median(blasbench)))
    println()
end
resulting in:
10 x 10 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+35.92% => regression)
simdfwd/fwd:
TrialJudgement(+4.04% => invariant)
fwd/bwd:
TrialJudgement(-37.87% => improvement)
simdfwd/simdbwd:
TrialJudgement(-52.44% => improvement)
simdfwd/lapack:
TrialJudgement(-60.84% => improvement)
simdfwd/blas:
TrialJudgement(-25.75% => improvement)
100 x 100 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(-25.98% => improvement)
simdfwd/fwd:
TrialJudgement(+6.56% => regression)
fwd/bwd:
TrialJudgement(-67.31% => improvement)
simdfwd/simdbwd:
TrialJudgement(-52.94% => improvement)
simdfwd/lapack:
TrialJudgement(-53.70% => improvement)
simdfwd/blas:
TrialJudgement(+0.57% => invariant)
1000 x 1000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+2.29% => invariant)
simdfwd/fwd:
TrialJudgement(+2.24% => invariant)
fwd/bwd:
TrialJudgement(-32.52% => improvement)
simdfwd/simdbwd:
TrialJudgement(-32.55% => improvement)
simdfwd/lapack:
TrialJudgement(-63.49% => improvement)
simdfwd/blas:
TrialJudgement(+22.04% => regression)
5000 x 5000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(-2.44% => invariant)
simdfwd/fwd:
TrialJudgement(+0.53% => invariant)
fwd/bwd:
TrialJudgement(-22.59% => improvement)
simdfwd/simdbwd:
TrialJudgement(-20.23% => improvement)
simdfwd/lapack:
TrialJudgement(-49.64% => improvement)
simdfwd/blas:
TrialJudgement(+29.63% => regression)
10000 x 10000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(+0.62% => invariant)
simdfwd/fwd:
TrialJudgement(-7.18% => improvement)
fwd/bwd:
TrialJudgement(-17.90% => improvement)
simdfwd/simdbwd:
TrialJudgement(-24.27% => improvement)
simdfwd/lapack:
TrialJudgement(-51.71% => improvement)
simdfwd/blas:
TrialJudgement(+12.75% => regression)
15000 x 15000 UpperTriangular solve, in place
simdbwd/bwd:
TrialJudgement(-2.38% => invariant)
simdfwd/fwd:
TrialJudgement(-4.20% => invariant)
fwd/bwd:
TrialJudgement(-15.09% => improvement)
simdfwd/simdbwd:
TrialJudgement(-16.67% => improvement)
simdfwd/lapack:
TrialJudgement(-45.85% => improvement)
simdfwd/blas:
TrialJudgement(+24.55% => regression)
So the current `ldiv!` is at least a factor of two slower than the plain-Julia `@simd` forward-iteration version over the whole range of sizes (small and big). I think this really needs to be changed.

Maybe somebody with MKL can run the same simulation (I should have added `versioninfo` to the previous post, but that was using OpenBLAS).
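(On recent Julia the BLAS/LAPACK backend can be swapped at runtime, so a sketch along these lines would do it; this assumes the MKL.jl package and Julia >= 1.7 with libblastrampoline.)

using MKL                    # load first so subsequent BLAS/LAPACK calls hit MKL
using LinearAlgebra, BenchmarkTools
@show BLAS.get_config()      # confirm MKL is the active backend
n = 1000
A = UpperTriangular(Matrix{Float64}(I, n, n) + randn(n,n)/(2n))
bo = rand(n); b = similar(bo)
bench = @benchmark ldiv!($A, $b) setup = copyto!($b, $bo)  # library-backed triangular solve
println(median(bench))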
In Julia's native method for solving upper triangular systems (`naivesub!`), on line 930 in base/linalg/triangular.jl, it is stated: [quoted code comment not captured]

I decided to test this with the following code: [code block not captured]

I think the results are best illustrated on a plot, showing timings divided by `n^2`, where `y1` is LAPACK, `y2` is `naivesub!`, and `y3` is `naivesub2!`. [plot not captured] So it seems "slightly better" means a whole lot better. This raises two questions: 1) Why was the less performant loop order in `naivesub!` chosen? 2) Why even bother using LAPACK? `naivesub2!` seems to be roughly a factor of 2 faster.

PS: my `versioninfo()`: [output not captured]